katholt / srst2

Short Read Sequence Typing for Bacterial Pathogens

Allow forward and reverse flags as file name #36

Closed by complexgenome 9 years ago

complexgenome commented 9 years ago

Hi SRST2 team,

Your tool is amazing: very neat and fast. I'm currently using SRST2 on my paired-end Illumina data.

My files are named like this (example):

111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R1_001.fastq.gz

They were not accepted when passed as:

--input_pe ../../sample_isolates_test/111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005*.fastq.gz

This is probably because the file naming expected by srst2.py is different.

I was able to run it as:

python srst2.py --input_pe ../../sample_isolates_test/111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R1_001.fastq.gz ../../sample_isolates_test/111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R2_001.fastq.gz --forward 111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R1_001 --reverse 111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R2_001

That was manageable because I'm only experimenting with the tool and the data, so I can supply the prefixes manually. I'm afraid things might go wrong when I have to provide prefixes this way for ~1000 isolates.

Can something be done in the script to avoid this? It seems redundant to provide the paired-end flag, the read 1 file name, the read 2 file name, and prefixes for the forward and reverse reads.

Something like:

python srst2.py --input_pe --forward ../../sample_isolates_test/111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R1_001.fastq.gz --reverse ../../sample_isolates_test/111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R2_001.fastq.gz

katholt commented 9 years ago

Hi Sareej,

The correct way to handle your reads would be:

--forward _R1_001 --reverse _R2_001 --input_pe ../../sample_isolates_test/111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R1_001.fastq.gz ../../sample_isolates_test/111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R2_001.fastq.gz

See the example in the readme: https://github.com/katholt/srst2#paired-reads

In answer to your other question, we don't currently allow you to specify 'this is my forward read file, this is my second read file', because we have designed SRST2 to be able to accept multiple forward and reverse reads for different samples in the one command. Hence SRST2 tries to parse your set of input files and sort them into the right pairs.

So if you have lots of read sets (e.g. it looks like you are using barcoding on Illumina, so you will have lots from one run), you can simply do:

--forward _R1_001 --reverse _R2_001 --input_pe *L005_R1_001.fastq.gz *L005_R2_001.fastq.gz

and SRST2 will sort them out into pairs, run individually on each readset, and tabulate the results for you.
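
To illustrate the idea of sorting a mixed list of read files into pairs by suffix, here is a minimal Python sketch. It is not SRST2's actual code: the function name `pair_reads`, the extension handling, and the rule of dropping unpaired files are all assumptions made for illustration. Only the suffix values `_R1_001` and `_R2_001` come from the commands above.

```python
import os
from collections import defaultdict


def pair_reads(paths, forward="_R1_001", reverse="_R2_001"):
    """Group read files into (forward, reverse) pairs keyed by sample name.

    Hypothetical helper for illustration only; not part of SRST2.
    """
    found = defaultdict(dict)
    for path in paths:
        name = os.path.basename(path)
        # Strip the compression and FASTQ extensions, if present.
        for ext in (".gz", ".fastq", ".fq"):
            if name.endswith(ext):
                name = name[: -len(ext)]
        # Whatever precedes the forward/reverse suffix is treated as the sample name.
        if name.endswith(forward):
            found[name[: -len(forward)]]["forward"] = path
        elif name.endswith(reverse):
            found[name[: -len(reverse)]]["reverse"] = path
    # Keep only samples for which both mates were found (an assumption of this sketch).
    return {
        sample: (mates["forward"], mates["reverse"])
        for sample, mates in found.items()
        if len(mates) == 2
    }


if __name__ == "__main__":
    files = [
        "reads/111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R1_001.fastq.gz",
        "reads/111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R2_001.fastq.gz",
    ]
    for sample, (fwd, rev) in pair_reads(files).items():
        print(sample, fwd, rev)
```

In this sketch the suffix must be the part of the file name immediately before the extension, which is why passing `_R1_001` and `_R2_001` works where passing the whole file name as the suffix would leave an empty sample name.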

complexgenome commented 9 years ago

Appreciate your help. That worked.

Thanks much!