katholt / srst2

Short Read Sequence Typing for Bacterial Pathogens

Sample column blank in output__mlst__XYZ__results.txt #37

Closed complexgenome closed 9 years ago

complexgenome commented 9 years ago

Hello SRST2 Team, I'm running SRST2 on Illumina reads.

I'm submitting my reads for sequence typing with the command below:

sbatch -p 128gb --nodes=1 -J $i --time 12:0:0 -o $i".%j.out" -e $i".%j.stderr" ~/srst2-master/scripts/srst2.py --input_pe $forward_read $reverse_read --forward $forward_prefix --reverse $reverse_prefix --output $temp_dir/$i --log --save_scores --mlst_db Staphylococcus_aureus.fasta --mlst_definitions saureus.txt --mlst_delimiter '-'

$i is the sample/isolate name. In --output $temp_dir/$i, $temp_dir is the output directory and $i is the output file prefix, i.e. the sample name.

I get output like the following, where the Sample column is blank:

Sample ST arcc aroe glpf gmk_ pta_ tpi_ yqil mismatches uncertainty depth maxMAF

    121     6       5       6       2       7       14      5       0       -       132.005428571   0.0512820512821

However, when I run on the ERR024070 data, as in the example.txt bundled with the package, I get the expected output, with a value in the Sample column.

I've tried multiple times, but it seems output won't listen to me.

Kindly advise.

Many thanks.

katholt commented 9 years ago

This is probably because you are specifying a subdirectory within your output prefix (--output $temp_dir/$i).

The argument --output is not intended to allow you to specify a subdirectory in which to write SRST2 output, and this probably won't work. The solution would be to implement explicit handling of subdirectories, so that we do things like check that the directory exists; if not then create it; then write all output files to this (except the log? or including the log?). This is worth doing but we won't get to it for a while unfortunately.

In the meantime, the thing to do is forget about using temp directories and just allow SRST2 to write to the location where it is run; ie just run with --output $i. I would just run everything in one dir, i.e. leave your code as is and drop the $temp_dir/. All files output by SRST2 for sample $i will then be named with the prefix $i__. You can sort them out into individual directories if you really want to.
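If you do want the files sorted afterwards, something along these lines might work. This is only a rough sketch: sort_outputs_by_sample is a made-up helper name, not part of SRST2, and it assumes every output file for a sample starts with the sample name followed by a double underscore, as described above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of SRST2): move every file in the current
# directory whose name starts with "<sample>__" into a directory named
# after the sample. Assumes the "<sample>__" prefix convention described above.
sort_outputs_by_sample() {
    local sample="$1"
    local dest_root="${2:-.}"   # optional parent directory, defaults to "."
    local f
    mkdir -p "$dest_root/$sample" || return 1
    for f in "${sample}"__*; do
        # Skip the unexpanded pattern (no matches) and anything that is not a file.
        [ -f "$f" ] || continue
        mv -- "$f" "$dest_root/$sample/"
    done
}
```

Run from the directory SRST2 wrote to, sort_outputs_by_sample "$i" would leave files for other samples untouched.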

An alternative is to handle creating and writing to directories in your bash script, i.e. mkdir $temp_dir, cd $temp_dir, srst2 --output $i [etc.]
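The mkdir/cd approach could be wrapped up roughly like this. It is a sketch, not something shipped with SRST2: run_in_dir is a hypothetical helper name, and the commented usage assumes srst2.py is on your PATH. Note that once the script changes directory, any relative paths to reads or database files no longer resolve, so those would need to be absolute.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of SRST2): create a directory if needed,
# then run the given command from inside it.
run_in_dir() {
    local out_dir="$1"
    shift
    mkdir -p "$out_dir" || return 1
    # Use a subshell so the caller's working directory is left unchanged.
    ( cd "$out_dir" && "$@" )
}

# Example usage (assumes srst2.py is on PATH and that the read and database
# variables hold absolute paths; $mlst_db and $mlst_defs are placeholder names):
# run_in_dir "$temp_dir/$i" srst2.py --input_pe "$forward_read" "$reverse_read" \
#     --forward "$forward_prefix" --reverse "$reverse_prefix" \
#     --output "$i" --log --save_scores \
#     --mlst_db "$mlst_db" --mlst_definitions "$mlst_defs" --mlst_delimiter '-'
```

With this, --output stays a plain prefix ($i) and the directory handling lives entirely in the wrapper.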

complexgenome commented 9 years ago

Hi Dr. Kat, Thank you for your reply to my query. I removed the output path as suggested and tried again, but still didn't get the desired output. Finally, with your guidance on issue 36 (https://github.com/katholt/srst2/issues/36), I experimented with the way I was running the command, and I now get the sample name in my output.

I was running it incorrectly.

Incorrect way:

python srst2.py --input_pe ../../sample_isolates_test/111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R1_001.fastq.gz ../../sample_isolates_test/111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R2_001.fastq.gz --forward 111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R1_001 --reverse 111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R2_001 --output _check_dir_ --log --save_scores --mlst_db Klebsiella_pneumoniae.fasta --mlst_definitions kpneumoniae.txt --mlst_delimiter '_'

Correct way:

python srst2.py --input_pe ../../sample_isolates_test/111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R1_001.fastq.gz ../../sample_isolates_test/111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R2_001.fastq.gz --forward _R1_001 --reverse _R2_001 --output _dir_ --log --save_scores --mlst_db Klebsiella_pneumoniae.fasta --mlst_definitions kpneumoniae.txt --mlst_delimiter '_'

The sample name isn't picked up if the full read file names (minus the extension) are passed to --forward and --reverse, as I first tried.

Passing only the suffixes, --forward _R1_001 and --reverse _R2_001, was crucial to getting the required output. Many thanks.
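For anyone hitting the same thing, here is a rough illustration of why the full file name gives a blank Sample. It is my own sketch of suffix stripping, consistent with the behaviour seen in this thread, and is not SRST2's actual code; sample_name_from_read is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical illustration (not SRST2's code): derive a sample name by
# removing the directory, the .fastq/.fq(.gz) extension, and then the
# read-direction suffix from a read file path.
sample_name_from_read() {
    local path="$1"
    local suffix="$2"
    local base="${path##*/}"    # drop leading directories
    base="${base%.gz}"          # drop compression extension, if present
    base="${base%.fastq}"
    base="${base%.fq}"
    case "$base" in
        *"$suffix")
            # Whatever precedes the suffix is taken as the sample name.
            local stripped=${base%"$suffix"}
            printf '%s\n' "$stripped"
            ;;
        *)
            return 1            # suffix does not match this file name
            ;;
    esac
}
```

Under this assumption, a suffix of _R1_001 leaves everything before it as the sample name, whereas passing the whole file name as the suffix strips the entire name and leaves an empty string, which would match the blank Sample column.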