katholt / srst2

Short Read Sequence Typing for Bacterial Pathogens
Other
125 stars 65 forks source link

Problem with reading pileup using non-recommended versions of samtools & bowtie #24

Closed katholt closed 9 years ago

katholt commented 9 years ago

From: Paola Flórez de Sessions, PhD Genome Institute of Singapore Efficient Rapid Microbial Sequencing (GERMS) platform leader

Hello Kathryn,

I have been attempting to run srst2 on an mrsa dataset. The output files look really great and I would love to have that information for my dataset. However, I seem to have some trouble. While the program appears to run successfully my results.txt files is empty while the .pileup file is populated.

$ bsub -M 1024 -W 560 -n 1 /mnt/software/bin/srst2 --input_pe /mnt/pnsg10_projects/desessions/mrsa/WNB012.h5/HS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001_R1.fastq /mnt/pnsg10_projects/desessions/mrsa/WNB012.h5/HS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001_R2.fastq --forward _R1 --reverse _R2 --output WNB01231PA --gene_db resistance.fas –log

We had to hack the version of samtools and bowtie as our data czars didn’t want to load older versions. Below is the *.pileup file.

$ [desessions@pegasus pileup]$ head WNB012_31PAHS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001.resistance.pileup 347aph(3)-IIIaph(3)-III__4_aph(3)-III_1_M26832_M26832_aminoglycosides 1 A 157 ^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~.^~,^~,^~,^~,^~,^~,^~,^~,^~, BIIBIFI<IIIFIFFIFBBBFB<FFBFIFBFFFBIFIFIFIFIIBIIFIFIB7<IIFFFFFIFIBFFIBFFIII<FIIFIFIIIBIIFIIFFIIIIBIFF7IIFF7FIFIFFBIIFFFIIFIIIFIIIBFFFFFIFIIIIIFIIIFI<FBFFIIFFF 347aph(3)-IIIaph(3)-III4_aph(3)-III_1_M26832_M26832_aminoglycosides 2 T 158 ...................................................................................................................................................,,,,,,,,,^~.^~. BIIBIFIIIFFIFBIFBFFIBFIF<FIFF<FFBFIIFIFIFIIBIIFIIIFBFIIFIIFIIFIFBFBFFFIII<FIFFIIIFIBIIFIIFFIIIIBIFFBIIIFFFIFIFFFIIBFFIIFIIIFFIIBFFFIBIIIIIIIIIIIFIFFBBBIIFFB<B 347aph(3)-IIIaph(3)-III4_aph(3)-III_1_M26832_M26832_aminoglycosides 3 G 161 ...................................................................................................................................................,,,,,,,,,..^~.^~,^~, FIBFIFFBIFIFIFFIFFFFIFIFFFIIFFBFFFIIIFIIIFFIFIIFIFIF<FIIIIIFI<FIFFFIFFFIIIFIIFIIIIIFIIFIIFFIIIIBIFF<IIFIFBIIIFFFIIFFBIIFIIIFFIIFFFFF<IBIIIIIIIIIFIFFBFFIIFFFBBB<< 347aph(3)-IIIaph(3)-III__4_aph(3)-III_1_M26832_M26832_aminoglycosides 4 G 159 ...............................................................................................................................................,,,,,,,,,...,,^~.^~. FIBFBIIBIFIFIF<IFFFFIFIIFFIIFFFIIFBIIFIIIBIF<IIIIFIFFIIIIIFIFBI<FBFFFIIII<FIIIIFIIFIIBIFFFIIIIFFIFIIIIIFIIIIFFIIFFIIFIIIFIIIFFBBIIFIIIIIIIIIFIFFBFIIIFFFBBB<<BB 347aph(3)-IIIaph(3)-III4_aph(3)-III_1_M26832_M26832_aminoglycosides 5 C 168 ................................................................................................................................................,,,,,,,,,...,,..^~.^~.^~.^~.^~.^~.^~.^~. FIFFFIIIFIFIFFIFF7FIFIFFFFIFFFFBFBIIFIIIFIIBIIFIFIFFIIIIIIIB<IIF<FFIIII7FIIIIIIIBII7IIIFFIIIFIIF7IIIIIFIIIIFFIIBIFIIFIIIIFIIFFF<IBIFIIIIIIIIIFFIFBFIIIFFBFFBB<BBBBBBBBBB 347aph(3)-IIIaph(3)-III4_aph(3)-III_1_M26832_M26832_aminoglycosides 6 T 177 ..................................................................................................................................................,,,,,,,,,...,,..........^~.^~.^~.^~.^~.^~.^~. FIIFIII<IIIFIIFIFBFFIFIIFFFIFF7FFFBIIFIIIIFIBIIFIFIFFIIIIIIIB<I<IIFFIFIIIIIIBIIIIIFIIFIIIFFIIIFIII<IIFIIFIIIFFFIIFIIIIFIIIIIIIFBFBIBIBIIIIIIIIIFIFFFFIIIFFFFFFBBBBBBBBBBBBBBBBBBB 347aph(3)-IIIaph(3)-III__4_aph(3)-III_1_M26832_M26832_aminoglycosides 7 A 183 ....................................................................................................................................................,,,,,,,,,...,,.................^~.^~.^~.^~, FIIFFIIBIIIFIIBIFFFFFFIIFFFIFFFFIFFIIFIIIFIIFIIIIFIF7FIIIIIIIBFIBFIFFIIIIIBIII<IIIIIFIFFIFIIIIIIFFII<IIFFFFIIIIFFII7IIIIIIIIIIIIFFFFI<IFIIIIIIIIIFIIFBBFIIFFIFFFBBFFBBBBBBBBBBBBBBB???? 347aph(3)-IIIaph(3)-III4_aph(3)-III_1_M26832_M26832_aminoglycosides 8 A 184 ....................................................................................................................................................,,,,,,,,,...,,....................,^~. FIIFIIIFIIIIIIFIFFFFFFIIFFBIIFFIFFFIIFIIIFII7IIIIIIF<FIIIIIFIFFIFFIFFIFIIIBIIIFIIIII<IIFIIIIIIIIFFII<IIIIIFIIIIFFIFFIFIIIIIIIIIIFFIFIFIFIIIIIIIIIFIFFFFFIIFFIFFFF<FFFFFFFFFFBB<BBBBAAAA; 347aph(3)-IIIaph(3)-III__4_aph(3)-III_1_M26832_M26832_aminoglycosides 9 A 189 ....................................................................................................................................................,,,,,,,,,...,,....................,.^~.^~.^~.^~.^~, FIIFFIIBIIIIIIIIIFFFFFIFIFFIIFFFFFFIIIIIIIIIBIIIIIIFBIIIIIIIIBFIBBI<BIIIIIBIIIFIIIIIFIIIIIIIIIIIFFIIBIIFIFFIIIFFFIFFIFII<IIIIFIFFBFFIFIBIIIIIFIIIBIFFFBFIIIFFBFFF7FFFBFFFFFFFFFFFFFBBBB;===== 347aph(3)-IIIaph(3)-III__4_aph(3)-III_1_M26832_M26832_aminoglycosides 10 A 189 ..................................................................................................................................................,,,,,,,,,...,....................,.....,^~.^~.^~, FIIFIIIBIFIIII<IIFIFIFIIIFFIIFBBFFFFIIIIFFIIBIIIIIIFBIIIIFIFIBFIBFIBFIIIIIBIIIBIIIIFFIFIIFIIIIIIFFIIBIFIFIIIIFIFIFIFIIBIIIIFIIB>>>>>BBE

Then when I check the results it looks like this: $ [desessions@apollo1 Dec_1st_results]$ cat WNB012_31PAgenesresistance__results.txt Sample HS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001

The default minimum coverage remains 90% so in theory I should have results, right? Do you have any idea what could be happening?

Ps I have attached the resistance.fas in case you want to have a look.

Best,

Paola Flórez de Sessions, PhD Genome Institute of Singapore Efficient Rapid Microbial Sequencing (GERMS) platform leader

katholt commented 9 years ago

Hi Paola,

I have posted this to the Issues section on the SRST2 github. Could you please reply there? https://github.com/katholt/srst2/issues/24

There are no immediately obvious problems with that pileup file. But the problem could be due to you using incompatible versions of samtools and bowtie2. I strongly suggest you work with the tried and tested versions (samtools 0.1.18 and bowtie2 2.1.0) rather than newly released versions that have not been tested with srst2.

Have you looked to see if anything appears in the fullgenes output file? e.g. WNB012_31PAfullgenesresistance__results.txt

What does the log file show?

Kat

paolaflorez commented 9 years ago

Hello Kathryn,

First of all thank you for your quick response. Below is my log file: [desessions@apollo1 log_files]$ cat WNB012_31PA.log 12/01/2014 16:06:36 program started 12/01/2014 16:06:36 command line: /mnt/software/bin/srst2 --input_pe /mnt/pnsg10_projects/desessions/mrsa/WNB012.h5/HS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001_R1.fastq /mnt/pnsg10_projects/desessions/mrsa/WNB012.h5/HS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001_R2.fastq --forward _R1 --reverse _R2 --output WNB012_31PA --gene_db resistance.fas --log 12/01/2014 16:06:36 Total paired readsets found:1 12/01/2014 16:06:41 Index for resistance.fas is already built... 12/01/2014 16:06:42 Processing database resistance.fas 12/01/2014 16:06:42 Non-unique:375, spc2 12/01/2014 16:06:42 Non-unique:183, sul3 12/01/2014 16:06:42 Non-unique:183, sul3 12/01/2014 16:06:42 Processing sample HS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001 12/01/2014 16:06:42 Output prefix set to: WNB012_31PAHS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001.resistance 12/01/2014 16:06:42 Aligning reads to index resistance.fas using bowtie2... 12/01/2014 16:06:42 Running: bowtie2 -1 /mnt/pnsg10_projects/desessions/mrsa/WNB012.h5/HS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001_R1.fastq -2 /mnt/pnsg10_projects/desessions/mrsa/WNB012.h5/HS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001_R2.fastq -S WNB012_31PAHS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001.resistance.sam -q --very-sensitive-local --no-unal -a -x resistance.fas 12/01/2014 16:21:38 Processing Bowtie2 output with SAMtools... 12/01/2014 16:21:38 Generate and sort BAM file... 12/01/2014 16:21:38 Running: samtools view -b -o WNB012_31PAHS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001.resistance.unsorted.bam -q 1 -S WNB012_31PAHS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001.resistance.sam.mod 12/01/2014 16:22:16 Running: samtools sort WNB012_31PAHS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001.resistance.unsorted.bam WNB012_31PAHS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001.resistance.sorted 12/01/2014 16:22:30 Deleting sam and bam files that are not longer needed... 12/01/2014 16:22:30 Deleting WNB012_31PAHS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001.resistance.sam 12/01/2014 16:22:30 Deleting WNB012_31PAHS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001.resistance.sam.mod 12/01/2014 16:22:30 Deleting WNB012_31PAHS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001.resistance.unsorted.bam 12/01/2014 16:22:30 Generate pileup... 12/01/2014 16:22:30 Running: samtools mpileup -L 1000 -f resistance.fas -Q 20 -q 1 WNB012_31PAHS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001.resistance.sorted.bam 12/01/2014 16:22:56 Processing SAMtools pileup... 12/01/2014 16:23:18 Scoring alleles... 12/01/2014 16:23:18 Finished processing for read set HS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001 ... 12/01/2014 16:23:18 Tabulating results for database resistance.fas ... 12/01/2014 16:23:18 Finished processing for database resistance.fas ... 12/01/2014 16:23:18 Gene detection output printed to WNB012_31PAgenesresistance__results.txt 12/01/2014 16:23:18 SRST2 has finished.

Our load sharing facility also throws up a message when the job is done; below is the message for this particular sample:

Job </mnt/software/bin/srst2 --input_pe /mnt/pnsg10projects/desessions/mrsa/WNB012.h5/HS001-PE-R00149_R1.fastq /mnt/pnsg10projects/desessions/mrsa/WNB012.h5/HS001-PE-R00149_R2.fastq --forward _R1 --reverse _R2 --output WNB012_31PA --gene_db resistance.fas --log > was submitted from host by user in cluster . Job was executed on host(s) , in queue , as user in cluster . </home/desessions> was used as the home directory. </mnt/pnsg10_projects/desessions/mrsa/SRST2> was used as the working directory. Started at Mon Dec 1 16:03:16 2014 Results reported at Mon Dec 1 16:23:20 2014

Your job looked like:


LSBATCH: User input

/mnt/software/bin/srst2 --input_pe /mnt/pnsg10_projects/desessions/mrsa/WNB012.h5/HS001-PE-R00149_R1.fastq /mnt/pnsg10projects/desessions/mrsa/WNB012.h5/HS001-PE-R00149R2.fastq --forward _R1 --reverse _R2 --output WNB012_31PA --gene_db resistance.fas --log

Successfully completed.

Resource usage summary:

CPU time   :    664.37 sec.
Max Memory :       144 MB
Max Swap   :       718 MB

Max Processes  :         5
Max Threads    :         6

The output (if any) follows:

8041332 reads; of these: 8041332 (100.00%) were paired; of these: 8002262 (99.51%) aligned concordantly 0 times 14289 (0.18%) aligned concordantly exactly 1 time 24781 (0.31%) aligned concordantly >1 times

8002262 pairs aligned concordantly 0 times; of these:
  128 (0.00%) aligned discordantly 1 time
----
8002134 pairs aligned 0 times concordantly or discordantly; of these:
  16004268 mates make up the pairs; of these:
    15997993 (99.96%) aligned 0 times
    2963 (0.02%) aligned exactly 1 time
    3312 (0.02%) aligned >1 times

0.53% overall alignment rate [samopen] SAM header is present: 1913 sequences. [mpileup] 1 samples in 1 input files

Set max per-file depth to 8000 Starting mapping with bowtie2 0.53% alignment is about what you would expect, right? Some were as high as 2%. My assembly stats looked like this: SampleID PairedReads ReadLength GenomeSize Coverage NumContigs Avg_len Max_len N50 NumN50 N90 NumN90 ClosestRef ./all_summary.txt//WNB012_31PA 8,041,332 75 3022966 199 94 32159 280371 87120 10 22580 35 /mnt/genomeDB/ncbi/genomes/Bacteria/Carnobacterium_maltaromaticum_LMA28_uid179370/NC_019425.fna Best, Paola
katholt commented 9 years ago

Hi Paola, it looks like a reasonable amount of reads to be aligned to resistance genes. I'm not sure what the problem is, can you show me the fullgenes table?

paolaflorez commented 9 years ago

Hello!

I'm not too sure what you mean by the fullgenes table.

This is what my directory looks like: [desessions@apollo1 SRST2]$ ls arcc.tfa glpf.tfa resistance.fas resistance.fas.3.bt2 resistance.fas.rev.1.bt2 SRST2_bash.txt Staphylococcus_aureus.fasta.2.bt2 Staphylococcusaureus.fasta.fai tpi.tfa aroe.tfa gmk_.tfa resistance.fas.1.bt2 resistance.fas.4.bt2 resistance.fas.rev.2.bt2 Staphylococcus_aureus.fasta Staphylococcus_aureus.fasta.3.bt2 Staphylococcus_aureus.fasta.rev.1.bt2 yqil.tfa Dec_1stresults pta.tfa resistance.fas.2.bt2 resistance.fas.fai saureus.txt Staphylococcus_aureus.fasta.1.bt2 Staphylococcus_aureus.fasta.4.bt2 Staphylococcus_aureus.fasta.rev.2.bt2 Would you like to see any of those files?

Inside Dec 1st results: [desessions@apollo1 Dec_1st_results]$ ls bam_sam log_files pileup WSB2082_27Ngenesresistanceresults.txt WSB2141_60PAgenesresistance__results.txt WSB2200_113Tgenesresistanceresults.txt WSB2259_265Ngenesresistance__results.txt ... and all 288 results. I have attached the results for the one example we are looking at so you can take a look.

BTW Elita Jauneikaite was my student and she was the one that referred me to your program. She is defending her thesis this Friday and says hello!

Best, Paola

On Wed, Dec 3, 2014 at 11:17 AM, Kat Holt notifications@github.com wrote:

Hi Paola, it looks like a reasonable amount of reads to be aligned to resistance genes. I'm not sure what the problem is, can you show me the fullgenes table?

— Reply to this email directly or view it on GitHub.

Paola Flórez de Sessions, PhD _G_enome Institute of Singapore _E_fficient _R_apid _M_icrobial _S_equencing (GERMS) platform leader Sample HS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001

katholt commented 9 years ago

SRST2 also outputs a fullgenes table, with file names like: WSB2082_27Nfullgenesresistance__results.txt

This happens by default in v0.1.4, unless you turn it OFF using --no_gene_details.

If you haven't switched it off but no fullgenes file is being created, it means that no scores are being returned. This could be a problem with parsing the pileup file... are you getting results OK when you do a MLST run?

paolaflorez commented 9 years ago

Hi, I see, no I didn't turn off the full genes table and yet it was definitely not generated. I will try to convenience out data czars to allow me to instal a local version of samtools and bowtie in my home directory.

I did try the MLST run and it was not successful, see the error message below from our load sharing facility. It seemed to complain that the --gene_db Staphylococcus_aureus.fasta was in the wrong format.

Job </mnt/software/bin/srst2 --input_pe /mnt/pnsg10_projects/desessions/mrsa/WNB012.h5/HS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001_R1.fastq /mnt/pnsg10_projects/desessions/mrsa/WNB012.h5/HS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001_R2.fastq --forward _R1 --reverse _R2 --output WNB01231PA --gene_db Staphylococcus_aureus.fasta --log> was submitted from host by user in cluster .

Job was executed on host(s) , in queue , as user in cluster .

</home/desessions> was used as the home directory.

</mnt/pnsg10_projects/desessions/mrsa/SRST2> was used as the working directory.

Started at Fri Nov 28 11:32:23 2014

Results reported at Fri Nov 28 12:19:47 2014

Your job looked like:


LSBATCH: User input

/mnt/software/bin/srst2 --input_pe /mnt/pnsg10_projects/desessions/mrsa/WNB012.h5/HS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001_R1.fastq /mnt/pnsg10_projects/desessions/mrsa/WNB012.h5/HS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001_R2.fastq --forward _R1 --reverse _R2 --output WNB01231PA --gene_db Staphylococcus_aureus.fasta --log


Exited with exit code 1.

Resource usage summary:

CPU time   :   2901.41 sec.

Max Memory :       845 MB

Max Swap   :      1356 MB

Max Processes  :         5

Max Threads    :         6

The output (if any) follows:

8041332 reads; of these:

8041332 (100.00%) were paired; of these:

8032673 (99.89%) aligned concordantly 0 times

13 (0.00%) aligned concordantly exactly 1 time

8646 (0.11%) aligned concordantly >1 times

----

8032673 pairs aligned concordantly 0 times; of these:

  0 (0.00%) aligned discordantly 1 time

----

8032673 pairs aligned 0 times concordantly or discordantly; of these:

  16065346 mates make up the pairs; of these:

    16061335 (99.98%) aligned 0 times

    14 (0.00%) aligned exactly 1 time

    3997 (0.02%) aligned >1 times

0.13% overall alignment rate

[samopen] SAM header is present: 2395 sequences.

[bam_sort_core] merging from 3 files...

[mpileup] 1 samples in 1 input files

Set max per-file depth to 8000 Starting mapping with bowtie2 Traceback (most recent call last): File "/mnt/software/bin/srst2", line 9, in ``` load_entry_point('srst2==0.1.4', 'console_scripts', 'srst2')() ``` File "/mnt/software/lib/python2.7/site-packages/srst2/srst2.py", line 1504, in main ``` db_reports, db_results = run_srst2(args,fileSets,args.gene_db,"genes") ``` File "/mnt/software/lib/python2.7/site-packages/srst2/srst2.py", line 1073, in run_srst2 ``` db_reports, db_results_list = process_fasta_db(args, fileSets, run_type, db_reports, db_results_list, fasta) ``` File "/mnt/software/lib/python2.7/site-packages/srst2/srst2.py", line 1135, in process_fasta_db ``` unique_gene_symbols, unique_allele_symbols,run_type,ST_db,results,gene_list,db_report,cluster_symbols,max_mismatch) ``` File "/mnt/software/lib/python2.7/site-packages/srst2/srst2.py", line 1245, in map_fileSet_to_db ``` unique_gene_symbols, unique_allele_symbols, pileup_file) ``` File "/mnt/software/lib/python2.7/site-packages/srst2/srst2.py", line 759, in parse_scores ``` gene_name = get_allele_name_from_db(allele,unique_allele_symbols,unique_cluster_symbols,run_type,args)[2] # cluster ID ``` File "/mnt/software/lib/python2.7/site-packages/srst2/srst2.py", line 721, in get_allele_name_from_db ``` cluster_id = gene_name = allele_name = seqid = allele_parts[0] ``` IndexError: list index out of range Best, Paola
paolaflorez commented 9 years ago

Hello again,

Do you know if the software comes with a test data-set on which we can simply confirm that there's no error on our site?

paolaflorez commented 9 years ago

Hi Kat,

Thanks for all your help. I have finally sorted the results out for SRST2. I followed the updated example data as you suggested and it worked. I think the only big difference is that initially I was using a resistance.txt rather than the file called ARGannot.fasta. To summarize SRST2 works with the latest version of Samtools and bow tie. (I can get those exact version numbers if you are interested). As a bonus I went ahead and ran the MLST option too so now I have more data than I know what do with. This is a good problem. This was actually a really good exercise for me since I'm a biologist by training your program is really powerful and congrats again on your publication. We will be sure to site it if we ever get to publication on our side.

All the best!