jordanlab / stringMLST

Fast k-mer based tool for multi locus sequence typing (MLST)

some samples came back with traceback error or empty result #41

Closed florathecat closed 5 years ago

florathecat commented 5 years ago

Hi,

I just installed stringMLST through bioconda on my VM (Ubuntu 18.04 LTS on a Windows 7 host) and ran it on a bunch of samples using the E. coli #1 MLST scheme. About 80% of the samples gave an ST and 3.5% returned 0, which was expected.

However, for the remaining 16.5%, some returned an empty list and some gave a traceback error message. All of these problem samples were downloaded from NCBI. I had no problem running them through QC inspection or reference sequence alignment (bowtie2/samtools, etc.), so I assumed the files were not corrupted. Nevertheless, I did notice that most of the samples that returned an empty list were submitted from one source and contained contaminating reads from another serotype. I didn't observe any pattern for the ones that gave me a traceback error, other than that the fastq.gz files might be on the small side (<50 MB each). I was wondering if there's an explanation and, better yet, a fix for these samples.

Please see the examples below (Ec is the prefix I gave the E. coli MLST scheme 1 when I created the database):

ST = stringMLST.py --predict -1 DRR015930_1.fastq.gz -2 DRR015930_2.fastq.gz -P Ec
print(ST)

[]

ST = stringMLST.py --predict -1 ERR1777574_1.fastq.gz -2 ERR1777574_2.fastq.gz -P Ec
print(ST)

['Traceback (most recent call last):', ' File "/home/florathecat/anaconda3/bin/stringMLST.py", line 1605, in <module>', ' results = singleSampleTool(fastq1, fastq2, paired, k, results)', ' File "/home/florathecat/anaconda3/bin/stringMLST.py", line 399, in singleSampleTool', ' singleFileTool(fastq1, k)', ' File "/home/florathecat/anaconda3/bin/stringMLST.py", line 452, in singleFileTool', ' fileExplorer(fastq, k, non_overlapping_window)', ' File "/home/florathecat/anaconda3/bin/stringMLST.py", line 468, in fileExplorer', ' lines = f.readlines()', ' File "/home/florathecat/anaconda3/lib/python3.6/gzip.py", line 289, in read1', ' return self._buffer.read1(size)', ' File "/home/florathecat/anaconda3/lib/python3.6/_compression.py", line 68, in readinto', ' data = self.read(len(byte_view))', ' File "/home/florathecat/anaconda3/lib/python3.6/gzip.py", line 482, in read', ' raise EOFError("Compressed file ended before the "', 'EOFError: Compressed file ended before the end-of-stream marker was reached']

Thanks for your time.

ar0ch commented 5 years ago

Hi Flora,

I think the traceback is probably telling a lot of the story here:

"/home/florathecat/anaconda3/lib/python3.6/gzip.py", line 482, in read', ' raise EOFError("Compressed file ended before the "', 'EOFError: Compressed file ended before the end-of-stream marker was reached']

Many file handling utilities, including programs like FastQC and bowtie, will happily process gzipped files that are missing the end-of-stream marker. As long as the truncation falls on a read boundary (i.e. the line count is still a multiple of 4), these tools won't even throw an error or warning. Python's gzip module, however, raises an EOFError when it hits a truncated gzipped file.
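A quick way to check a download before running a prediction is to decompress the whole file and see whether Python complains. The following is only a minimal sketch, not part of stringMLST: the function name `gzip_is_complete` and the chunk size are made up for illustration.

```python
import gzip
import zlib


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if the gzip file at `path` decompresses to the end.

    Hypothetical helper for illustration; not part of stringMLST.
    """
    try:
        with gzip.open(path, "rb") as handle:
            # Read and discard the data in chunks so large FASTQ files
            # are never held in memory all at once.
            while handle.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # EOFError: stream ended before the end-of-stream marker.
        # OSError / zlib.error: not a gzip file, or corrupted data.
        return False
    return True
```

For example, `gzip_is_complete("ERR1777574_1.fastq.gz")` would be expected to return False for a truncated download like the one in the traceback above.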

If you take a look at DRR015930 over in the SRA Run Browser, you'll see the files should be ~3 GB combined while compressed and ~15 GB when uncompressed. Since you've said that the files giving you errors are pretty small (<50 MB), and given the traceback, I'm guessing these files didn't download correctly.

fastq-dump was giving me issues this morning trying to download DRR015930 off of SRA and I had to resort to going over to DDBJ to actually grab the data. I downloaded the bz2 compressed files, and converted them over to gzipped files.
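For anyone who wants to do the bz2-to-gzip conversion from Python rather than the shell, here is one possible sketch using only the standard library. It is not necessarily how the files below were converted, and the function name `bz2_to_gzip` is made up for illustration.

```python
import bz2
import gzip
import shutil


def bz2_to_gzip(src_path, dst_path):
    """Recompress a .bz2 file as .gz without loading it all into memory.

    Hypothetical helper for illustration; not part of stringMLST.
    """
    with bz2.open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        # copyfileobj streams the decompressed bytes across in chunks.
        shutil.copyfileobj(src, dst)
```

A call such as `bz2_to_gzip("DRR015930_1.fastq.bz2", "DRR015930_1.fastq.gz")` would produce a gzipped file that can be passed to `--predict`; the `.bz2` input name here is assumed.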

$ du -sB MB DRR015930_1.fastq DRR015930_1.fastq.gz DRR015930_2.fastq DRR015930_2.fastq.gz
7596MB  DRR015930_1.fastq
1738MB  DRR015930_1.fastq.gz
7596MB  DRR015930_2.fastq
1809MB  DRR015930_2.fastq.gz

$ stringMLST.py --getMLST --species ecoli1 -P Ec
        Database ready for escherichia coli#1 
        Ec
$ stringMLST.py --predict -P Ec/ecoli1 -1 DRR015930_1.fastq.gz -2 DRR015930_2.fastq.gz
Sample  adk     fumC    gyrB    icd     mdh     purA    recA    ST
DRR015930       2       79      59      2       2       2       2       260

florathecat commented 5 years ago

Hi Aroon,

Thanks! I didn't even realize that the files were corrupted. I re-downloaded the smaller files (ERR1777574), and indeed the new files are bigger than the ones I originally had. The new files now give me an ST. As for the large files (DRR015930) that gave me an empty result, fastq-dump gave me issues, too, so I am using DDBJ as you recommended.

Many Many Thanks and may you enjoy a great holiday season!

ar0ch commented 5 years ago

Great! Have a great end of year as well :)