lmrodriguezr / nonpareil

Estimate metagenomic coverage and sequence diversity
http://enve-omics.ce.gatech.edu/nonpareil/

Read length fatal error #37

Closed: sturne29 closed this issue 4 years ago

sturne29 commented 5 years ago

Hello! I'm trying to use Nonpareil to estimate the coverage of paired-end metagenomes generated on the NovaSeq platform, and many of the samples give me the same error.

Command (in folder containing files):

for f in *_L002_R1_001.fastq; do nonpareil -s "$f" -T kmer -f fastq -b "/nonpareil_out/${f%.*}"; done

Error:

Fatal error: Reads are required to have a minimum length of kmer size

My reads are ~250 bp long (I verified this during my quality-checking and preprocessing steps), so I'm not sure why I'm getting this error. I'm only using one read of each pair, with default parameters. Oddly, some files that appear to have essentially the same properties finish without issue. Any ideas why this might be happening? I've used Nonpareil successfully with other samples and it's been very helpful, so I'm hoping to use it with these samples as well.
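In case it helps with reproducing this, here is roughly how I checked the read lengths (a quick awk sketch; the filename is just an example):

# Print the shortest sequence length in a FASTQ file (sequences are every 4th line, starting at line 2)
$ awk 'NR % 4 == 2 { if (min == "" || length($0) < min) min = length($0) } END { print "Shortest read:", min, "bp" }' sample_L002_R1_001.fastq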

Thanks so much in advance!

mldillon-LBL commented 4 years ago

Hi, I'm having this same issue.

nonpareil -s ../Vault_Tunnel_q30/Vault_Tunnel_q30.fastq -T kmer -f fastq -b Vault_Tunnel_nonpareil

@sturne29 did you find a solution?

Thanks!

sturne29 commented 4 years ago

Unfortunately, no; I haven't really had time to try much of anything myself. I'm still definitely interested in an answer, and I'll share anything that works if I get a chance to experiment.

lmrodriguezr commented 4 years ago

Hello! Any chance one or more entries in the FastQ are empty or shorter than 21 bp? If possible, could you please share a FastQ file that produces this error?
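If you want a quick way to check before uploading anything, something like this should count the offending entries (an awk sketch; substitute your own file name):

# Count FASTQ entries whose sequence line is empty or shorter than 21 bp
$ awk 'NR % 4 == 2 && length($0) < 21 { n++ } END { print n + 0, "reads shorter than 21 bp" }' input.fastq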

Thanks! Miguel.

sturne29 commented 4 years ago

I think my preprocessing steps should have removed any reads that short, but I'll double-check to make sure. My files are mostly quite large (> 2 GB), so I don't know whether sharing one is feasible, but I'll see what I can do.

mldillon-LBL commented 4 years ago

I thought this might be the problem, so I removed short reads using bbduk. It did remove 13 reads, but I got the same error. I tried again on a smaller file with the same result (note: the suffix says paired, but these are the forward reads only):

$ bbduk.sh in=../Vault_Tunnel_q30/RM-CC-RM2-VT-3-3s_S4_L008_R1_001_q30.fastq.paired.fq out=RM-CC-RM2-VT-3-3s_S4_L008_R1_001_q30_21.fastq minlen=21

BBDuk version 36.92
NOTE: No reference files specified, no trimming mode, no min avg quality, no histograms - read sequences will not be changed.
Initial: Memory: max=158257m, free=152477m, used=5780m

Input is being processed as unpaired
Started output streams: 0.064 seconds.
Processing time: 3.481 seconds.

Input:           3324285 reads   282973798 bases.
Total Removed:   3 reads (0.00%)   45 bases (0.00%)
Result:          3324282 reads (100.00%)   282973753 bases (100.00%)

Time:              3.559 seconds.
Reads Processed:   3324k   934.12k reads/sec
Bases Processed:   282m    79.52m bases/sec

$ nonpareil -s RM-CC-RM2-VT-3-3s_S4_L008_R1_001_q30_21.fastq -T kmer -f fastq -b RM-CC-RM2-VT-3-3s_S4_L008_R1_001_q30_21.nonpareil

Nonpareil v3.303
[ 0.0] reading RM-CC-RM2-VT-3-3s_S4_L008_R1_001_q30_21.fastq
[ 0.0] Picking 10000 random sequences
[ 0.0] Started counting
Fatal error: Reads are required to have a minimum length of kmer size
[ 0.1] Fatal error: Reads are required to have a minimum length of kmer size

Unfortunately, even compressed, the smallest fastq I have is larger than the 10 MB upload limit here.

sturne29 commented 4 years ago

Positive update!

Initially, I also used the bbduk.sh script to remove reads shorter than 21 bp, but I still got the same fatal error about minimum length in some files. I then wondered what would happen if I raised the length cutoff above 21 bp, so I decided to try it out.

Using bbduk.sh again, I removed any reads shorter than 50 bases, and everything now looks like it's working great. I chose 50 bp to err on the side of caution, so there may be a lower cutoff that still works and preserves more reads. (I only lost ~0.01% of reads per file during the removal step, so I don't plan on experimenting further in this case.)
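For reference, the filtering step looked roughly like this (file names here are placeholders):

# Drop any read shorter than 50 bp before running Nonpareil
$ bbduk.sh in=sample_R1.fastq out=sample_R1_minlen50.fastq minlen=50

Presumably any cutoff at or above the k-mer size Nonpareil actually uses would also do; I just haven't tested lower values.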

Thanks for the input! It looks like the issue was probably on my end, because my initial processing did not actually remove all short reads.

@mldillon-LBL Maybe you could try something similar? I wonder if removing anything shorter than, say, 25 bases might work.

lmrodriguezr commented 4 years ago

Thanks @sturne29 !