lh3 / seqtk

Toolkit for processing sequences in FASTA/Q formats
MIT License
1.37k stars 308 forks source link

Sample procudes different numbers of reads despite seed being set #96

Closed hhollandmoritz closed 6 years ago

hhollandmoritz commented 7 years ago

Hi,

I'm trying to use seqtk to sub-sample my paired-end fastq files but keep running into the issue that sub-sampling produces different numbers of reads. This happens whether or not I specify the desired number of reads as a fraction or integer. The files I'm interested in contain 8512483 reads each. Subsampling at 25 percent should give roughly 2128121 reads (when rounded up). However running

seqtk sample -s 100 "$read_prefix"/trimmedAdpt."$sample".R1.fastq 2128121 > "$read_prefix"/trimmedAdpt."$sample".ss_"$subsamp".R1.fastq seqtk sample -s 100 "$read_prefix"/trimmedAdpt."$sample".R2.fastq 2128121 > "$read_prefix"/trimmedAdpt."$sample".ss_"$subsamp".R2.fastq

yields 2127079 reads for R1 and 2127083 for R2. Similarly, running

seqtk sample -s 100 "$read_prefix"/trimmedAdpt."$sample".R1.fastq 0.25 > "$read_prefix"/trimmedAdpt."$sample".ss_"$subsamp".R1.fastq seqtk sample -s 100 "$read_prefix"/trimmedAdpt."$sample".R2.fastq 0.25 > "$read_prefix"/trimmedAdpt."$sample".ss_"$subsamp".R2.fastq

yields 2125234 for R1 and 2125258 for R2. From a small survey of reads in the two subsampled files, they seem to be correctly paired.

Any insights into how to fix this would be welcome. Thanks, Hannah

shenwei356 commented 7 years ago

How do you count reads?

# not precise
grep '^@' xxx.fastq | wc -l  

This may be the problem, cause quality line would be starting with @

hhollandmoritz commented 7 years ago

Hi,

Thanks for the speedy reply! :)

I looked at the headers of each fastq read and used the common string designating the machine it was sequenced on for the pattern match.

fgrep -c "@HISEQ07" *.fastq

I'm fairly certain it is accurate since I've also verified it on the original file with a wc -l divided by four.

But to be thorough I just went back and counted with wc -l on the processed samples as well and may have found a clue about what might be going wrong. The word counts are not evenly divisible by 4, (though they are even numbers) could something be dropping parts of reads for some reason? I also went back and counted for lines that start and end with a "+" in the sub-sampled reads I get the same answer as the grep for headers.

grep -c "^+$" *.fastq

gives: Unsubsampled R1: 8512483 Unsubsampled R2: 8512483 0.25 % subsampled R1 : 2125234 0.25 % subsampled R2 :2125258

For reference: wc -l *.fastq

gives: Unsubsampled R1: 34049932 Unsubsampled R1: 34049932 0.25 % subsampled R1: 8512686 0.25 % subsampled R2: 8512734

Thanks!

shenwei356 commented 7 years ago

Can't fingure out if it is a bug or due to reads file without access to the files.

I don't know whether introducing another tool is OK here. Hope you don't mind, Heng.

You can use seqkit to help find the cause. Use "seqkit stats" to count reads and lengths, and also use "seqkit sample" for sampling by number or fraction (prefered).

hhollandmoritz commented 7 years ago

Cool, thanks for the suggestions, I'll try that!

lh3 commented 7 years ago

@shenwei356 seqkit is great. I always welcome suggestions on better tools.

@hhollandmoritz It seems that your fastq is corrupted somewhere. Unfortunately, seqtk is not robust to such errors. Please try seqkit as @shenwei356 has suggested.

hhollandmoritz commented 7 years ago

Thanks both of you! Heng Li was correct. My fastqs were corrupted from an apparently over-zealous adapter-trimming pipeline. After fixing the fastqs seqtk works like a charm. Thank you for your help with troubleshooting.

tseemann commented 6 years ago

@hhollandmoritz can you close this issue please?

hhollandmoritz commented 6 years ago

Yep, sorry about that.