nextgenusfs / amptk

AMPtk: Amplicon ToolKit for NGS data (formally UFITS)
http://amptk.readthedocs.io/
BSD 2-Clause "Simplified" License
35 stars 14 forks source link

Illumina: Set Merge Options #53

Open Aciole-David opened 5 years ago

Aciole-David commented 5 years ago

Hello I'm testing amptk on some ITS data. The "amptk illumina" log shows some values like: ######################################### Merging reads 100% 107030 Pairs 8000 Merged (7.5%) 99030 Not merged (92.5%)

Pairs that failed merging due to various reasons: 1008 too few kmers found on same diagonal 389 multiple potential alignments 27692 too many differences 69941 alignment score too low, or score drop to high ############################################

Along all the reads, I got around 60-95 % Not merged. 1st question is: Is this right? Because after concatenating, the number of valid output reads looks like if most of the pairs were merged (i.e. there is ~25 milion merged reads from 50 milion raw paired-end reads); is this the "--rescue_forward" step?

2nd question is: Maybe is there any control option for the merging (overlap/diff) that I could make? The vsearch manual state that default parameters are 10 min bases overlap/10 max bases diff (If I understood it right).

Thanks!

nextgenusfs commented 5 years ago

What are your read lengths?

So the default method is to merge reads with usearch, if a pair does not merge then the forward read is rescued. I have typically been able to merge ~90-95% of reads - we have used 2 x 300 using a double indexed strategy and default illumina sequencing primers.

Per the code, the merging command for usearch/vsearch is here, you'll note that I have some somewhat customized calls to usearch, this is actually to increase the number of reads that are merged. https://github.com/nextgenusfs/amptk/blob/master/amptk/amptklib.py#L1180-L1183

As of right now there isn't the ability to customize the merging command in AMPtk. I guess my thought would be to try on a single sample several different parameters to usearch/vsearch and see if you can increase the merging. My guess might be that you have low quality reverse reads and they are getting clipped off and therefore not merged. If the quality score is below 5 it will get truncated before the alignment step. So you can also check your quality with some like fastqc or there are many other ways, but basically look at the quality scores near the 3' end of the reverse reads -- this is typically where some MiSeq runs fail (especially with 2 x 300). If you have low quality reverse reads, you might want to just use the forward reads, or allow the merging as above, but set your trim length to less than your read length, i.e. if you have 250 bp reads, then I'd set it to something like 230-240 so you keep as much data as you can.

Aciole-David commented 5 years ago

Thanks for the quick reply Our reads are also 2 x 300/double index/MiSeq We do have low Q around the R2 3'. Maybe adapters read-through too. I hope using forwards only do the trick I'll test your suggestion on trim and show the counts Thanks again