roblanf / sangeranalyseR

functions to analyse sanger sequencing reads in R
MIT License
96 stars 24 forks source link

improper merge/contig #66

Open tomsauv opened 3 years ago

tomsauv commented 3 years ago

Hello,

Using trimming methods M1, my trimmed reads are too short to overlap and an erroneous contig is created (see below numerous degeneracies). Relaxing M1TrimmingCutoff to 0.0002, the merge is proper and the sequence longer (and no major degeneracies).

I verified that there is no similarity between the forward and reverse trimmed reads in the first case, why is a merge happening? none should happen?

Thanks

AB4-13_trimM1_0_0001 TACGGGAGGCMKCRRKGKSGAMKKTWYCGMWAWGSKYGAWAKCSWRWYSSWGCARSMCCGSRYGGGKGAGKAASGCTCTA GGGKTGWRAAYCYSYYTTCTTTGGGAAGAAGAASKKMYGGKACMAMAGRARKMARCYTCKGCTAAYYCCGKRYSAGCMGC CGCGGTAATACGGAGGAGGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTKKMRRTRWAAGWYTRMTKK CMWRAMSMWGRGCTCRMSTCTGRTYAGSYAGTKGAAACTACGTAGCTAGAGTYTGGTARRGGCAWASSRARKTCCCGGTG TAGCGGTGAAATGCGTAGAGATCGGGAAGAACATCGGTGGCGAAAGCGCTTTGCTAGACCAGAACTGACACTCAGGGACG AAAGCTAGGGGAGCGAATGGGATTAGATACCCCAGTAGTCGGGGGATCCACTAGTTCTAGAGCGGCCGCCACCGCGGTGG AGCTCCAGCT

roblanf commented 3 years ago

Hi @tomsauv,

Thanks for raising this. The merge happens because you've told sangeranalyseR to try and merge these two reads into a contig, so it will go ahead and attempt it. You're right though, we should think about a way to flag situations like this and warn the user.

Any chance you could post a screenshot of the alignment of the two reads and the consensus (you can get that from the shiny app) in the good and the bad case? That will definitely help us figure out a sensible way forwards. Ideally we need some kind of statistic to just say whether the reads are alignable at all, but I'm not aware of any such statistic. There must be a likelihood approach that would let us to a simple hypothesis test though, I'll have a think...

Rob

tomsauv commented 3 years ago

Hi Rob,

I positioned the forward trimmed sequence where degeneracies start on the contig and it corresponds (see screenshot). So the forward aligns itself at that spot.

Screen Shot 2021-01-26 at 17 55 22

Thomas

roblanf commented 3 years ago

@Kuanhao-Chao, we need to have a think about ways to identify, flag, (and maybe an option to disallow) these kinds of alignments.

Post any thoughts and ideas here!!

The simplest one I can think of would be to allow users to specify a maximum number or percentage of mismatches between the reads and the contig. For example, typically we might think that you'd get 1% mismatches through things like sequencing errors and/or ambiguities in the read. But certainly if you got 10% mismatches that would usually indicate a problem, and you might then want to exclude that read from the contig (and not build the contig at all in that case). If a read fails this filter, we should also spit a warning to suggest to users to do what @tomsauv did and play around with the trimming settings.

To go one further, we could add an option to auto-tune the trimming settings here. The idea there is that if this option is turned on, we start at whatever value the user suggests, but if a read fails because of the mismatch-to-contig filter, we then try to optimise the trimming value of the reads for that contig, and keep the reads if we can find a trimming value that leads to a legitimate conitg being built (i.e. enough overlap between reads, and few enough mismatches).

@tomsauv, what's your opinion? Would something like the simple filter (ignore the auto-tuning for now) work? E.g. a filter with a default of 10%, where a read is excluded if it has >10% mismatches to the assembled contig?

Rob

roblanf commented 3 years ago

just a small update - it occurs to me that any mismatch threshold should be calculated only for the bases where >1 read has contributed to the contig! And also, the threshold should obviously count as a mismatch something where the read is certain but the contig is ambiguous, but not if the read is ambiguous but the contig is not.

tomsauv commented 3 years ago

Hello,

A good approach could be to compute the % divergence of the overlaping segment and store it in your object (i.e. 100-((nt mismatch/nt in overlap)*100). Then you can look at the distribution of those values; improper merges should results (hopefully) in clear outliers as compared to proper merges. As you mentioned, for those improper merges you could then move to a different trimming setting to see if a better merge can be obtained, and if not eliminate/flag the contig as improper.

This approach may also allow you to spot chimeras. For instance in the batch I am processing, one sample has a Fwd read from organism A and the reverse read is of an organism B (contaminant that coamplified during PCR...). The % divergence of the overlap in such case may also be an outlier as compared to non-chimeric merges.

Another reason using a hard threshold could be an issue is that it may vary for different molecules, for e.g something very conserved like 16S vs. something more variable like COI...In this respect, user should probably only process batches of a same gene at a time(?)

Are you familiar with the package DADA for metabarcoding? it has an iteresting approach in which Fwd reads are cleaned/corrected together, and likewise for Rev reads. Only after this step the merge of paired-end reads is done. This could be a trimming approach to keep good data that may have low quality score at the end of the chromatograms and subsequently enhance the merges.

Regarding the generateReport function, I am not sure if it takes a very long time or if it is hanging, but I ended up stopping it.

Thomas