yacrd for detecting chimeras in amplicon sequences

Robvh-git commented 1 year ago

Hi,

I want to detect chimeras in 16S nanopore data, similar to this post. I've tried vsearch, but since vsearch was developed for high-quality short reads, I suspect it reports a lot of false-positive chimeric sequences.

@natir in that post you state "If a read has a poor-quality region in the middle, it's considered chimeric.". But, if I'm not mistaken, this does not lead to correct detection of amplicon chimeras? In amplicon sequencing, the error profile (i.e. the poor-quality region) is not related to whether the read is chimeric.

So am I correct in saying that yacrd is not suitable for chimera detection in amplicon nanopore data?

natir commented 1 year ago

Indeed, yacrd has not been designed, tested or even optimized for this type of data, so if the publication or the readme suggests otherwise, I'm sorry.

That said, it's possible that yacrd can detect chimeras in this situation, but it would require a specific set of parameters. I'm willing to help you find these parameters and document them in the readme, to modify yacrd if necessary and possible, and even to write a small publication if there's enough material.

You can use the pipelines I designed for the yacrd publication to search for these parameter values. But you'd need one or more ground-truth data sets to determine them, and I don't have any (nor enough time to devote to it, sorry).

About reads not sufficiently covered: if a read is not sufficiently covered, we can't determine whether it's a chimera or not, because there's not enough information. By default, if a read isn't covered for 40% of its length, it's considered not sufficiently covered. Such a read may still be a chimera, but yacrd can't tell. This 40% value is a modifiable parameter.

In my opinion, a read that is not sufficiently covered is not usable and could even be a chimera, but yacrd cannot determine that it is chimeric because there is no information to go on.
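
To make the 40% rule concrete, here is a minimal Python sketch. It is not yacrd's actual code: the function names, the `(start, end)` interval representation of overlaps, and the choice of `>=` for the comparison are all assumptions made for illustration.

```python
def uncovered_fraction(read_length, overlaps):
    """Return the fraction of a read not covered by any overlap.

    `overlaps` is a list of half-open (start, end) intervals on the read.
    Hypothetical helper, not part of yacrd.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(overlaps):
        # Clip to the read and skip the part already counted.
        start = max(start, last_end, 0)
        end = min(end, read_length)
        if end > start:
            covered += end - start
            last_end = end
    return 1.0 - covered / read_length


def is_not_covered(read_length, overlaps, max_uncovered=0.4):
    """Flag a read whose uncovered fraction reaches the threshold.

    The 0.4 default mirrors the 40% value mentioned above; treating the
    boundary as inclusive is an assumption.
    """
    return uncovered_fraction(read_length, overlaps) >= max_uncovered


if __name__ == "__main__":
    # Half of a 1000 bp read is covered, so it would be flagged.
    print(is_not_covered(1000, [(0, 300), (200, 500)]))
    # 70% of the read is covered, so it would not be flagged.
    print(is_not_covered(1000, [(0, 700)]))
```

A read flagged this way is simply set aside: nothing can be said about whether it is chimeric.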

If you want, we can have a call to discuss all of this (send me an e-mail to plan it).

Robvh-git commented 1 year ago

Hi @natir

Thanks for the quick and elaborate reply, much appreciated! No, the publication/readme does not necessarily suggest it, but I think it could be good to add a little note that this pipeline is not developed/tested/optimized for amplicon data. For now I'll search for another option.

You mention the coverage of reads, as is mentioned in the readme, but how does this relate to your statement "If a read has a poor-quality region in the middle, it's considered chimeric"? By "poor-quality", do you mean poor base quality or poor coverage?

natir commented 1 year ago

yacrd is based on the idea that if a read is covered, its information is true because many other reads carry the same information. As a corollary, a read that is not covered is necessarily a read of poor quality.
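
The coverage idea above, combined with the earlier statement that a poor-quality region in the middle of a read marks it as chimeric, can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not yacrd's implementation: the function names, the `min_coverage` parameter, and the half-open `(start, end)` overlap intervals are hypothetical.

```python
def low_coverage_regions(read_length, overlaps, min_coverage=0):
    """Return half-open (start, end) regions whose depth is <= min_coverage.

    Hypothetical helper: depth is the number of overlaps spanning a base.
    """
    # Difference array: +1 where an overlap starts, -1 where it ends.
    diff = [0] * (read_length + 1)
    for start, end in overlaps:
        start = max(start, 0)
        end = min(end, read_length)
        if end > start:
            diff[start] += 1
            diff[end] -= 1

    regions = []
    depth = 0
    region_start = None
    for pos in range(read_length):
        depth += diff[pos]
        if depth <= min_coverage:
            if region_start is None:
                region_start = pos
        elif region_start is not None:
            regions.append((region_start, pos))
            region_start = None
    if region_start is not None:
        regions.append((region_start, read_length))
    return regions


def has_internal_gap(read_length, overlaps, min_coverage=0):
    """True if some low-coverage region touches neither end of the read."""
    return any(
        start > 0 and end < read_length
        for start, end in low_coverage_regions(read_length, overlaps, min_coverage)
    )


if __name__ == "__main__":
    # Two well-covered flanks with nothing spanning 420..600: chimera-like.
    print(has_internal_gap(1000, [(0, 400), (50, 420), (600, 1000)]))
    # Only the first 100 bases are uncovered: the gap touches the read start.
    print(has_internal_gap(1000, [(100, 1000)]))
```

In this sketch, "poor quality" is measured purely by coverage from other reads; base quality scores play no role.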

natir/yacrd, issue #54