Make use of base quality score - Fastq support

lerch-a commented 8 years ago

Hi,

thanks a lot for this software. The concept is very convincing to me.

I would just like to ask why the base quality score is not used during the clustering and dereplication process. Is it not possible because of the algorithm, or would it slow down the clustering process? What about incorporating the Phred score after the OTUs are finished, e.g. as a consensus quality score?

Thanks.

frederic-mahe commented 8 years ago

Hi @nidlae,

I am preparing a general answer to that question, but the manuscript is not ready yet. In short, swarm is a denoising tool, but not the only one we should apply to our data.

In my opinion, clustering should be, as far as possible, a lossless process (same number of reads in and out). Quality filtering is a means of reducing the size of datasets before clustering to speed up computation. Since swarm is faster than other clustering methods, filtering before clustering is no longer a justified ordering of operations.

Here is what I do for my data:

paired-ends assembly with vsearch,
demultiplexing and primer clipping with cutadapt,
fastq to fasta conversion, dereplication and quality values (i.e. expected error rates per sequence) with vsearch,
dereplication of the whole project (pool all samples) with vsearch,
clustering with swarm.
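The "expected error rates per sequence" in the third step can be illustrated with a short Python sketch. It is not the vsearch implementation, only the standard definition: the expected number of errors in a read is the sum of the per-base error probabilities 10^(-Q/10). The function names are hypothetical, and Phred+33 encoding is an assumption about the input files.

```python
def expected_error(quality: str, offset: int = 33) -> float:
    """Sum of per-base error probabilities for one read.

    Assumes the quality string uses Phred+33 encoding (offset=33).
    """
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in quality)


def expected_error_rate(quality: str, offset: int = 33) -> float:
    """Expected error divided by read length (the 'ee / length' value)."""
    if not quality:
        raise ValueError("empty quality string")
    return expected_error(quality, offset) / len(quality)


# 'I' encodes Q40 and '5' encodes Q20 in Phred+33.
perfect = "I" * 100             # every base at Q40
mixed = "I" * 50 + "5" * 50     # half of the bases at Q20

print(expected_error_rate(perfect))  # approximately 0.0001
print(expected_error_rate(mixed))    # approximately 0.00505
```

A read in which every base has Q40 has an error probability of 10^-4 per base, which is why an expected error per length of 0.0001 corresponds to the best possible quality when Q40 is the maximum score.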

After that, I only work on OTU representatives:

taxonomic assignment with vsearch and custom scripts,
chimera detection with vsearch.

I then build an OTU table: OTUs vs samples, plus taxonomic assignment results, OTU dispersion in samples, chimeric status, and quality values. For each OTU representative, I search for the best expected error value divided by the length (a value of 0.0001 is ideal). In layman's terms, I reject an OTU if I don't see at least one copy of its representative sequence with top quality (ee = 0.0001). In practice, I leave a small margin of error and keep OTUs with an ee value < 0.0002.
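One way to carry the best quality value through dereplication and apply the threshold afterwards is sketched below in Python. This is an illustration under stated assumptions, not the actual scripts used for the pipeline: the function names, the in-memory `(sequence, quality)` input, and the Phred+33 encoding are all hypothetical choices made for the example.

```python
from collections import defaultdict


def ee_rate(quality: str, offset: int = 33) -> float:
    """Expected error divided by read length (Phred+33 assumed)."""
    return sum(10 ** (-(ord(c) - offset) / 10) for c in quality) / len(quality)


def dereplicate(reads):
    """Collapse identical sequences, keeping the lowest ee/length seen.

    reads: iterable of (sequence, quality_string) pairs.
    Returns {sequence: (abundance, best_ee_rate)}.
    """
    abundance = defaultdict(int)
    best = {}
    for seq, qual in reads:
        rate = ee_rate(qual)
        abundance[seq] += 1
        if seq not in best or rate < best[seq]:
            best[seq] = rate
    return {seq: (abundance[seq], best[seq]) for seq in abundance}


def passes_quality(best_rate: float, threshold: float = 0.0002) -> bool:
    """Keep an OTU only if its representative was seen with ee/length < threshold."""
    return best_rate < threshold


reads = [
    ("ACGT", "IIII"),  # Q40 copy of the sequence
    ("ACGT", "5555"),  # Q20 copy of the same sequence
    ("TTGA", "5555"),  # only ever seen at Q20
]
table = dereplicate(reads)
kept = [seq for seq, (_, rate) in table.items() if passes_quality(rate)]
print(kept)  # ['ACGT']
```

Because no read is discarded, the process stays lossless: "TTGA" remains in the table with its abundance and is only flagged at the final filtering step.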

As you can see, all filtering is postponed until after clustering to retain potential signal for as long as possible. Only at the very end are OTUs filtered, using different sources of data (quality, taxonomic assignment or not, chimera or not, presence of the OTU in several samples or not, etc.).

I hope it answers your question.

colinbrislawn commented 8 years ago

Thank you for taking the time to discuss your methods. Your perspective is fascinating.

How do you build your OTU table in such a way that it includes all that additional information about your swarm centroids? I know you describe this in the wiki and I would love to hear how you include more info.

Colin

lerch-a commented 8 years ago

Hi @frederic-mahe

Yes, this answers my question beyond what I had hoped for. I'm looking forward to the finished manuscript. I share your opinion on a lossless process. Filtering of OTUs should be the last step.

I use amplicon sequencing to genotype samples with multi-clone infections, so the final filtering step is crucial for determining the correct multiplicity of infection in a sample.

Thanks a lot.

tobiasgf commented 8 years ago

I really like this approach of doing most (all) of the filtering as final steps. However, I am wondering how to keep the "...chimeric status, and quality values..." all the way to the OTU table. In other words, how can you keep the ee score (or ee/seq_length) of the best-scoring read in an OTU through dereplication and swarm clustering? Do you get that when you map the "original reads" back against the OTU representatives?

Looking forward to more.

frederic-mahe commented 8 years ago

I'll try to describe a complete pipeline (from raw fastq to OTU table). I found a suitable ITS1 experiment (i.e. not too complicated) that I will describe on a GitHub wiki page. I'll post the link here when it is ready.

Please be patient, I am completely swamped by other projects.

frederic-mahe commented 8 years ago

That page describes my OTU delineation pipeline and my strategy to preserve and use read quality values.

I hope it answers @nidlae's question.

lerch-a commented 8 years ago

Thanks a lot @frederic-mahe. Yes, this helps a lot.

On 1 May 2016, at 3:43 AM, Frédéric Mahé notifications@github.com wrote:

Closed #71 https://github.com/torognes/swarm/issues/71.

