nf-core / ampliseq

Amplicon sequencing analysis workflow using DADA2 and QIIME2
https://nf-co.re/ampliseq
MIT License

Proposal: LULU ASV post-clustering curation #609

Closed a4000 closed 1 year ago

a4000 commented 1 year ago

Description of feature

I have a LULU subworkflow that I can add to Ampliseq. The subworkflow uses blastn to create the matchlist for LULU, then uses LULU for post-clustering curation. The input files for the subworkflow are an ASV FASTA file and a TSV file. The TSV file is similar to the DADA2_table.tsv file that is already produced in Ampliseq. The output file is a curated version of that TSV file. I feel this should be easy enough to add to Ampliseq.
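For readers unfamiliar with the matchlist step, here is a minimal sketch of the two BLAST+ commands involved, written as a Python helper that only builds the argument lists and does not run anything. The flags and the 84 % identity / 80 % query-coverage defaults follow the example in the LULU README as best recalled; the function name, the file names and the default values are assumptions, not the subworkflow's actual code.

```python
from typing import List


def blastn_matchlist_commands(
    asv_fasta: str,
    matchlist: str = "match_list.txt",  # hypothetical output name
    min_identity: int = 84,             # assumed default, percent identity
    min_query_cov: int = 80,            # assumed default, percent query coverage
) -> List[List[str]]:
    """Build (but do not run) the commands for an all-vs-all ASV match list."""
    # Step 1: turn the ASV FASTA into a nucleotide BLAST database.
    make_db = ["makeblastdb", "-in", asv_fasta, "-parse_seqids", "-dbtype", "nucl"]
    # Step 2: search the ASVs against themselves and keep three columns:
    # query id, subject id, percent identity.
    blast = [
        "blastn",
        "-db", asv_fasta,
        "-query", asv_fasta,
        "-outfmt", "6 qseqid sseqid pident",
        "-out", matchlist,
        "-qcov_hsp_perc", str(min_query_cov),
        "-perc_identity", str(min_identity),
    ]
    return [make_db, blast]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` if BLAST+ is installed; in a Nextflow module the same arguments would go in the `script:` block instead.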

erikrikarddaniel commented 1 year ago

I suggest (like in #608) VSEARCH instead of BLASTN. Could save a lot of resources/energy I think.
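As a rough illustration of that suggestion, a match list with the three columns LULU consumes (query, target, percent identity) could be produced by a single VSEARCH all-vs-all search. The sketch below only assembles the argument list; the options are standard VSEARCH options, but the thresholds, file names and function name are assumptions chosen for illustration.

```python
from typing import List


def vsearch_matchlist_command(
    asv_fasta: str,
    matchlist: str = "match_list.txt",  # hypothetical output name
    min_identity: float = 0.84,         # assumed threshold, fraction identity
    min_query_cov: float = 0.9,         # assumed threshold, fraction of query aligned
    max_hits: int = 10,                 # assumed cap on reported hits per query
) -> List[str]:
    """Build (but do not run) a VSEARCH all-vs-all search for a match list."""
    return [
        "vsearch",
        "--usearch_global", asv_fasta,  # queries: the ASVs themselves
        "--db", asv_fasta,              # targets: the same ASVs
        "--self",                       # skip hits of a sequence against itself
        "--id", str(min_identity),
        "--iddef", "1",
        "--query_cov", str(min_query_cov),
        "--maxaccepts", "0",            # do not stop after the first accepted hits
        "--maxhits", str(max_hits),
        "--userout", matchlist,
        "--userfields", "query+target+id",  # same three columns as the blastn variant
    ]
```

No database-building step is needed here, which is part of why this route may be lighter than the blastn one.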

We recently had a Slack discussion regarding post-denoising clustering (https://nfcore.slack.com/archives/CEA7TBJGJ/p1690893776838869), so that seems to be something people want. In that discussion, another tool -- swarm -- was proposed. I don't know what's best or most commonly used.

a4000 commented 1 year ago

Swarm is another tool I haven't tried. I'll test it out and look into the literature to see which tool seems more popular/better.

a4000 commented 1 year ago

I just read this interesting article (https://archimer.ifremer.fr/doc/00688/80057/83060.pdf). They used DADA2 for ASVs, then used swarm on the ASVs to get OTUs, then ran LULU to curate both the ASVs and the OTUs to check which method produces the most accurate results. To sum up their argument, the best method depends on your dataset and taxa of interest. So maybe it's best to have both swarm and LULU as optional steps.

a4000 commented 1 year ago

I noticed the QIIME2 modules have these lines:

container "qiime2/core:2022.11"

// Exit if running this module with -profile conda / -profile mamba
if (workflow.profile.tokenize(',').intersect(['conda', 'mamba']).size() >= 1) {
    exit 1, "QIIME2 does not support Conda. Please use Docker / Singularity / Podman instead."
}

Is it acceptable for me to do something similar to this when I just have a docker container for a tool, but no conda?

d4straub commented 1 year ago

If the Docker container was/is maintained over time (when no conda packages are available, that isn't to be taken for granted) and there is no way around it (such as choosing a similar tool that provides conda & container), then yes. In the example you quote it's QIIME2, which is a very popular tool, so that warrants the decision here I think.

a4000 commented 1 year ago

I've looked into this issue a bit more.

Problem: LULU doesn't have a conda package, and while there is a Docker container I'm using, it isn't part of the quay.io/biocontainers registry.

Is there a better method of post-clustering? It's hard to say. I can't find many studies testing different methods of post-clustering of ASVs. The LULU paper did compare LULU to post-clustering with dbotu3, with LULU performing better. I found a more recent tool called ReClustOR, but if I'm reading their paper correctly, it doesn't seem like they compared the tool to other post-clustering methods. They don't even mention LULU despite this paper being published after LULU's paper. I also can't find any Conda packages or Docker containers, so I'm not sold on this tool yet.

Is LULU popular? I found this paper that provides an overview of pipelines for metabarcoding studies and the paper mentions five pipelines that use LULU and one that uses ReClustOR.

Postclustering tools, such as LULU (Frøslev et al., 2017) are implemented in AMPTk (Palmer et al., 2018), eDNAflow (Mousavi-Derazmahalleh et al., 2021), APSCALE (Buchner et al., 2022), LotuS2, PipeCraft2 and ReClustOR (Terrat et al., 2020) in BIOCOM-PIPE.

So from what I can see, it seems like LULU is a relatively popular tool for post-clustering.

Do we need a tool designed for post-clustering curation, or can we use any clustering tool (e.g., Swarm) to functionally perform a similar role? The paper I mentioned earlier in this issue thread argues that post-clustering curation with LULU is different from post-clustering with Swarm and that the two have different purposes.

This indicates that LULU curation merges less ASVs than the amount grouped through clustering, and highlights the different purposes of both tools, LULU effectively removing spurious OTUs, while clustering allows removing haplotype diversity.

So maybe post-clustering with Swarm should be a separate issue. That's something I'll think about more.
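To make the distinction concrete, here is a simplified sketch of a LULU-style curation rule in Python. It is not the LULU R implementation: it assumes an ASV is merged into a more widespread "parent" only when the two are similar in the match list, the parent co-occurs in nearly all samples where the ASV is present, and the parent is more abundant in every shared sample. The function name, the data layout and the choice to compute ratios over shared samples only are illustrative assumptions; the default thresholds mirror the defaults described in the LULU paper as best recalled.

```python
from typing import Dict, List, Set, Tuple

Table = Dict[str, List[int]]  # ASV id -> read counts per sample


def curate_asvs(
    table: Table,
    matches: List[Tuple[str, str, float]],  # (query, target, percent identity)
    min_match: float = 84.0,
    min_cooccurrence: float = 0.95,
    min_ratio: float = 1.0,
) -> Tuple[Table, Dict[str, str]]:
    """Merge likely-spurious ASVs into parents; return (curated table, parent map)."""

    def spread(asv: str) -> int:
        return sum(1 for count in table[asv] if count > 0)

    # Most widespread, then most abundant ASVs are processed first.
    order = sorted(table, key=lambda a: (-spread(a), -sum(table[a]), a))
    rank = {asv: i for i, asv in enumerate(order)}

    # Candidate parents per ASV: sufficiently similar sequences from the match list.
    hits: Dict[str, Set[str]] = {}
    for query, target, pident in matches:
        if query != target and pident >= min_match and query in table and target in table:
            hits.setdefault(query, set()).add(target)

    parent_of: Dict[str, str] = {}
    for asv in order:
        parent_of[asv] = asv  # an ASV is its own parent unless merged below
        child = table[asv]
        present = [i for i, count in enumerate(child) if count > 0]
        if not present:
            continue
        # Only higher-ranked ASVs can act as parents.
        candidates = sorted((t for t in hits.get(asv, set()) if rank[t] < rank[asv]), key=rank.get)
        for cand in candidates:
            parent = table[cand]
            shared = [i for i in present if parent[i] > 0]
            if not shared or len(shared) / len(present) < min_cooccurrence:
                continue  # parent is missing from too many of the child's samples
            if min(parent[i] / child[i] for i in shared) > min_ratio:
                parent_of[asv] = parent_of[cand]  # follow the chain to the root parent
                break

    # Add each merged ASV's counts to its root parent.
    curated: Table = {}
    for asv in order:
        root = parent_of[asv]
        base = curated.get(root, [0] * len(table[asv]))
        curated[root] = [a + b for a, b in zip(base, table[asv])]
    return curated, parent_of
```

With a toy table of `ASV1 = [100, 80, 50]`, `ASV2 = [5, 4, 0]`, `ASV3 = [0, 30, 90]` and both minor ASVs matching ASV1, this rule merges ASV2 into ASV1 (always co-occurring and always less abundant) but keeps ASV3, because ASV3 outnumbers ASV1 in the third sample. A pure sequence-clustering step would group both with ASV1 whenever they fall within the identity threshold, regardless of how their abundances are distributed across samples.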

The Docker container I'm using isn't part of the biocontainers registry, but it does work. I can look into the process of getting a container added to the biocontainers repository if that would provide more confidence to the rest of the Ampliseq team, but I won't do that if it's not necessary. I have most of the code written locally to add this feature to Ampliseq, so it's just the container/conda issue that's holding me back.

d4straub commented 1 year ago

I am with @erikrikarddaniel that VSEARCH is a fine tool, and it also has proper containerization. And it can cluster sequences.

Papers from software developers about their own tools should typically be taken with a bit of skepticism; independent benchmarks are usually better. Benchmarks are sometimes hard to generalize, but they're still the best we've got for deciding on tools, ofc.

I had a quick look at bioconda: swarm is listed, as are AMPTk and apscale. Possibly those last two containers have all the requirements for LULU; that's also a way to get it ;)

Next steps, if you really want to go with LULU: ask the LULU devs to add it to bioconda. Otherwise, the #bioconda channel in the nf-core Slack might have experts who can help.

a4000 commented 1 year ago

I think for now I'll try VSEARCH because there is already an nf-core module for VSEARCH_CLUSTER. I have noticed a bug with this module, so I'll fix the bug in the module and try adding the fixed module to the pipeline.

a4000 commented 1 year ago

I'm closing this issue for now. I've added VSEARCH instead of LULU for ASV post-clustering.