torognes / swarm

A robust and fast clustering method for amplicon-based studies

Develop a swarm-plugin for Qiime 2 #89

Open frederic-mahe opened 7 years ago

frederic-mahe commented 7 years ago

Qiime 2 now offers an interface for third-party plugins. Plugin creation does not seem complicated: the plugin would be a python 3 wrapper exposing some or all of swarm's functionalities.
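
Under the hood, such a wrapper would mostly build and run a swarm command line; a minimal sketch (file names and parameter values are placeholders):

```sh
# minimal sketch of the swarm call a wrapper would issue
# (file names and parameter values are placeholders; input sequences
# must carry abundance annotations, e.g. >id_123)
swarm \
    --differences 1 \
    --threads 4 \
    --output-file clusters.txt \
    --seeds representatives.fasta \
    dereplicated.fasta
```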

torognes commented 7 years ago

Sounds like a good idea!

stheil15 commented 6 years ago

Any update?

colinbrislawn commented 5 years ago

With swarm 3.0 fast approaching (#122), the increasing popularity of Exact Sequence Variants, and the publication of the Qiime 2 paper, this might be the perfect time to build a q2-swarm plugin.

🚀 Colin

frederic-mahe commented 5 years ago

@colinbrislawn you are right, but I don't really know where to begin. Would you help me kickstart that plugin?

colinbrislawn commented 5 years ago

Thanks @frederic-mahe! I'm honored you reached out to me, but I'm not sure where to begin either. I guess I would look to the q2-vsearch plugin as a template, then build from there. https://github.com/qiime2/q2-vsearch

@thermokarst, could you make us an official q2-swarm repo and invite us as contributors?

thermokarst commented 5 years ago

Hey there @colinbrislawn! This plugin idea sounds really interesting, and good news: no need for us to make you a repo! Since QIIME 2 is decentralized, you can create the plugin wherever you want, then share it with users by registering it at the QIIME 2 Library! The Library entry can contain instructions letting users know how to get and install your plugin.

colinbrislawn commented 2 months ago

Summary of steps in Fred's metabarcoding pipeline, as I understand it, and what's already wrapped in Qiime2:

| program | idea | existing q2 plugin | what |
| --- | --- | --- | --- |
| qiime tools import | | | |
| cutadapt | | q2-cutadapt | |
| vsearch | fastq_mergepairs | q2-vsearch | |
| vsearch | fastq_filter | extend q2-vsearch | add --fastq_filter |
| vsearch | per-sample derep | extend q2-vsearch | |
| sed | per-read quality | none | the lowest expected error rate observed for each unique sequence |
| vsearch | global derep | q2-vsearch | |
| swarm | make ASVs! | none | then sort with vsearch |
| vsearch | uchime_denovo | q2-vsearch | |
| OTU_contingency_table.py | make feature table | | |

This is a fully featured pipeline that differs from what's already in Qiime2 in a number of ways. Specifically, the per-sample derep...

One easy way forward is to make a q2-swarm plugin that replaces only vsearch's `cluster-features-de-novo` action.

This is in contrast to the DADA2 plugin, which implements its full, unique SOP. Replicating the full pipeline may be more powerful; per-sample derep and per-feature quality are interesting ideas!
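
For reference, here is the existing q2-vsearch action side by side with a purely hypothetical q2-swarm equivalent (the second command does not exist; the parameter names are just my guesses):

```sh
# existing q2-vsearch action (the template to mirror)
qiime vsearch cluster-features-de-novo \
    --i-sequences seqs.qza \
    --i-table table.qza \
    --p-perc-identity 0.97 \
    --o-clustered-table clustered-table.qza \
    --o-clustered-sequences clustered-seqs.qza

# hypothetical q2-swarm equivalent (does not exist; parameter names are guesses)
qiime swarm cluster-features-de-novo \
    --i-sequences seqs.qza \
    --i-table table.qza \
    --p-differences 1 \
    --p-fastidious \
    --o-clustered-table clustered-table.qza \
    --o-clustered-sequences clustered-seqs.qza
```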

Either way, adding `--fastq_filter` to q2-vsearch seems like a natural first step.
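
The underlying vsearch call such an action would wrap is simple; something like this (parameter values are only examples):

```sh
# example of the vsearch call a q2 fastq-filter action would wrap
# (parameter values are only examples)
vsearch \
    --fastq_filter sample.fastq \
    --fastq_maxee 1.0 \
    --fastq_maxns 0 \
    --fastaout sample_filtered.fasta
```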

frederic-mahe commented 2 months ago

I should have pointed this out sooner: here is my current swarm-based pipeline.

The way the pipeline is described (and scripts numbered) might be confusing. The beginning is quite similar to the old pipeline you were referring to: --fastq_filter is indeed required.

> One easy way forward is to make a q2-swarm plugin that replaces only vsearch's `cluster-features-de-novo` action.

I realize that replicating the whole pipeline in Qiime2 might not be easy, so I agree we should aim for an easier first target.

swarm has three major modes:

- `--differences 0`: dereplication,
- `--differences 1`: fast, high-resolution clustering,
- `--differences 2` or more: slower, lower-resolution clustering.
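
Stripped down to bare commands, the three modes look like this (file names are placeholders):

```sh
# the three modes as bare commands (placeholder file names;
# -z expects ;size= abundance annotations)
swarm -d 0 -z -o dereplicated.txt seqs.fasta   # dereplication
swarm -d 1 -f -z -o fine.txt seqs.fasta        # fast, high-resolution (-f requires -d 1)
swarm -d 2 -z -o broad.txt seqs.fasta          # slower, lower-resolution
```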

In my own work, I only use `--differences 1`: with the `--fastidious` option when clustering the whole project (`clustering()` below), or without it when working at the sample level (`list_local_clusters()` below).

```sh
list_local_clusters() {
    # per-sample clustering: keep only clusters with more than 2 reads
    # (column 2 of the statistics file is the cluster's total abundance;
    # do not use the fastidious option here)
    "${SWARM}" \
        --differences 1 \
        --threads "${THREADS}" \
        --usearch-abundance \
        --log /dev/null \
        --output-file /dev/null \
        --statistics-file - \
        "${SAMPLE}.fas" | \
        awk 'BEGIN {FS = OFS = "\t"} $2 > 2' > "${SAMPLE}.stats"
}

clustering() {
    # whole-project clustering (requires swarm 3 or more recent)
    "${SWARM}" \
        --differences 1 \
        --fastidious \
        --usearch-abundance \
        --threads "${THREADS}" \
        --internal-structure "${OUTPUT_STRUCT}" \
        --output-file "${OUTPUT_SWARMS}" \
        --statistics-file "${OUTPUT_STATS}" \
        --seeds "${OUTPUT_REPRESENTATIVES}" \
        "${FINAL_FASTA}" 2> "${OUTPUT_LOG}"
}
```

The input file is a dereplicated fasta file with abundance annotations (`;size=123[;]`, hence the option `--usearch-abundance`), containing only ACGT nucleotides. The command line options listed in these shell functions are the most relevant ones, at least in my opinion.
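
For example, such a file can be produced with vsearch (a sketch, not one of my pipeline's actual commands):

```sh
# sketch: dereplicate and add ;size= annotations with vsearch
vsearch \
    --derep_fulllength reads.fasta \
    --sizeout \
    --fasta_width 0 \
    --output dereplicated.fasta

# resulting headers look like:
# >seq1;size=123
# ACGTACGTACGT
```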

colinbrislawn commented 2 months ago

Thank you, this is extremely helpful!

I like the idea of starting small with the q2-swarm plugin.

> I only use `--differences 1`: with the `--fastidious` option when clustering the whole project (`clustering()`)...

Naturally!

I'm not sure how best to track feature counts through per-sample derep and clustering.

I understand what per-sample derep does and why it's faster to do this double-derep step. For some reason, I can't wrap my head around what q2 type it should be. Is this just another feature table? (I also had trouble with this last time.)

We could get counts for the feature table by remapping reads, like we did historically, but that loses the efficiency of the per-sample derep and ignores the internal structure of the swarms.
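
For the record, the historical remapping is roughly this (placeholder file names; the fixed identity threshold is exactly what ignores the swarms' internal structure):

```sh
# sketch of the historical remapping approach (placeholder file names);
# assumes reads carry sample annotations in their headers (e.g. ;sample=...)
vsearch \
    --usearch_global all_reads.fasta \
    --db cluster_representatives.fasta \
    --id 0.97 \
    --otutabout feature-table.tsv
```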

frederic-mahe commented 2 months ago

> I understand what per-sample derep does and why it's faster to do this double-derep step. For some reason, I can't wrap my head around what q2 type it should be. Is this just another feature table?

My pipeline must be confusing for anyone but me, sorry about that. The loop processes each pair of fastq files in 6 steps; the relevant ones are described below.

The double-derep step allows me to keep track of the origin of each unique sequence. The fasta files are parsed when building the final occurrence table.
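
In shell terms, the double dereplication is roughly this (a sketch with placeholder names; `--relabel_sha1` gives identical sequences identical names across samples, which is what makes their origin traceable):

```sh
# 1) dereplicate within each sample, keeping one fasta file per sample
for SAMPLE in sample_A sample_B ; do
    vsearch \
        --derep_fulllength "${SAMPLE}.fasta" \
        --sizeout \
        --relabel_sha1 \
        --output "${SAMPLE}.derep.fasta"
done

# 2) pool the per-sample files and dereplicate globally,
#    summing per-sample abundances (--sizein --sizeout)
cat sample_A.derep.fasta sample_B.derep.fasta | \
    vsearch \
        --derep_fulllength - \
        --sizein \
        --sizeout \
        --output global.derep.fasta
```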

`extract_expected_error_values` produces a table containing the best (lowest) expected error observed for each unique sequence. This table is also used when building the occurrence table, to filter out low-quality observations (a cluster whose seed sequence has a high expected error is discarded). The goal is to delay quality-based filtering, so it is performed after the clustering step.
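
A sketch of that extraction (not my exact script; `--eeout` makes vsearch append `;ee=...` to each header):

```sh
# sketch: get the lowest expected error (EE) observed per unique sequence
vsearch \
    --fastq_filter "${SAMPLE}.fastq" \
    --eeout \
    --fasta_width 0 \
    --fastaout - | \
    paste - - | \
    awk 'BEGIN {FS = OFS = "\t"}
         {
             ee = $1
             sub(/^.*;ee=/, "", ee); sub(/;.*$/, "", ee)
             if (!($2 in min) || ee + 0 < min[$2] + 0) min[$2] = ee
         }
         END {for (seq in min) print seq, min[seq]}' > "${SAMPLE}.ee.tsv"
```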

`list_local_clusters` produces a table of cluster seeds for each sample. This is used to cleave clusters into subclusters when the subclusters display different distribution patterns across samples. In practice, it makes it possible to distinguish entities that differ by a single nucleotide, but only when there is an ecological signal.