Open frederic-mahe opened 7 years ago
Sounds like a good idea!
Any update?
With swarm 3.0 fast approaching (#122), the increasing popularity of Exact Sequence Variants, and the publication of the Qiime 2 paper, this might be the perfect time to build a q2-swarm plugin.
🚀 Colin
@colinbrislawn you are right, but I don't really know where to begin. Would you help me kickstart that plugin?
Thanks @frederic-mahe! I'm honored you reached out to me, but I'm not sure where to begin either. I guess I would look to the q2-vsearch plugin as a template, then build from there. https://github.com/qiime2/q2-vsearch
@thermokarst, could you make us an official q2-swarm repo and invite us as contributors?
Hey there @colinbrislawn! This plugin idea sounds really interesting, and good news, no need for us to make you a repo! Since QIIME 2 is decentralized, you can create the plugin wherever you want, then you can share it with users by registering it at the QIIME 2 Library! The Library entry can contain instructions letting users know how to get your plugin and install it.
Summary of steps in Fred's-metabarcoding-pipeline, as I understand it, and what's already wrapped in Qiime2:
program | idea | existing q2 plugin | what |
---|---|---|---|
qiime tools import | | | |
cutadapt | | q2-cutadapt | |
vsearch | fastq_mergepairs | q2-vsearch | |
vsearch | fastq_filter | extend q2-vsearch | add --fastq_filter |
vsearch | per-sample derep | extend q2-vsearch | |
sed | per-read quality | none | the lowest expected error rate observed for each unique sequence |
vsearch | global derep | q2-vsearch | |
swarm | make ASVs! | none | then sort with vsearch |
vsearch | uchime_denovo | q2-vsearch | |
OTU_contingency_table.py | make feature table | | |
This is a fully featured pipeline that differs from what's already in Qiime2 in a number of ways. Specifically the per-sample derep...
One easy way forward is to make a q2-swarm plugin that replaces only the vsearch cluster-features-de-novo.
This is in contrast to the DADA2 plugin that implements its full, unique SOP. This may be more powerful; per-sample derep and per-feature quality are interesting ideas!
Either way, adding `--fastq_filter` to q2-vsearch seems like a natural first step.
I should have pointed that out sooner: here is my current swarm-based pipeline.
The way the pipeline is described (and scripts numbered) might be confusing. The beginning is quite similar to the old pipeline you were referring to: `--fastq_filter` is indeed required.
> One easy way forward is to make a q2-swarm plugin that replaces only the vsearch cluster-features-de-novo.
I realize that replicating the whole pipeline in Qiime2 might not be easy, so I agree we should aim for an easier first target.
`swarm` has three major modes: `--differences 0` (dereplication), `--differences 1` (fast, high-resolution clustering), and `--differences 2` and above (slower, lower-resolution clustering).
In my own work, I only use `--differences 1`, with the `--fastidious` option when clustering the whole project (`clustering()`), or without the `--fastidious` option when working at the sample level (`list_local_clusters()`).
```bash
list_local_clusters() {
    # retain only clusters with more than 2 reads
    # (do not use the fastidious option here)
    ${SWARM} \
        --differences 1 \
        --threads "${THREADS}" \
        --usearch-abundance \
        --log /dev/null \
        --output-file /dev/null \
        --statistics-file - \
        "${SAMPLE}.fas" | \
        awk 'BEGIN {FS = OFS = "\t"} $2 > 2' > "${SAMPLE}.stats"
}
```
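For anyone thinking about how the `awk` filter in `list_local_clusters()` would look inside a Python plugin, here is a minimal sketch. It only assumes what the shell function already implies: the statistics output is tab-separated and its second column holds the number of reads in the cluster. The function name `filter_clusters` and the `threshold` parameter are hypothetical, not part of swarm or QIIME 2.

```python
def filter_clusters(stats_lines, threshold=2):
    """Keep rows whose second column (reads per cluster) exceeds threshold.

    Hypothetical helper mirroring: awk 'BEGIN {FS = OFS = "\t"} $2 > 2'
    """
    kept = []
    for line in stats_lines:
        fields = line.rstrip("\n").split("\t")
        # skip empty or malformed rows rather than failing
        if len(fields) < 2:
            continue
        if int(fields[1]) > threshold:
            kept.append(fields)
    return kept
```

Passing an open file handle (or any iterable of lines) works, so the same function could read swarm's statistics file directly.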
```bash
clustering() {
    # swarm 3 or more recent
    "${SWARM}" \
        --differences 1 \
        --fastidious \
        --usearch-abundance \
        --threads "${THREADS}" \
        --internal-structure "${OUTPUT_STRUCT}" \
        --output-file "${OUTPUT_SWARMS}" \
        --statistics-file "${OUTPUT_STATS}" \
        --seeds "${OUTPUT_REPRESENTATIVES}" \
        "${FINAL_FASTA}" 2> "${OUTPUT_LOG}"
}
```
The input file is a dereplicated fasta file with abundance annotations in the `;size=123[;]` style (option `--usearch-abundance`), and only ACGT nucleotides. The command line options listed in these shell functions are the most relevant, at least in my opinion.
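A plugin would probably want to check these input requirements before calling swarm. The sketch below is one way to do that in plain Python; the names `read_fasta` and `check_swarm_input` are hypothetical, and accepting lowercase nucleotides is my assumption, not something stated above.

```python
import re

# Header style described above: ">name;size=123" with an optional final ";"
HEADER = re.compile(r"^>(?P<name>\S+?);size=(?P<size>\d+);?$")
# Assumption: lowercase acgt is tolerated as well as ACGT
NUCLEOTIDES = re.compile(r"^[ACGTacgt]+$")


def read_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def check_swarm_input(lines):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    seen = set()
    for header, sequence in read_fasta(lines):
        if HEADER.match(header) is None:
            problems.append(f"{header}: missing ;size= annotation")
        if NUCLEOTIDES.match(sequence) is None:
            problems.append(f"{header}: sequence contains non-ACGT characters")
        if sequence.upper() in seen:
            problems.append(f"{header}: duplicated sequence (not dereplicated)")
        seen.add(sequence.upper())
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report everything wrong with a file at once.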
Thank you, this is extremely helpful!
I like the idea of starting small with the q2-swarm plugin.
> I only use `--differences 1`, with the `--fastidious` option when clustering the whole project (`clustering()`)...
Naturally!
I'm not sure how best to track feature counts through per-sample derep and clustering.
I understand what per-sample derep does and why it's faster to do this double-derep step. For some reason, I can't wrap my head around what q2-type it should be. Is this just another feature table? (I also had trouble with this last time.)
We could get counts for the feature table by remapping reads, like we did historically, but that loses the efficiency of the per-sample derep and ignores the internal structure of the swarms.
> I understand what per-sample derep does and why it's faster to do this double-derep step. For some reason, I can't wrap my head around what q2-type it should be. Is this just another feature table?
My pipeline must be confusing for anyone else but me, sorry about that. The loop processes each pair of fastq files in 6 steps:
1. `merge_fastq_pair` (R1-R2 merging with `vsearch`)
2. `trim_primers` (`cutadapt`)
3. `convert_fastq_to_fasta` (`vsearch`)
4. `extract_expected_error_values`
5. `dereplicate_fasta` (dereplication with `vsearch`)
6. `list_local_clusters` (local clustering with `swarm`)

The double-derep step allows me to keep track of the origin of each unique sequence. The fasta files are parsed when building the final occurrence table.
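To make the "keep track of the origin" idea concrete, here is a rough sketch of how per-sample dereplicated fasta files could be parsed into counts per unique sequence and per sample. It is not the actual occurrence-table script (which works at the cluster level); the function name `build_occurrence_table` is hypothetical, and it assumes one-line sequences and `;size=` annotations in the headers.

```python
import re

SIZE = re.compile(r";size=(\d+)")


def build_occurrence_table(samples):
    """Map each unique sequence to {sample name: abundance}.

    `samples` maps a sample name to an iterable of fasta lines from that
    sample's dereplicated file (assumption: sequences are not wrapped).
    """
    table = {}
    for sample, lines in samples.items():
        abundance = None
        for line in lines:
            line = line.strip()
            if line.startswith(">"):
                match = SIZE.search(line)
                # assumption: a header without ;size= counts as one read
                abundance = int(match.group(1)) if match else 1
            elif line and abundance is not None:
                counts = table.setdefault(line.upper(), {})
                counts[sample] = counts.get(sample, 0) + abundance
    return table
```

Summing these per-sequence rows over the members of each swarm would then give per-cluster counts without remapping reads.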
`extract_expected_error_values` produces a table containing the best (lowest) expected error observed for each unique sequence. This is also used to build the occurrence table, to filter out low quality observations (a cluster with high-EE seed sequence is discarded). The goal is to delay quality-based filtering, so it is performed after the clustering step.
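For readers less familiar with expected error values: the expected error of a read is the sum of its per-base error probabilities, each being 10^(-Q/10) for a Phred score Q. A minimal Python sketch of "keep the lowest EE per unique sequence" could look like this. It is not how the pipeline does it (that relies on `vsearch` and `sed`); the function names are hypothetical and a Phred+33 quality encoding is assumed.

```python
def expected_error(quality, offset=33):
    """Sum of per-base error probabilities for a Phred-encoded string."""
    return sum(10 ** (-(ord(char) - offset) / 10) for char in quality)


def best_expected_errors(records):
    """Return {sequence: lowest expected error seen for that sequence}.

    `records` is an iterable of (sequence, quality string) pairs, for
    example parsed from merged fastq files.
    """
    best = {}
    for sequence, quality in records:
        ee = expected_error(quality)
        if sequence not in best or ee < best[sequence]:
            best[sequence] = ee
    return best
```

Because only the minimum is stored per unique sequence, the table stays small and can be consulted after clustering to discard clusters whose seed has a high EE.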
`list_local_clusters` produces a table. The goal is to create a list of cluster seeds for each sample. This is used to cleave clusters into subclusters when the subclusters display different patterns of distribution. In practice, it allows us to distinguish entities with a single-nucleotide difference, but only when there is an ecological signal.
Qiime 2 now offers an interface for third-party plugins. The plugin creation does not seem complicated: the plugin is a python 3 wrapper presenting some or all the functionalities of swarm.
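As a starting point, the core of such a wrapper could be a function that assembles the swarm command line from Python arguments and hands it to `subprocess`. The sketch below uses only the swarm options that appear in the shell functions earlier in this thread; the function names, parameter names and defaults are hypothetical, and the QIIME 2 registration code is deliberately left out.

```python
import subprocess


def build_swarm_command(fasta, seeds, swarms, stats, differences=1,
                        fastidious=True, threads=1, swarm="swarm"):
    """Return the swarm command line as a list of strings."""
    # As far as I understand, --fastidious only applies to --differences 1
    if fastidious and differences != 1:
        raise ValueError("--fastidious requires --differences 1")
    command = [
        swarm,
        "--differences", str(differences),
        "--usearch-abundance",
        "--threads", str(threads),
        "--output-file", str(swarms),
        "--statistics-file", str(stats),
        "--seeds", str(seeds),
    ]
    if fastidious:
        command.append("--fastidious")
    command.append(str(fasta))
    return command


def run_swarm(**kwargs):
    """Run swarm and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_swarm_command(**kwargs), check=True)
```

Keeping the command construction separate from the call to `subprocess.run` means the option handling can be unit-tested without swarm being installed.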