nf-core / ampliseq

Amplicon sequencing analysis workflow using DADA2 and QIIME2
https://nf-co.re/ampliseq
MIT License

Proposal: Phyloseq R object creation at end of pipeline #612

Closed a4000 closed 1 year ago

a4000 commented 1 year ago

Description of feature

I can add a module to Ampliseq that would produce a phyloseq object. I think the main benefit of a phyloseq object comes from the fact that R is a popular language for data analysis. The object would have 3 or 4 elements: the sample metadata, the ASV count table (with ASVs as the row names and samples as the column names), a taxonomy table (with ASVs as the row names and taxonomy levels as the column names), and possibly also a phylogenetic tree.
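
As a rough, language-agnostic illustration of that layout (the actual module would build the object in R with phyloseq), the sketch below models the tables as plain Python dictionaries and checks that ASV IDs and sample names line up across them. The ASV IDs, sample names, values, and the helper `check_components` are all made up for illustration and are not part of the pipeline.

```python
# Hypothetical toy data; IDs and values are invented for illustration only.
# ASV count table: ASVs as rows, samples as columns.
counts = {
    "ASV_1": {"sampleA": 10, "sampleB": 0},
    "ASV_2": {"sampleA": 3, "sampleB": 7},
}
# Taxonomy table: ASVs as rows, taxonomy levels as columns.
taxonomy = {
    "ASV_1": {"Kingdom": "Bacteria", "Phylum": "Firmicutes"},
    "ASV_2": {"Kingdom": "Bacteria", "Phylum": "Proteobacteria"},
}
# Sample metadata: samples as rows (an optional component).
metadata = {
    "sampleA": {"treatment": "control"},
    "sampleB": {"treatment": "treated"},
}


def check_components(counts, taxonomy, metadata=None):
    """Return a list of mismatches between the tables' row/column names."""
    problems = []
    if set(counts) != set(taxonomy):
        problems.append("ASV IDs differ between count and taxonomy tables")
    if metadata is not None:
        samples = set()
        for row in counts.values():
            samples.update(row)
        if samples != set(metadata):
            problems.append("sample names differ between count table and metadata")
    return problems


print(check_components(counts, taxonomy, metadata))
```

A check like this is only a sanity test before handing the tables to R; it says nothing about how phyloseq itself reconciles mismatched names.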

d4straub commented 1 year ago

Yes, that would be handy! Edit: I forgot to ask, are 2 elements also possible (i.e. without metadata & tree) for a phyloseq object? The use case would be someone who skips downstream analysis in ampliseq by not supplying metadata and wants to do downstream analysis outside of ampliseq using the phyloseq object.

a4000 commented 1 year ago

I'm not 100% sure if phyloseq allows creating an object without the metadata (I can test that tomorrow), but if it doesn't, I could maybe create a dummy metadata sheet with the sample names. I do know that it is possible to create the object without a tree, so that shouldn't be an issue.
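
A minimal sketch of the dummy metadata idea, assuming a tab-separated count table whose first column holds ASV IDs and whose remaining column headers are sample names. The column names `ASV_ID` and `ID`, the sample names, and the helper `dummy_metadata` are assumptions for illustration, not the pipeline's actual file layout.

```python
import csv
import io

# Hypothetical ASV count table (tab-separated); the first column holds ASV IDs,
# the remaining column headers are sample names.
count_tsv = "ASV_ID\tsampleA\tsampleB\nASV_1\t10\t0\nASV_2\t3\t7\n"


def dummy_metadata(count_table_text):
    """Build a one-column metadata table (sample IDs only) from the header."""
    reader = csv.reader(io.StringIO(count_table_text), delimiter="\t")
    header = next(reader)
    sample_names = header[1:]  # skip the ASV ID column
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["ID"])  # assumed name for the sample ID column
    for name in sample_names:
        writer.writerow([name])
    return out.getvalue()


print(dummy_metadata(count_tsv))
```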

cpauvert commented 1 year ago

> I'm not 100% sure if phyloseq allows creating an object without the metadata (I can test that tomorrow), but if it doesn't, I could maybe create a dummy metadata sheet with the sample names. I do know that it is possible to create the object without a tree, so that shouldn't be an issue.

Yes, it is possible to create a phyloseq object starting from any of the component classes (OTU, sample, tree, taxonomy). The example for the constructor function does not have metadata ;)

a4000 commented 1 year ago

I've decided the first thing I'll add to Ampliseq is the phylogenetic tree because I figured it would be an easy place to start while I familiarise myself more with Ampliseq. I have some questions for the Ampliseq team to make sure I'm on the right track with implementing this feature.

For the table of ASV counts, I'm planning on using the table found in ch_dada2_asv, though I could also use the filtered table in QIIME2_FILTERTAXA.out.tsv if that table exists. Are there other count tables I should consider using?

From what I can tell, the pipeline can produce 1 or more taxonomy tables. It should be easy enough to produce multiple phyloseq objects depending on which taxa tables exist. I've found two taxa tables in the pipeline that are already in the format I need, in ch_dada2_tax and ch_sintax_tax. I've found two other taxa tables in slightly different formats, in ch_pplace_tax and QIIME2_TAXONOMY.out.tsv. Should I add modules to reformat these tables, and are there other tables I haven't found yet?

I've found a nwk phylogenetic tree produced by the pipeline in FASTA_NEWICK_EPANG_GAPPA.out.grafted_phylogeny. I believe there's also a tree produced by QIIME? The problem is that this tree has taxonomy names as the tip labels of the tree while phyloseq expects the tip labels to match the ASV names. I'm wondering if the pipeline produces a tree with those ASV names as the tip labels, or if adding the tree to the object will require a different plan (e.g., producing a different tree that has those ASV names)?

a4000 commented 1 year ago

Actually, I tried running the pipeline with my phyloseq module and it successfully added the tree, so you can probably disregard that point.

To elaborate on my point about the tax tables: the dada2 and Sintax tax tables have the different tax levels as columns, while the other two tax tables just have one taxonomy column. The phyloseq object will still be created, but it might be beneficial to have the tax tables in a more consistent format.

d4straub commented 1 year ago

> For the table of ASV counts, I'm planning on using the table found in ch_dada2_asv, though I could also use the filtered table in QIIME2_FILTERTAXA.out.tsv if that table exists. Are there other count tables I should consider using?

Essentially, the ASV count table is produced by DADA2 and then optionally filtered by some custom filter scripts, ending up in ch_dada2_asv in all cases. It is then optionally filtered by QIIME2 and exported as TSV as ch_tsv = QIIME2_FILTERTAXA.out.tsv. So yes, those two should do it.

> From what I can tell, the pipeline can produce 1 or more taxonomy tables. It should be easy enough to produce multiple phyloseq objects depending on which taxa tables exist. I've found two taxa tables in the pipeline that are already in the format I need, in ch_dada2_tax and ch_sintax_tax. I've found two other taxa tables in slightly different formats, in ch_pplace_tax and QIIME2_TAXONOMY.out.tsv. Should I add modules to reformat these tables, and are there other tables I haven't found yet?

Four tools are currently able to produce tax tables. One particular tax table is chosen for downstream analysis and is imported into QIIME2 (if it is run) as ch_tax. All tax tables are used by QIIME2 to export aggregated and merged (ASV count & tax) tables. I imagine something similar could be done for phyloseq objects (outside of QIIME2, obviously). About reformatting: I think the taxonomy string always uses ; as the separator, if it is separated at all. I believe QIIME2 only accepts tax tables with ;-separated tax levels, meaning QIIME2 needs that format. One module should be enough to do the reformatting you need, I assume.
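
To make the reformatting step concrete, here is a minimal sketch of splitting a single ;-separated taxonomy string into one column per level. The level names in `LEVELS` and the helper `split_taxonomy` are assumptions for illustration; the real levels depend on the reference database, and an actual module for this pipeline would more likely be written in R.

```python
# Level names are an assumption; the real set depends on the reference database.
LEVELS = ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]


def split_taxonomy(tax_string, levels=LEVELS, sep=";"):
    """Split one ';'-separated taxonomy string into a dict of level -> name.

    Missing trailing levels are filled with empty strings; extra fields beyond
    the known levels are dropped.
    """
    parts = [p.strip() for p in tax_string.split(sep)]
    parts = parts[: len(levels)]
    parts += [""] * (len(levels) - len(parts))
    return dict(zip(levels, parts))


row = split_taxonomy("Bacteria;Firmicutes;Bacilli")
print(row["Phylum"])
```

Padding missing levels with empty strings keeps every row the same width, which is what a table with one column per taxonomy level needs.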

d4straub commented 1 year ago

Is that issue solved by https://github.com/nf-core/ampliseq/pull/615? If yes, please close it.