Response to inline comments from Moritz

jfy133 commented 1 year ago

Due to the scale of the problem, taxonomic profiling remains an 'unresolved problem' in bioinformatics. !--# (Moritz): In my opinion, the more important part of the problem that remains unresolved is that the accuracy (and other confusion matrix metrics) of profilers is doubtful because we do not have any biological ground truth data. Only with simulated sequencing reads can we know the real abundances, but how well that translates to real world sequencing data of complex communities is not known. Additionally, DNA extraction protocols are biased and so, even for mock communities, we can't know if deviations in reported abundances are due to extraction, sequencing, or classification biases. -->

TODO I completely agree, I admit I was trying to use the 'scale of the problem' as a 'shallow' reason to segue into the next paragraph. I'll try to restructure this paragraph to emphasise the reasons for the diversity of tools. That said, I don't think the pipeline is solving any of those problems (other than maybe giving people more options for comparing), so I don't want to go into too much detail.

✅ Rephrased to note that we want to infer real abundances but that this is difficult

Additionally, particularly for very large sample sets, there is increasing use of cloud platforms that have greater scalability than traditional HPCs. Being able to reliably and reproducibly execute taxonomic classification tasks across infrastructure with minimal intervention would therefore be a boon for the metagenomics field. !--# (Moritz): Do we need to address here that, of course, other pipelines already exist? -->

✅ done

nf-core/taxprofiler utilises Nextflow [@Di_Tommaso2017-xu] to ensure efficiency, portability, and scalability, and has been developed within the nf-core initiative of Nextflow pipelines [@Ewels2020-vi] to ensure high quality coding practises and user accessibility, including detailed documentation and a graphical-user-interface (GUI) execution interface. !--# (Moritz): Mention nf-tower for the GUI as well? -->

✅ done

Per-classifier flags are also available for the optional saving of additional non-profile output files. !--# (Moritz): I typically use a YAML file to supply pipeline parameters and think that is better practice. I do understand that you want to show just one command here with all the options, though. -->

✅ done

nf-core/taxprofiler aims to support and include all established classification or profiling tools as requested by the community. !--# (Moritz): I was actually wondering about this the other day. I think there are two possible approaches: 1. We add all tools on a per request basis as stated here. Upside: Everybody's happy. Downside: Maintenance becomes ever more burdensome over time. Do we ever remove profilers? 2. The taxprofiler maintainers choose the "best" representative tool within a specific category. -->

✅ I think this is worth a separate discussion in person or on Slack. We don't have to keep updating versions if no one is interested in a tool; we have a fixed container, so that shouldn't be a problem, and I don't think the code around profilers will change much, so I don't see why we would need to remove stuff. I also find that most of the tools we've included are relatively 'static' and don't change much other than default databases. That said, we don't explicitly say we wouldn't remove stuff with this sentence ;) so I'll leave it as it is.

| Nucleotide | k-mer based | whole-genome | profiler | Bracken | !--# (Moritz): AFAIK Bracken redistributes reads assigned to LCAs down to species (or other chosen rank). So it is still providing sequence abundance rather than taxonomic abundance. -->

✅ this was already done with the specification of a 'profiler' rather than a classifier.

Thus, the need for highly-multiplexed classification is more desirable for the newer metagenomics methods. Despite this, tools such as METAXA2 [@Bengtsson-Palme2015-ar] that use shotgun sequencing reads to recover 16S sequences from metagenomic samples. !--# (Moritz): Last sentence feels tagged on and disconnected. -->

✅ it got lost, moved back to a relevant section and extended

portability [@Wratten2021-es]. After searching, we selected the following pipelines for comparison with nf-core/taxprofiler: sunbeam [v4; @Clarke2019-al], Unipro UGENE [v48; @Rose2019-jf], TAMA [githash: 3a22c8f; @Sim2020-ja], and StaG-mwc [0.7.0; @Boulund2023-ct]. !--# Should we mention also other pipelines that we found but excluded based on these criteria? -->

✅ I don't feel it's necessary; that we filtered down implies there are others out there.

Unipro UGENE is the only pipeline that supports execution on all three major operating systems (Linux, OSX, Windows), whereas StaG-mwc and nf-core/taxprofiler can be run on unix operating systems, and sunbeam and TAMA are only being supported on Linux. !--# (Moritz): Is this so? Nextflow requires bash, but there is git bash, anaconda terminal, and the WSL for Windows. So this seems a bit strict? -->

✅ Changed to 'explicit' support; WSL is possible but we don't test for it. UGENE explicitly says it supports Windows natively.

Most pipelines support some form of host removal (only TAMA did not support this), and it is likely possible with Unipro UGENE through user customisation of the workflow. !--# (Moritz): I don't think user customization should be mentioned. We can only compare what comes out-of-the-box. Otherwise you can argue that taxprofiler can be customized to do anything. -->

✅ 'user customisation' was maybe the wrong phrasing; basically it's sort of implied you could do it, but the documentation doesn't explicitly show it. Rephrased to make that clearer.

For output, nf-core/taxprofiler, StaG-mwc, and sunbeam (via an extension) support a singular run report for summarising all preprocessing step. Only nf-core/taxprofiler and TAMA produce standardised output for all taxonomic profilers (via TAXPASTA). !--# (Moritz): This makes it sound like TAMA also uses TAXPASTA. -->

✅ rephrased

The functionality offered by other pipelines not currently supported by nf-core/taxprofiler include sequencing saturation estimation (StaG-mwc), taxonomy-free composition comparison (StaG-mwc), functional profiling (StaG-mwc) !--# (Moritz): Deliberately considered out of scope for taxprofiler due to funcscan. -->, de novo assembly (sunbeam) !--# (Moritz): Deliberately considered out of scope for taxprofiler due to mag. --> , and reference mapping (StaG-mwc, sunbeam).

✅ explicitly stated

!--# (Moritz): I think this paragraph could be moved up where I noted that we might mention excluded pipelines. --> We note there exists a range of other pipelines that also include some form of taxonomic classification. However often these pipelines have been developed with a different main purpose (e.g. Assembly and binning for nf-core/mag [@Krakau2022-we], MetaWRAP [@Uritskiy2018-ut], SqueezeMeta [@Tamames2018-zq], or MEDUSA [@Morais2022-rt]; Metagenomic read alignment with CCMetaGen [@Marcelino2020-rg] and Wochenende [@Rosenboom2022-bt]).

Given there still are no curated, high-quality 'gold standard' databases in metagenomics, and while nf-core/taxprofiler allows the profiling against multiple databases and settings in parallel, currently the pipeline still requires users to construct these manually and to supply to the pipeline. While we feel this is currently a reasonable investment as such databases can be repeatedly re-used, we are exploring the possibility to add an additional complementary workflow in the pipeline to allow automated database construction of all classification tools, given a set of FASTA reference files. !--# (Moritz): I'm personally against including it in the same pipeline, but that's another discussion xD. -->

✅ Started on slack

jfy133 commented 1 year ago

All addressed (I hope)!

Midnighter commented 1 year ago

this was already done with the specification of a 'profiler' rather than a classifier.

My point is that Bracken is a classifier and not a profiler. It still reports sequence abundance, not taxonomic abundance.
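To make the distinction in this comment concrete, here is a minimal sketch, assuming two hypothetical species with made-up read counts and genome lengths. Sequence abundance is taken as the fraction of reads assigned to a taxon, while taxonomic abundance is approximated by first dividing reads by genome length so that larger genomes do not dominate. The function names and all numbers are illustrative assumptions, not output from any of the tools discussed.

```python
def sequence_abundance(read_counts):
    """Fraction of all assigned reads that belong to each taxon."""
    total = sum(read_counts.values())
    return {taxon: n / total for taxon, n in read_counts.items()}


def taxonomic_abundance(read_counts, genome_lengths):
    """Fraction of genome copies per taxon, approximated as reads per base."""
    coverage = {t: read_counts[t] / genome_lengths[t] for t in read_counts}
    total = sum(coverage.values())
    return {taxon: c / total for taxon, c in coverage.items()}


# Hypothetical example: species_a has a genome four times larger than species_b.
reads = {"species_a": 800, "species_b": 200}
lengths = {"species_a": 8_000_000, "species_b": 2_000_000}

seq = sequence_abundance(reads)
tax = taxonomic_abundance(reads, lengths)
print(seq)
print(tax)
```

In this toy case the two species contribute 80% and 20% of the reads, yet after length normalisation they are equally abundant, which is why the two quantities should not be used interchangeably.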

jfy133 commented 1 year ago

My reading/understanding of the section 'Classification versus abundance estimation' in https://peerj.com/articles/cs-104/ is that that is what it is doing?

Therefore, any assumption that Kraken’s raw read assignments can be directly translated into species- or strain-level abundance estimates (e.g.,  Schaeffer et al., 2015) is flawed, as ignoring reads at higher levels of the taxonomy will grossly underestimate some species, and creates the erroneous impression that Kraken’s assignments themselves were incorrect.

Nonetheless, metagenomics analysis often involves estimating the abundance of the species in a particular sample. Although we cannot unambiguously assign each read to a species, we would like to estimate how much of each species is present, specifically by estimating the number or percentage of reads in the sample. 

<...>

Rather than re-engineer Kraken to address the ambiguous read classification issue and to provide abundance estimates directly, we decided to implement the new species-level abundance estimation method described here as a separate program

Unless I'm misunderstanding what they mean by 'species abundance' (as it's never really defined...)?

jfy133 commented 1 year ago

Unless they mean that kraken2's sequence abundance is inaccurate, so they re-estimate species-level sequence abundance?

To compute species abundance, any genome-level (strain-level) reads are simply added together at the species level. In cases where only one genome from a given species is detected by Kraken in the dataset, we simply add the reads distributed downward from the genus level (and above) to the reads already assigned by Kraken to the species level. In cases where multiple genomes exist for a given species, the reads distributed to each genome are combined and added to the Kraken-assigned species level reads. The added reads give the final species-level abundance estimates.

Ugh terminology...

Midnighter commented 1 year ago

The added reads give the final species-level abundance estimates.

This is my understanding of Bracken. It is simply a redistribution of reads. If kraken2 had already assigned all reads at the species-level (hypothetically), then Bracken would make no further changes.

In my mind, the only tools I would call profilers are those using marker genes that give you a taxon abundance. All the other tools can only provide (protein) sequence abundance AFAIK and thus are classifiers.
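The "redistribution of reads" reading can be sketched as follows. This is a deliberately simplified illustration and not Bracken's actual estimator: reads held at genus level are pushed down to member species in proportion to the reads already assigned to each species, which is an assumption swapped in for the weighting described in the paper. The function name, the taxon-to-genus mapping, and all counts are hypothetical.

```python
def redistribute_to_species(species_reads, genus_reads, genus_of):
    """Push genus-level reads down to species, proportional to species-level reads."""
    estimates = {species: float(n) for species, n in species_reads.items()}
    for genus, n_genus in genus_reads.items():
        members = [s for s in species_reads if genus_of[s] == genus]
        total = sum(species_reads[s] for s in members)
        if total == 0:
            # No species-level signal to weight by; leave these reads out in this sketch.
            continue
        for s in members:
            estimates[s] += n_genus * species_reads[s] / total
    return estimates


# Hypothetical counts: 40 reads could only be placed at the genus Escherichia.
genus_of = {
    "E. coli": "Escherichia",
    "E. fergusonii": "Escherichia",
    "B. subtilis": "Bacillus",
}
species_reads = {"E. coli": 60, "E. fergusonii": 20, "B. subtilis": 20}

with_genus_reads = redistribute_to_species(species_reads, {"Escherichia": 40}, genus_of)
all_at_species = redistribute_to_species(species_reads, {}, genus_of)
print(with_genus_reads)
print(all_at_species)
```

Two things follow from the sketch: when no reads sit above species level the counts come back unchanged, and the output is still a read count per species, i.e. a sequence abundance rather than a taxon abundance.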

jfy133 commented 1 year ago

Ok fair enough, I see the logic there. I will update.

jfy133 commented 1 year ago

I've removed the classifier/profiler column from the table, and further tweaked the phrasing to hopefully make the distinction you made above. https://github.com/jfy133/taxprofiler-manuscript/commit/00f700f1dbf25c187140dbc183d42bfbe8e4c5ad

Midnighter commented 1 year ago

To be clear: I do not claim that my interpretation is the correct one. Hopefully, the changes that you made serve to avoid confusion. Cheers ☺️

jfy133 commented 1 year ago

Nope, it actually makes sense once I read more into exactly what it is doing, rather than just the general description :grimacing:

jfy133 / taxprofiler-manuscript

Response to inline comments from Moritz #12