qiime2 / q2-taxa

BSD 3-Clause "New" or "Revised" License

Collapse will create different features for the same level, because of non rank-aware padding #137

Open FranckLejzerowicz opened 3 years ago

FranckLejzerowicz commented 3 years ago

Improvement Description

Taxon path padding can be made rank-aware to avoid the following situation. Say you have taxonomic classifications such as:

k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales
k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__

(The first is assigned with minimum confidence only down to o__Clostridiales; the second is an unassigned species of Clostridiales: o__Clostridiales; f__; g__; s__.)

Current Behavior

Collapsing the above example to genus does not collapse both entries to the same feature, but results in two separate features:

k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;__;__
k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__; g__
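The behavior described above can be reproduced with a minimal sketch. This is not the q2-taxa source: `collapse_label` and its parameters are hypothetical names, and the logic only assumes what this issue describes (split on `;`, pad missing ranks with a bare `__`, truncate to the requested level).

```python
def collapse_label(taxon: str, level: int, max_observed_level: int) -> str:
    """Pad a ';'-separated taxon with '__' and truncate it to `level` ranks."""
    ranks = [r.strip() for r in taxon.split(";")]
    # Rank-unaware padding: every missing rank becomes a bare '__'.
    ranks += ["__"] * (max_observed_level - len(ranks))
    return ";".join(ranks[:level])


short = "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales"
full = "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__"

# Level 6 is genus when the ranks are k, p, c, o, f, g, s.
genus_short = collapse_label(short, level=6, max_observed_level=7)
genus_full = collapse_label(full, level=6, max_observed_level=7)
print(genus_short)
print(genus_full)
```

The two labels differ only in their padding (`__;__` versus `f__;g__`), which is why they end up as two features.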

Proposed Behavior

Make both of these collapse to the same taxon, which for the above would be:

k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__; g__

Note: this would only be feasible if the ranks are homogeneous (which is the case for e.g. GreenGenes and other good databases).
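A rough sketch of what rank-aware padding could look like, assuming the rank prefixes are known in advance. The `RANK_PREFIXES` list (Greengenes-style single letters) and the function name `collapse_rank_aware` are illustrative assumptions, not part of q2-taxa or of the linked PR.

```python
# Assumption: Greengenes-style single-letter rank prefixes, in rank order.
RANK_PREFIXES = ["k", "p", "c", "o", "f", "g", "s"]


def collapse_rank_aware(taxon: str, level: int) -> str:
    """Pad missing ranks with their own empty label (e.g. 'f__'), then truncate."""
    if not 1 <= level <= len(RANK_PREFIXES):
        raise ValueError("level must be between 1 and %d" % len(RANK_PREFIXES))
    ranks = [r.strip() for r in taxon.split(";")]
    # Only ranks between the taxon's current depth and `level` need padding.
    for prefix in RANK_PREFIXES[len(ranks):level]:
        ranks.append(prefix + "__")
    return ";".join(ranks[:level])


short = "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales"
full = "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__"

aware_short = collapse_rank_aware(short, level=6)
aware_full = collapse_rank_aware(full, level=6)
print(aware_short == aware_full)
```

With this padding both inputs produce the same genus-level label, so a subsequent collapse would merge them.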

Comments

The current padding happens in a function nested within the _collapse_table() function of _util.py. I'd say this could be promoted to a function with a larger scope, controlled by a command-line parameter, so that users can decide whether they prefer to remove taxa that would need padding. For example, one may want to get rid of everything not annotated to genus after collapsing to genus (in the above example, both entries would be deleted).

Note: I am making a PR for this - see below

nbokulich commented 3 years ago

this is really quite specific to greengenes... other databases usually do not have empty annotations, and might use different rank padding conventions.

Furthermore, is this really desirable? collapsing features assigned to different taxonomic ranks seems to make many assumptions. For the sake of plotting, it might be convenient to collapse the following, but in terms of interpretation these taxonomies could mean very different things:

k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales
k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__

i.e., features assigned these different taxonomies should not be considered the same at family or genus level... it might be convenient to lump these together as "all clostridiales with unknown family and genus", but these could in fact belong to very different groups, so lumping together could smooth over important differences.

FranckLejzerowicz commented 3 years ago

Hello @nbokulich,

In the proposed PR, the padding would only happen if there is a rank convention for the database used for classification, in particular, only for ;-separated taxonomic fields that have homogeneous rank labels throughout the taxonomy.

Many databases follow the convention of "aligned" rank names, e.g. the single letter for greengenes and PR2, or none for SILVA (taxonomy dump lookup). In case of no convention, note that this PR would do the current padding (adding __ down to the collapsing level).

One caveat: the rank is inferred using the string before the first _ character. If no _ character is present, the padding would remain __ (current behaviour). However, an unusual taxonomy would create unusual padding, e.g. this unlikely, poor taxonomy:

taxon_name1; taxon_name1.1; taxon_name1.1.1
taxon_name1; taxon_name1.2; taxon_name1.2.1
taxon_name2; taxon_name2.1

would pad to:

taxon_taxon_name1; taxon_taxon_name1.1; taxon_taxon_name1.1.1
taxon_taxon_name1; taxon_taxon_name1.2; taxon_taxon_name1.2.1
taxon_taxon_name2; taxon_taxon_name2.1; taxon_
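To make the caveat concrete, here is a small sketch of the inference rule as described in this comment: the rank is taken as the text before the first `_`, and the convention only counts if every taxon agrees at each depth. The function name `infer_rank_prefixes` is hypothetical, and the PR's actual implementation may differ.

```python
from typing import List, Optional


def infer_rank_prefixes(taxa: List[str]) -> Optional[List[str]]:
    """Return one rank prefix per depth, or None if no consistent convention exists."""
    prefixes: List[str] = []
    for taxon in taxa:
        for depth, label in enumerate(r.strip() for r in taxon.split(";")):
            prefix, sep, _ = label.partition("_")
            if not sep:
                return None  # no '_' at all: fall back to plain '__' padding
            if depth == len(prefixes):
                prefixes.append(prefix)  # first time this depth is seen
            elif prefixes[depth] != prefix:
                return None  # ranks are not homogeneous at this depth
    return prefixes


greengenes = [
    "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales",
    "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__",
]
odd = [
    "taxon_name1; taxon_name1.1; taxon_name1.1.1",
    "taxon_name2; taxon_name2.1",
]
print(infer_rank_prefixes(greengenes))
print(infer_rank_prefixes(odd))
print(infer_rank_prefixes(["Bacteria; Firmicutes"]))
```

For the odd taxonomy, the rule finds a spurious "taxon" rank at every depth, which is the kind of edge case a robust implementation would need to reject.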

The code can be made robust to such edge cases, so I'd say that having the possibility to fix the above issue is indeed desirable. Here are a few thoughts about your points:

In fact, the real issue is that QIIME2 creates features with empty taxonomic ranks (__ padding) for sequences not assigned down to the collapsing rank. Let's illustrate with another example, collapsing at genus level. For these ASVs:

ASV1    k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales
ASV2    k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales
ASV3    k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales
ASV4    k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__

the following padded labels will first be created:

ASV1    k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; __; __
ASV2    k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; __; __
ASV3    k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; __; __
ASV4    k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__

and then, after the collapse, the following features are obtained:

k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; __; __
k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__

In this case, the three different ASVs that were only assigned to o__Clostridiales - and thus that are potentially novel, i.e. unexpected in the microbial system - would be lumped together, creating the issue you highlighted.

I agree that padding should be avoided when the ranks do not exist in the first place (and notably if they exist elsewhere for "unassigned" assignments), but isn't it an issue that created features such as k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; __; __ are confusing and should be discarded instead?
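One way to picture the "discard instead of pad" alternative is the sketch below. It is only an illustration under stated assumptions: `is_assigned_at` and `drop_unassigned` are hypothetical helpers, a rank counts as assigned only if some name follows the last `_` of its label, and ASV5 is an invented entry added so that something survives the filter.

```python
from typing import Dict


def is_assigned_at(taxon: str, level: int) -> bool:
    """True if the taxon has a named (non-empty) label at the given rank level."""
    ranks = [r.strip() for r in taxon.split(";")]
    if len(ranks) < level:
        return False  # this taxon would need padding to reach `level`
    # Assumption: 'g__Name' style labels; the name is whatever follows the last '_'.
    name = ranks[level - 1].rpartition("_")[2]
    return bool(name)


def drop_unassigned(taxonomy: Dict[str, str], level: int) -> Dict[str, str]:
    """Keep only the features annotated down to `level`."""
    return {fid: t for fid, t in taxonomy.items() if is_assigned_at(t, level)}


order = "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales"
taxonomy = {
    "ASV1": order,
    "ASV4": order + "; f__; g__; s__",
    # Hypothetical entry, for illustration only.
    "ASV5": order + "; f__Clostridiaceae; g__Clostridium",
}
kept = drop_unassigned(taxonomy, level=6)
print(sorted(kept))
```

Here both the order-only ASV and the one with empty `f__; g__` labels are removed at genus level, so no padded feature is ever created.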

nbokulich commented 3 years ago

hey @FranckLejzerowicz, your explanation above makes it clear that the PR connected to this would cause more problems than it would solve, as this makes assumptions about taxonomic conventions that are not universal. Semicolon-delimited taxonomy is the only convention you mention that is followed by Q2.

I recommend creating a separate plugin for this action (q2-greengenes?), but curious what others think

FranckLejzerowicz commented 3 years ago

I did try to identify where this would break (and how it could be improved to solve the issue). Note that if even one semicolon-delimited field has a prefix (the characters before its first underscore) that differs from the others at the same rank, this PR would do nothing. Hence, the concern about assumptions about taxonomic conventions that are not universal could be addressed by adding a parameter:

  --p-conventional-ranks / --p-no-conventional-ranks
                         The taxonomy is labeled with conventional ranks, e.g. k__Bacteria
                         (and not just Bacteria).                              [default: False]

Since I agree with your first point above on lumping different things, the issue remains that users would get features created by padding to the collapsing level.

i.e., collapsing

ASV1   k__Bacteria
ASV2   k__Bacteria
ASV3   k__Bacteria
ASV4   k__Bacteria
ASV5   k__Bacteria
ASV6   k__Bacteria

at genus level would yield:

ASV1   k__Bacteria;__;__;__;__
ASV2   k__Bacteria;__;__;__;__
ASV3   k__Bacteria;__;__;__;__
ASV4   k__Bacteria;__;__;__;__
ASV5   k__Bacteria;__;__;__;__
ASV6   k__Bacteria;__;__;__;__

Collapsing to:

k__Bacteria;__;__;__;__

Not sure the user readily understands the assumptions behind this feature, vs. k__Bacteria;p__;__;__;__, k__Bacteria;p__;o__;__;__, or k__Bacteria;p__;o__;f__;__.

Sorry for the long messages and for wasting your time if this is not relevant.