Allow criteria other than abundance when merging taxa

bobthacker commented 3 years ago

Thanks for this great resource! The current merge_taxa_vec function names the new taxa according to the most abundant taxon in each group or the first taxon in each group. In some cases, merged taxa have reference sequences of differing lengths in the phyloseq object. In these cases, users might want to name the new taxa to retain only the shortest of the merged sequences (or potentially only the longest of the merged sequences), regardless of abundance. Could an option be added near the block 126-144 to evaluate length of the reference sequences?

mikemc commented 3 years ago

Hi @bobthacker, thanks for your suggestion. I agree that other options for how the new "taxa" are created would be useful, but need to think a bit about the best way to do this and so have been defaulting to the phyloseq "archetype" method. The choice may be a bit more complicated than it seems because we don't just need to determine the new taxon name, we also need to determine how to set its corresponding value in the tax_table, phy_tree, and refseq slots. By far the simplest way to do this is to have one "archetype" taxon for each group (set of taxa to be merged, which is used for all values and the taxon name (the current strategy). But all sorts of alternatives are possible. For example, there is no general reason for the sequence to be the taxon name (I typically set the refseq slot to track the sequences, and set taxon names to e.g., ASV1, ASV2, etc.), and you could imagine using names such as OTU1, OTU2, etc for the new taxa names but making the reference sequences equal to a consensus sequence.

Can you confirm that you want the shortest and/or longest name'd (or refseq'd?) taxon to be the archetype for everything (taxonomy, phylogeny, and refseq slot), not just the new name?

A sensible enhancement that might work in your case is to allow the user to specify the desired archetypes in a custom fashion, rather than always using the most abundant or first taxon. Currently, the values of the "group" vector are only used to define the grouping, but these could be set to the desired group archetype (in your case, the taxon with the shortest sequence). I would then add an option archetype with possible values ("max", "first", and "group"), with the "group" option choosing the archetype from the group names.

It would still be up to you to set the group vector to your desired archetype taxa - I could drum up a code snippet to help with this.

I'm hesitant to hard-code your specific case since right now I can't imagine many people wanting the shortest or longest sequence rather than the most abundant sequence or a consensus sequence. (Others please chime in if this is a feature you want.)

I'll have to think more if there is another sensible way to make your task easier but as a special case of in a more generally useful way. For example, we could allow the user to supply a function that could be applied to either the taxa names or refseq sequences for the group and pull out one taxon to use as the archetype. E.g. In your case, that function could be something like function(x) which.min(nchar(x)) if applied to taxa names and function(x) which.min(width(x)) if applied to refseq sequences.

bobthacker commented 3 years ago

Hi @mikemc, Yes, if the criterion were sequence length, then for example the shortest base pair sequence would then be the taxon used as the archetype for everything, if users are super-conservative about sequence information content; alternatively the longest base pair sequence if users are willing to be riskier but maximize information content. I agree that having some sort of custom function would be the best option. There might be different reasons for choosing different criteria. One example would be taking new data and merging it with a "gold standard" or reference data set. In that case, there might be a sample_data column that indicates something is from the trusted reference data set, and that might be selected for the archetype. Another example is selecting the taxa with the lowest level of taxonomic classification. Since there are a lot of ways that users might evaluate what makes the best archetype, having the ability to supply a function to make that decision would be very useful, while keeping a default selection of abundance.

mikemc / speedyseq

Allow criteria other than abundance when merging taxa #58