joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/
579 stars 188 forks source link

Mixed level taxa label for plots #213

Closed Mikeyj closed 11 years ago

Mikeyj commented 11 years ago

Hi Joey,

suggestion for consideration rather than bug.

Sometimes if I label OTUs on a plot using Genus there are labels missing due to that particular taxonomic level being absent for that OTU. In addition it is also possible to have multiple distinct OTUs with the same name. This can make referencing particular OTUs of interest difficult in the text of papers.

How I've got round this in the past is to create a unique name for each OTU that includes the OTU number. In addition I use the best taxonomic level available for any particular OTU. For example:

#Convert a .biom format to old style OTU table (this is how we got stuff into R before your nice .biom reading package :)
convert_biom.py -i otu_table.biom -o otu_table.txt -b --header_key taxonomy --output_metadata_id "ConsensusLineage"
#Read in the OTU table, ignoring the commented lines (beginning with hash) and skipping the first line:
otu = read.table("otu_table.txt", sep = "\t", row.names=1, header=T, check.names=F, comment.char="", skip = 1)
#Install the stringr package, if necessary, and load the library
install.packages("stringr")
library(stringr)
#Extract the final taxonomic level (here referenced by 'last word after final semicolon') and combine with the OTU number (row name) to create unique and informative ID. I think the semicolon is standard, even if you use a retrained RDP classifier. Could be improved to ignore "unclassified" and go to next level up?
OTUID = str_c(str_extract(otu$ConsensusLineage, "[^;]+$"),"_",row.names(otu))

In the current scheme your OTUs at the Genus level would be called:

gPseudmonas gPseudmonas

g__Pseudomonas Whereas with the above you could have g__Pseudomonas_321 g__Pseudomonas_197 f__Pseudmonadaceae_591 s__Psedumonas_aeruginosa_101 s_Unclassified_2 Each has a unique name and the maximum tax info available. I need to tweak things to skip the unlcassified level and go back up one. I think the above will work with other taxonomy schemes as well, but the Unclassified issue may not. This would have to be a labelling scheme that existed alongside the current scheme as it's not useful for subsetting at particular levels, only for labelling plots informatively.
joey711 commented 11 years ago

@Mikeyj

So if you were going to use the current framework in phyloseq, which I obviously recommend O:-), you could simply do a row-wise concatenation of the relevant taxonomic information you're after.

You can use the import infrastructure in phyloseq to parse the taxonomy for you. The r__ prefixes in front of your taxonomic elements should be trimmed off, since their position in the taxonomy table already indicates the rank. This is done automatically when you specify parse_taxonomy_greengenes as the parseFunction argument in import_biom.

Try this reproducible example for concatenating the taxonomy as a particular string for plotting. This example contains only 10 taxa so it is fast to reproduce and easy to read, even on a tree created at the end. The first line of the multi-line plot_tree call includes the argument adding referring to the new concatenated dummy "rank" that I added to the taxonomy table of the phyloseq object, x10.

library("phyloseq")
data("GlobalPatterns")
x10 = prune_taxa(tail(names(sort(taxa_sums(GlobalPatterns))), 10), GlobalPatterns)
# Fix the node labels for the sake of the tree at the end:
phy_tree(x10)$node.label <- gsub("\\.[[:alnum:]]+$", "", phy_tree(x10)$node.label)
# Add a new rank, Strain, with the OTU ids
tax_table(x10) <- cbind(tax_table(x10), Strain=taxa_names(x10))
# Define the ranks you want to include
myranks = c("Genus", "Species", "Strain")
mylabels = apply(tax_table(x10)[, myranks], 1, paste, sep="", collapse="_")
# Add concatenated labels as a new rank after strain
tax_table(x10) <- cbind(tax_table(x10), catglab=mylabels)
# Check this out on a tree
plot_tree(x10, label.tips="catglab", 
                    color="SampleType", shape="Phylum", nodelabf=nodeplotboot(),
                    ladderize="left", plot.margin=2.15, size="abundance")

Does this accomplish what you were asking about?

By the way, the complete "upstream" taxonomy is utilized when performing taxonomy based OTU merges with the tax_glom function. Can be very useful...

joey711 commented 11 years ago

@Mikeyj

I think I answer your question with this example, and existing capabilities in phyloseq. I can re-open this issue if there is still something lingering.

Cheers

joey

Mikeyj commented 11 years ago

Hi Joey,

apologies for not responding to this more quickly, I get to do analysis in fits and starts!

With your above example the output isn't ideal. The tax table of object x10 looks like this:

> tax_table(x10)
Taxonomy Table:     [10 taxa by 9 taxonomic ranks]:
       Kingdom    Phylum           Class                 Order               Family              
329744 "Bacteria" "Actinobacteria" "Actinobacteria"      "Actinomycetales"   "ACK-M1"            
317182 "Bacteria" "Cyanobacteria"  "Chloroplast"         "Stramenopiles"     NA                  
549656 "Bacteria" "Cyanobacteria"  "Chloroplast"         "Stramenopiles"     NA                  
279599 "Bacteria" "Cyanobacteria"  "Nostocophycideae"    "Nostocales"        "Nostocaceae"       
360229 "Bacteria" "Proteobacteria" "Betaproteobacteria"  "Neisseriales"      "Neisseriaceae"     
94166  "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Pasteurellales"    "Pasteurellaceae"   
550960 "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Enterobacteriales" "Enterobacteriaceae"
158660 "Bacteria" "Bacteroidetes"  "Bacteroidia"         "Bacteroidales"     "Bacteroidaceae"    
331820 "Bacteria" "Bacteroidetes"  "Bacteroidia"         "Bacteroidales"     "Bacteroidaceae"    
189047 "Bacteria" "Firmicutes"     "Clostridia"          "Clostridiales"     "Ruminococcaceae"   
       Genus            Species                     Strain  
329744 NA               NA                          "329744"
317182 NA               NA                          "317182"
549656 NA               NA                          "549656"
279599 "Dolichospermum" NA                          "279599"
360229 "Neisseria"      NA                          "360229"
94166  "Haemophilus"    "Haemophilusparainfluenzae" "94166" 
550960 "Providencia"    NA                          "550960"
158660 "Bacteroides"    NA                          "158660"
331820 "Bacteroides"    NA                          "331820"
189047 NA               NA                          "189047"
       catglab                                      
329744 "NA_NA_329744"                               
317182 "NA_NA_317182"                               
549656 "NA_NA_549656"                               
279599 "Dolichospermum_NA_279599"                   
360229 "Neisseria_NA_360229"                        
94166  "Haemophilus_Haemophilusparainfluenzae_94166"
550960 "Providencia_NA_550960"                      
158660 "Bacteroides_NA_158660"                      
331820 "Bacteroides_NA_331820"                      
189047 "NA_NA_189047"    

With the last level, catglab containing the new tree labels. The first three and last OTUs have uninformative names and the rest pick up an NA. The actual output that you'd want for the labels would be something like:

ACK-M1_329744
Stramenopiles_317182
Stramenopiles_549656
Dolichospermum_279599
Neisseria_360229
Haemophilus_parainfluenzae_94166
Providencia_550960
Bacteroides_158660
Bacteroides_331820
Ruminococcaceae_189047

Different phylogenetic levels are used for the name each time, so the rule I think would be:

If genus and species present, use genus_species_OTUnumber If absent use family_OTUnumber If absent use order_OTUnumber if absent use class_OTUnumber if absent use phylum_OTUnumber if absent use unclassified_OTUnumber

Resulting in a tree with all unique names, where OTUs can be referenced and all names are as informative as your identification allows.

It would be even better if genus and species were also italicised, but that's just window dressing!

Cheers

Mike

joey711 commented 11 years ago

@Mikeyj

Can you clarify what you are asking for? The object printed to screen from tax_table(x10) in your example is just that: the stuff printed to screen. It's mean for you to interact with the data, not to create tables. R already includes many tools for creating "nice" tables from matrices, which the tax_table(x10) returns, or you can coerce explicitly with as(tax_table(x10), "matrix").

Similarly, there are several options for custom-parsing the rank names or multiple rank names, using standard existing R string handling tools like paste, paste0, gsub, etc. What you appear to be asking for is something specific to the way you want to represent your data taxonomic classification data in a plot, and you are almost there. If there is something that phyloseq could provide that would be generally useful, I am happy to consider it... Are you asking for special node labels? Many folks want to store bootstrap values in their node labels, and the taxonomic classification data is already stored in the table. I'm not sure what is gained by adding it to the tree. It won't be legible.

Mikeyj commented 11 years ago

So I've made little progress with this - documenting in this thread in case it's useful for other users (or I'm tackling this in a stupid way!) if I don't get a chance to finish it for a while.

What I would like is an additional column in the taxa table that has a nice (and unique) OTU name suitable for use in all plots (not just the phylogenetic tree) that refer to an OTU by name. The difficulty is that OTUs won't all have been identified to the same level and there are likely to be plenty of NAs that ruin the niceness.

Using Joey's myranks/catglab code above I've delved into the world of if/else and attempted to create a rule for selecting the best identification that isn't NA for each OTU and use that with Strain (OTU ID) as the nice name. Not currently functional :)

myranks = c(if("Species"!=NA){
    "Species"
} else if("Genus"!=NA){
    "Genus"
} else if("Family"!=NA{
    "Family"
} else if("Order"!=NA){
    "Order"
} else if("Class"!=NA){
    "Class"
} else ("Phylum"!=NA){
    "Phylum"
}, "Strain")
joey711 commented 11 years ago

@Mikeyj

I see what you're trying to do. One issue is that you shouldn't be hard-coding the rank names. There's no requirement in phyloseq that the tax_table ranks have specific column names. This is important so that it can be used in different datasets with different phylogenetic resolutions, and without any particular imposition about rank labels. In this way it also supports other forms of categorical information on taxa.

Mikeyj commented 11 years ago

Thanks Joey - this is stretching my skills a bit, so will hard code initially and then replace with references once I've got my brain round the process. I don't think the above strategy works, I think I need to do this using gsub and apply, outputting another new column that I can cbind to the tax_table.

aaronsaunders commented 11 years ago

I use these functions. I use the greengenes prefixes, hence the if statement to make the tax_name "NA" if it is an empty greengenes prefix. It gets around the whole number of levels and levels names thing.

If I give it a vector of OTU numbers it returns the best informative tax assignment (with OTU numbers as names).

Probably inefficient, but it seems to give me the results I am after...

Aaron

makeTaxLabels <- function(OTUs, mydata){
  # Vector wrapper for "makeLabel" 
  # Makes a string label using the lowest informative tax level
  #
  # Args:
  #   OTUs: vector of OTU numbers
  #   mydata: the phyloseq object with the tax table
  #
  # Returns:
  #   a vector of a tax names
  sapply(OTUs, FUN=makeTaxLabel, mydata)
}

makeTaxLabel <- function(OTU, mydata){
  # Makes a string label using the lowest informative tax level
  #
  # Args:
  #   OTU: OTU number
  #   mydata: the phyloseq object with the tax table
  #
  # Returns:
  #   a tax name
  taxstrings <- as.character(taxTab(mydata)[OTU])
  empty_strings <- c("k__", "p__", "c__", "o__", "f__", "g__", "s__")
  tax_name <- NA
  tax_level <- length(taxstrings)
  while(is.na(tax_name) | (tax_level == 0) ){
    tax_name <- taxstrings[tax_level]
    if (tax_name %in% empty_strings) {
      tax_level <- tax_level - 1
      tax_name <- NA
    }
  }
  tax_name <- as.character(tax_name)
  names(tax_name) <- OTU
  tax_name
}
aaronsaunders commented 11 years ago

I have tried to use the previous code it breaks sometimes. Maybe this is better. I also removed the second last names() line as it doubles up on the names when run through sapply.

There should really be a test dataset... It is slow too. probably the while loop?

Aaron


makeTaxLabel <- function(OTU, mydata){
    # Makes a string label using the lowest informative tax level
    #
    # Args:
    #   OTU: OTU number
    #   mydata: the phyloseq object with the tax table
    #
    # Returns:
    #   a tax name
    OTU <- as.character(OTU)  # the OTU numbers are stored as character not integer!
    taxstrings <- as.character(taxTab(mydata)[OTU])
    empty_strings <- c("k__", "p__", "c__", "o__", "f__", "g__", "s__")
    tax_name <- NA
    tax_level <- length(taxstrings)  # start at lowest tax level

    while(is.na(tax_name) | 
              (tax_name %in% empty_strings)){
        tax_name  <- taxstrings[tax_level]
        tax_level <- tax_level -1
    }
    tax_name
}
Mikeyj commented 11 years ago

Thanks Aaron,

that's got me onto the right track - I have taken your nice loop and added some ugly stupid stuff on the end that achieves what I want. Here goes:

My taxa table looks like this as it's from Silva IDs

> tax_table(x10)
Taxonomy Table:     [10 taxa by 6 taxonomic ranks]:
    Phylum              Class                    Order                Family               
137 "p__Bacteroidetes"  "c__Bacteroidia"         "o__Bacteroidales"   "f__Prevotellaceae"  
619 "p__Actinobacteria" "c__Actinobacteria"      "o__Micrococcales"   "f__Micrococcaceae"  
945 "p__Proteobacteria" "c__Gammaproteobacteria" "o__Pseudomonadales" "f__Moraxellaceae"   
688 "p__Proteobacteria" "c__Betaproteobacteria"  "o__Neisseriales"    "f__Neisseriaceae"   
915 "p__Proteobacteria" "c__Gammaproteobacteria" "o__Pseudomonadales" "f__Pseudomonadaceae"
542 "p__Proteobacteria" "c__Gammaproteobacteria" "o__Pasteurellales"  "f__Pasteurellaceae" 
264 "p__Proteobacteria" "c__Gammaproteobacteria" "o__Pasteurellales"  "f__Pasteurellaceae" 
338 "p__Firmicutes"     "c__Bacilli"             "o__Lactobacillales" "f__Streptococcaceae"
821 "p__Firmicutes"     "c__Clostridia"          "o__Clostridiales"   "f__Veillonellaceae" 
693 "p__Firmicutes"     "c__Bacilli"             "o__Lactobacillales" "f__Streptococcaceae"
    Genus              Species                    
137 "g__Prevotella"    "s__Prevotella_histicola"  
619 "g__Rothia"        "s__Uncultured_bacterium"  
945 "g__Moraxella"     "s__Uncultured_bacterium"  
688 "g__Neisseria"     "s__Uncultured_bacterium"  
915 "g__Pseudomonas"   "s__Pseudomonas_aeruginosa"
542 "g__Haemophilus"   NA                         
264 "g__Haemophilus"   "s__Uncultured_organism"   
338 "g__Streptococcus" "s__Uncultured_bacterium"  
821 "g__Veillonella"   "s__Uncultured_bacterium"  
693 "g__Streptococcus" NA  

Firstly, sometimes stupid Species names are used so I replace those with NA

tax_table(x10) =gsub("s__Uncultured_bacterium", as.character(NA), tax_table(x10))
tax_table(x10) =gsub("s__Uncultured_organism", as.character(NA), tax_table(x10))

Then I run your function over the whole tax_table like this:

mynames = NULL
for (i in 1:length(taxa_names(x10))){
mynames <- rbind(mynames, c(makeTaxLabel(taxa_names(x10)[i],x10)))
}

Because I have the Silva table with the additional prefixes I remove these (incidentally, I think I remember trying the greengenes importer with the Silva format taxonomy and it works, so this step won't be needed when I go back to the very beginning and use that as they've already been stripped)

mynames = gsub("s__", "", mynames)
mynames = gsub("g__", "", mynames)
mynames = gsub("f__", "", mynames)
mynames = gsub("o__", "", mynames)
mynames = gsub("c__", "", mynames)
mynames = gsub("p__", "", mynames)

The result gets the OTU ID suffixed and then the whole lot added to the original tax_table

mynames = str_c(mynames[,1],"_",taxa_names(x10))
tax_table(x10) <- cbind(tax_table(x10), OTUID=mynames)

Done!

tax_table(x10)
Taxonomy Table:     [10 taxa by 7 taxonomic ranks]:
    Phylum              Class                    Order                Family               
137 "p__Bacteroidetes"  "c__Bacteroidia"         "o__Bacteroidales"   "f__Prevotellaceae"  
619 "p__Actinobacteria" "c__Actinobacteria"      "o__Micrococcales"   "f__Micrococcaceae"  
945 "p__Proteobacteria" "c__Gammaproteobacteria" "o__Pseudomonadales" "f__Moraxellaceae"   
688 "p__Proteobacteria" "c__Betaproteobacteria"  "o__Neisseriales"    "f__Neisseriaceae"   
915 "p__Proteobacteria" "c__Gammaproteobacteria" "o__Pseudomonadales" "f__Pseudomonadaceae"
542 "p__Proteobacteria" "c__Gammaproteobacteria" "o__Pasteurellales"  "f__Pasteurellaceae" 
264 "p__Proteobacteria" "c__Gammaproteobacteria" "o__Pasteurellales"  "f__Pasteurellaceae" 
338 "p__Firmicutes"     "c__Bacilli"             "o__Lactobacillales" "f__Streptococcaceae"
821 "p__Firmicutes"     "c__Clostridia"          "o__Clostridiales"   "f__Veillonellaceae" 
693 "p__Firmicutes"     "c__Bacilli"             "o__Lactobacillales" "f__Streptococcaceae"
    Genus              Species                     OTUID                       
137 "g__Prevotella"    "s__Prevotella_histicola"   "Prevotella_histicola_137"  
619 "g__Rothia"        NA                          "Rothia_619"                
945 "g__Moraxella"     NA                          "Moraxella_945"             
688 "g__Neisseria"     NA                          "Neisseria_688"             
915 "g__Pseudomonas"   "s__Pseudomonas_aeruginosa" "Pseudomonas_aeruginosa_915"
542 "g__Haemophilus"   NA                          "Haemophilus_542"           
264 "g__Haemophilus"   NA                          "Haemophilus_264"           
338 "g__Streptococcus" NA                          "Streptococcus_338"         
821 "g__Veillonella"   NA                          "Veillonella_821"           
693 "g__Streptococcus" NA                          "Streptococcus_693"   

Really clunky and definitely doesn't work generally!

I had intended to do the stripping of the taxa level prefixes and the appending of OTU IDs in a single step with str_extract, but spent fruitless hours attempting to find the correct regex syntax to do "keep everything after __"!

joey711 commented 10 years ago

Hey you guys, sorry for the delay. I'm working on some important new features for the package, so I haven't gotten to this. I can definitely show you some nice ways to do this that are stable and general. I'm not sure exactly how/if I would want to implement this in the package, which is why the issue is still closed. I have some idea why you want this, but if you want to provide additional feedback clarifying how this will improve your work with the features in the package, it will help keep this on my radar.

Mikeyj commented 10 years ago

Hi Joey,

Sorry for slow reply. For me it's for clear, simple and informative referencing of OTUs in text, tables and figures.

OTU IDs are the output of almost every comparison: legends for stacked bars (that aren't glommed to higher taxonomic ranks), tip labels of trees as above, tables of OTUs that correlate with an environmental variable, either directly or via species loading scores, OTU locations in biplots etc. In the text of manuscripts you will also often want to draw attention to particular OTUs.

It's important that they're unique so that you can tell them apart. For these purposes, the phylogenetic string is descriptive, but none unique. It's also important that they give readers an idea of what the organism is, even if your ID is not very good and you get really uptight about species concepts and lack thereof, people are better at remembering names than numbers.

Therefore, combining the two should give us an ID that does the job. The whole taxa string is unwieldy, and contains redundant information that can be found in the database you've used for identification. The genus and species names would be enough, but you don't always get them, and sometimes they're stupid (unlcassified, uranium_slag_heap_clone27, etc.), so I'm going for the last, most informative phylogenetic level as the descriptive part of the ID.

If I can get that into presentation format too (genus and species italicized if present, no underscores), then that's even better.