Closed Mikeyj closed 11 years ago
@Mikeyj
So if you were going to use the current framework in phyloseq, which I obviously recommend O:-), you could simply do a row-wise concatenation of the relevant taxonomic information you're after.
You can use the import infrastructure in phyloseq to parse the taxonomy for you. The r__
prefixes in front of your taxonomic elements should be trimmed off, since their position in the taxonomy table already indicates the rank. This is done automatically when you specify parse_taxonomy_greengenes
as the parseFunction
argument in import_biom
.
Try this reproducible example for concatenating the taxonomy as a particular string for plotting. This example contains only 10 taxa so it is fast to reproduce and easy to read, even on a tree created at the end. The first line of the multi-line plot_tree
call includes the argument adding referring to the new concatenated dummy "rank" that I added to the taxonomy table of the phyloseq object, x10
.
library("phyloseq")
data("GlobalPatterns")
x10 = prune_taxa(tail(names(sort(taxa_sums(GlobalPatterns))), 10), GlobalPatterns)
# Fix the node labels for the sake of the tree at the end:
phy_tree(x10)$node.label <- gsub("\\.[[:alnum:]]+$", "", phy_tree(x10)$node.label)
# Add a new rank, Strain, with the OTU ids
tax_table(x10) <- cbind(tax_table(x10), Strain=taxa_names(x10))
# Define the ranks you want to include
myranks = c("Genus", "Species", "Strain")
mylabels = apply(tax_table(x10)[, myranks], 1, paste, sep="", collapse="_")
# Add concatenated labels as a new rank after strain
tax_table(x10) <- cbind(tax_table(x10), catglab=mylabels)
# Check this out on a tree
plot_tree(x10, label.tips="catglab",
color="SampleType", shape="Phylum", nodelabf=nodeplotboot(),
ladderize="left", plot.margin=2.15, size="abundance")
Does this accomplish what you were asking about?
By the way, the complete "upstream" taxonomy is utilized when performing taxonomy based OTU merges with the tax_glom
function. Can be very useful...
@Mikeyj
I think I answer your question with this example, and existing capabilities in phyloseq. I can re-open this issue if there is still something lingering.
Cheers
joey
Hi Joey,
apologies for not responding to this more quickly, I get to do analysis in fits and starts!
With your above example the output isn't ideal. The tax table of object x10 looks like this:
> tax_table(x10)
Taxonomy Table: [10 taxa by 9 taxonomic ranks]:
Kingdom Phylum Class Order Family
329744 "Bacteria" "Actinobacteria" "Actinobacteria" "Actinomycetales" "ACK-M1"
317182 "Bacteria" "Cyanobacteria" "Chloroplast" "Stramenopiles" NA
549656 "Bacteria" "Cyanobacteria" "Chloroplast" "Stramenopiles" NA
279599 "Bacteria" "Cyanobacteria" "Nostocophycideae" "Nostocales" "Nostocaceae"
360229 "Bacteria" "Proteobacteria" "Betaproteobacteria" "Neisseriales" "Neisseriaceae"
94166 "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Pasteurellales" "Pasteurellaceae"
550960 "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Enterobacteriales" "Enterobacteriaceae"
158660 "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales" "Bacteroidaceae"
331820 "Bacteria" "Bacteroidetes" "Bacteroidia" "Bacteroidales" "Bacteroidaceae"
189047 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales" "Ruminococcaceae"
Genus Species Strain
329744 NA NA "329744"
317182 NA NA "317182"
549656 NA NA "549656"
279599 "Dolichospermum" NA "279599"
360229 "Neisseria" NA "360229"
94166 "Haemophilus" "Haemophilusparainfluenzae" "94166"
550960 "Providencia" NA "550960"
158660 "Bacteroides" NA "158660"
331820 "Bacteroides" NA "331820"
189047 NA NA "189047"
catglab
329744 "NA_NA_329744"
317182 "NA_NA_317182"
549656 "NA_NA_549656"
279599 "Dolichospermum_NA_279599"
360229 "Neisseria_NA_360229"
94166 "Haemophilus_Haemophilusparainfluenzae_94166"
550960 "Providencia_NA_550960"
158660 "Bacteroides_NA_158660"
331820 "Bacteroides_NA_331820"
189047 "NA_NA_189047"
With the last level, catglab containing the new tree labels. The first three and last OTUs have uninformative names and the rest pick up an NA. The actual output that you'd want for the labels would be something like:
ACK-M1_329744
Stramenopiles_317182
Stramenopiles_549656
Dolichospermum_279599
Neisseria_360229
Haemophilus_parainfluenzae_94166
Providencia_550960
Bacteroides_158660
Bacteroides_331820
Ruminococcaceae_189047
Different phylogenetic levels are used for the name each time, so the rule I think would be:
If genus and species present, use genus_species_OTUnumber If absent use family_OTUnumber If absent use order_OTUnumber if absent use class_OTUnumber if absent use phylum_OTUnumber if absent use unclassified_OTUnumber
Resulting in a tree with all unique names, where OTUs can be referenced and all names are as informative as your identification allows.
It would be even better if genus and species were also italicised, but that's just window dressing!
Cheers
Mike
@Mikeyj
Can you clarify what you are asking for? The object printed to screen from tax_table(x10)
in your example is just that: the stuff printed to screen. It's mean for you to interact with the data, not to create tables. R already includes many tools for creating "nice" tables from matrices, which the tax_table(x10)
returns, or you can coerce explicitly with as(tax_table(x10), "matrix")
.
Similarly, there are several options for custom-parsing the rank names or multiple rank names, using standard existing R string handling tools like paste
, paste0
, gsub
, etc. What you appear to be asking for is something specific to the way you want to represent your data taxonomic classification data in a plot, and you are almost there. If there is something that phyloseq could provide that would be generally useful, I am happy to consider it... Are you asking for special node labels? Many folks want to store bootstrap values in their node labels, and the taxonomic classification data is already stored in the table. I'm not sure what is gained by adding it to the tree. It won't be legible.
So I've made little progress with this - documenting in this thread in case it's useful for other users (or I'm tackling this in a stupid way!) if I don't get a chance to finish it for a while.
What I would like is an additional column in the taxa table that has a nice (and unique) OTU name suitable for use in all plots (not just the phylogenetic tree) that refer to an OTU by name. The difficulty is that OTUs won't all have been identified to the same level and there are likely to be plenty of NAs that ruin the niceness.
Using Joey's myranks/catglab code above I've delved into the world of if/else and attempted to create a rule for selecting the best identification that isn't NA for each OTU and use that with Strain (OTU ID) as the nice name. Not currently functional :)
myranks = c(if("Species"!=NA){
"Species"
} else if("Genus"!=NA){
"Genus"
} else if("Family"!=NA{
"Family"
} else if("Order"!=NA){
"Order"
} else if("Class"!=NA){
"Class"
} else ("Phylum"!=NA){
"Phylum"
}, "Strain")
@Mikeyj
I see what you're trying to do. One issue is that you shouldn't be hard-coding the rank names. There's no requirement in phyloseq that the tax_table
ranks have specific column names. This is important so that it can be used in different datasets with different phylogenetic resolutions, and without any particular imposition about rank labels. In this way it also supports other forms of categorical information on taxa.
Thanks Joey - this is stretching my skills a bit, so will hard code initially and then replace with references once I've got my brain round the process. I don't think the above strategy works, I think I need to do this using gsub and apply, outputting another new column that I can cbind to the tax_table.
I use these functions. I use the greengenes prefixes, hence the if statement to make the tax_name "NA" if it is an empty greengenes prefix. It gets around the whole number of levels and levels names thing.
If I give it a vector of OTU numbers it returns the best informative tax assignment (with OTU numbers as names).
Probably inefficient, but it seems to give me the results I am after...
Aaron
makeTaxLabels <- function(OTUs, mydata){
# Vector wrapper for "makeLabel"
# Makes a string label using the lowest informative tax level
#
# Args:
# OTUs: vector of OTU numbers
# mydata: the phyloseq object with the tax table
#
# Returns:
# a vector of a tax names
sapply(OTUs, FUN=makeTaxLabel, mydata)
}
makeTaxLabel <- function(OTU, mydata){
# Makes a string label using the lowest informative tax level
#
# Args:
# OTU: OTU number
# mydata: the phyloseq object with the tax table
#
# Returns:
# a tax name
taxstrings <- as.character(taxTab(mydata)[OTU])
empty_strings <- c("k__", "p__", "c__", "o__", "f__", "g__", "s__")
tax_name <- NA
tax_level <- length(taxstrings)
while(is.na(tax_name) | (tax_level == 0) ){
tax_name <- taxstrings[tax_level]
if (tax_name %in% empty_strings) {
tax_level <- tax_level - 1
tax_name <- NA
}
}
tax_name <- as.character(tax_name)
names(tax_name) <- OTU
tax_name
}
I have tried to use the previous code it breaks sometimes. Maybe this is better. I also removed the second last names() line as it doubles up on the names when run through sapply.
There should really be a test dataset... It is slow too. probably the while loop?
Aaron
makeTaxLabel <- function(OTU, mydata){
# Makes a string label using the lowest informative tax level
#
# Args:
# OTU: OTU number
# mydata: the phyloseq object with the tax table
#
# Returns:
# a tax name
OTU <- as.character(OTU) # the OTU numbers are stored as character not integer!
taxstrings <- as.character(taxTab(mydata)[OTU])
empty_strings <- c("k__", "p__", "c__", "o__", "f__", "g__", "s__")
tax_name <- NA
tax_level <- length(taxstrings) # start at lowest tax level
while(is.na(tax_name) |
(tax_name %in% empty_strings)){
tax_name <- taxstrings[tax_level]
tax_level <- tax_level -1
}
tax_name
}
Thanks Aaron,
that's got me onto the right track - I have taken your nice loop and added some ugly stupid stuff on the end that achieves what I want. Here goes:
My taxa table looks like this as it's from Silva IDs
> tax_table(x10)
Taxonomy Table: [10 taxa by 6 taxonomic ranks]:
Phylum Class Order Family
137 "p__Bacteroidetes" "c__Bacteroidia" "o__Bacteroidales" "f__Prevotellaceae"
619 "p__Actinobacteria" "c__Actinobacteria" "o__Micrococcales" "f__Micrococcaceae"
945 "p__Proteobacteria" "c__Gammaproteobacteria" "o__Pseudomonadales" "f__Moraxellaceae"
688 "p__Proteobacteria" "c__Betaproteobacteria" "o__Neisseriales" "f__Neisseriaceae"
915 "p__Proteobacteria" "c__Gammaproteobacteria" "o__Pseudomonadales" "f__Pseudomonadaceae"
542 "p__Proteobacteria" "c__Gammaproteobacteria" "o__Pasteurellales" "f__Pasteurellaceae"
264 "p__Proteobacteria" "c__Gammaproteobacteria" "o__Pasteurellales" "f__Pasteurellaceae"
338 "p__Firmicutes" "c__Bacilli" "o__Lactobacillales" "f__Streptococcaceae"
821 "p__Firmicutes" "c__Clostridia" "o__Clostridiales" "f__Veillonellaceae"
693 "p__Firmicutes" "c__Bacilli" "o__Lactobacillales" "f__Streptococcaceae"
Genus Species
137 "g__Prevotella" "s__Prevotella_histicola"
619 "g__Rothia" "s__Uncultured_bacterium"
945 "g__Moraxella" "s__Uncultured_bacterium"
688 "g__Neisseria" "s__Uncultured_bacterium"
915 "g__Pseudomonas" "s__Pseudomonas_aeruginosa"
542 "g__Haemophilus" NA
264 "g__Haemophilus" "s__Uncultured_organism"
338 "g__Streptococcus" "s__Uncultured_bacterium"
821 "g__Veillonella" "s__Uncultured_bacterium"
693 "g__Streptococcus" NA
Firstly, sometimes stupid Species names are used so I replace those with NA
tax_table(x10) =gsub("s__Uncultured_bacterium", as.character(NA), tax_table(x10))
tax_table(x10) =gsub("s__Uncultured_organism", as.character(NA), tax_table(x10))
Then I run your function over the whole tax_table like this:
mynames = NULL
for (i in 1:length(taxa_names(x10))){
mynames <- rbind(mynames, c(makeTaxLabel(taxa_names(x10)[i],x10)))
}
Because I have the Silva table with the additional prefixes I remove these (incidentally, I think I remember trying the greengenes importer with the Silva format taxonomy and it works, so this step won't be needed when I go back to the very beginning and use that as they've already been stripped)
mynames = gsub("s__", "", mynames)
mynames = gsub("g__", "", mynames)
mynames = gsub("f__", "", mynames)
mynames = gsub("o__", "", mynames)
mynames = gsub("c__", "", mynames)
mynames = gsub("p__", "", mynames)
The result gets the OTU ID suffixed and then the whole lot added to the original tax_table
mynames = str_c(mynames[,1],"_",taxa_names(x10))
tax_table(x10) <- cbind(tax_table(x10), OTUID=mynames)
Done!
tax_table(x10)
Taxonomy Table: [10 taxa by 7 taxonomic ranks]:
Phylum Class Order Family
137 "p__Bacteroidetes" "c__Bacteroidia" "o__Bacteroidales" "f__Prevotellaceae"
619 "p__Actinobacteria" "c__Actinobacteria" "o__Micrococcales" "f__Micrococcaceae"
945 "p__Proteobacteria" "c__Gammaproteobacteria" "o__Pseudomonadales" "f__Moraxellaceae"
688 "p__Proteobacteria" "c__Betaproteobacteria" "o__Neisseriales" "f__Neisseriaceae"
915 "p__Proteobacteria" "c__Gammaproteobacteria" "o__Pseudomonadales" "f__Pseudomonadaceae"
542 "p__Proteobacteria" "c__Gammaproteobacteria" "o__Pasteurellales" "f__Pasteurellaceae"
264 "p__Proteobacteria" "c__Gammaproteobacteria" "o__Pasteurellales" "f__Pasteurellaceae"
338 "p__Firmicutes" "c__Bacilli" "o__Lactobacillales" "f__Streptococcaceae"
821 "p__Firmicutes" "c__Clostridia" "o__Clostridiales" "f__Veillonellaceae"
693 "p__Firmicutes" "c__Bacilli" "o__Lactobacillales" "f__Streptococcaceae"
Genus Species OTUID
137 "g__Prevotella" "s__Prevotella_histicola" "Prevotella_histicola_137"
619 "g__Rothia" NA "Rothia_619"
945 "g__Moraxella" NA "Moraxella_945"
688 "g__Neisseria" NA "Neisseria_688"
915 "g__Pseudomonas" "s__Pseudomonas_aeruginosa" "Pseudomonas_aeruginosa_915"
542 "g__Haemophilus" NA "Haemophilus_542"
264 "g__Haemophilus" NA "Haemophilus_264"
338 "g__Streptococcus" NA "Streptococcus_338"
821 "g__Veillonella" NA "Veillonella_821"
693 "g__Streptococcus" NA "Streptococcus_693"
Really clunky and definitely doesn't work generally!
I had intended to do the stripping of the taxa level prefixes and the appending of OTU IDs in a single step with str_extract, but spent fruitless hours attempting to find the correct regex syntax to do "keep everything after __"!
Hey you guys, sorry for the delay. I'm working on some important new features for the package, so I haven't gotten to this. I can definitely show you some nice ways to do this that are stable and general. I'm not sure exactly how/if I would want to implement this in the package, which is why the issue is still closed. I have some idea why you want this, but if you want to provide additional feedback clarifying how this will improve your work with the features in the package, it will help keep this on my radar.
Hi Joey,
Sorry for slow reply. For me it's for clear, simple and informative referencing of OTUs in text, tables and figures.
OTU IDs are the output of almost every comparison: legends for stacked bars (that aren't glommed to higher taxonomic ranks), tip labels of trees as above, tables of OTUs that correlate with an environmental variable, either directly or via species loading scores, OTU locations in biplots etc. In the text of manuscripts you will also often want to draw attention to particular OTUs.
It's important that they're unique so that you can tell them apart. For these purposes, the phylogenetic string is descriptive, but none unique. It's also important that they give readers an idea of what the organism is, even if your ID is not very good and you get really uptight about species concepts and lack thereof, people are better at remembering names than numbers.
Therefore, combining the two should give us an ID that does the job. The whole taxa string is unwieldy, and contains redundant information that can be found in the database you've used for identification. The genus and species names would be enough, but you don't always get them, and sometimes they're stupid (unlcassified, uranium_slag_heap_clone27, etc.), so I'm going for the last, most informative phylogenetic level as the descriptive part of the ID.
If I can get that into presentation format too (genus and species italicized if present, no underscores), then that's even better.
Hi Joey,
suggestion for consideration rather than bug.
Sometimes if I label OTUs on a plot using Genus there are labels missing due to that particular taxonomic level being absent for that OTU. In addition it is also possible to have multiple distinct OTUs with the same name. This can make referencing particular OTUs of interest difficult in the text of papers.
How I've got round this in the past is to create a unique name for each OTU that includes the OTU number. In addition I use the best taxonomic level available for any particular OTU. For example:
In the current scheme your OTUs at the Genus level would be called:
gPseudmonas gPseudmonas