joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/
579 stars 188 forks source link

import QIIME and import biom parse taxa warnings #266

Closed laurenms closed 10 years ago

laurenms commented 10 years ago

Hi, I'm trying to import data from QIIME using import_qiime or import_biom. Here are the versions I'm using, my code and the errors:

QIIME version: MacQIIME 1.7.0-20130523 R version: R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"

packageVersion("phyloseq") [1] ‘1.7.8’

otufile=....path to otu_table_tabseparated.txt created by pick_de_novo_otus.py in QIIME mapfile=...path to map file trefile= path to rep_set.tre created by pick_de_novo_otus.py make_phylogeny.py in QIIME

Here's what happens when I run import_qiime:

iqdata=import_qiime(otufilename=otufile, mapfilename=mapfile, treefilename=trefile, parseFunction=parse_taxonomy_greengenes) Processing map file... Processing otu/tax file...

Reading and parsing file in chunks ... Could take some time. Please be patient...

Building OTU Table in chunks. Each chunk is one dot. ............Building Taxonomy Table... Processing phylogenetic tree... There were 50 or more warnings (use warnings() to see the first 50)

warnings() Warning messages: 1: In FUN(X[[11746L]], ...) : No greengenes prefixes were found. Consider using parse_taxonomy_default() instead if true for all OTUs. Dummy ranks may be included among taxonomic ranks now. 2: In FUN(X[[11746L]], ...) : No greengenes prefixes were found. Consider using parse_taxonomy_default() instead if true for all OTUs. Dummy ranks may be included among taxonomic ranks now.

all 50 warnings are the same. so i did the following to investigate further...

iqdata phyloseq-class experiment-level object otu_table() OTU Table: [ 9723 taxa and 19 samples ] sample_data() Sample Data: [ 19 samples by 8 sample variables ] tax_table() Taxonomy Table: [ 9723 taxa by 1 taxonomic ranks ] phy_tree() Phylogenetic Tree: [ 9723 tips and 9721 internal nodes ]

tax_table(iqdata) returns denovo numbers and the taxonomy. for example, denovo7075 "kBacteria; pFirmicutes; cClostridia; oClostridiales; f__Lachnospiraceae"

rank_names(iqdata) [1] "Kingdom" "Rank1"

but

tn=taxa_names(iqdata) tn[1:20] [1] "denovo6848" "denovo4863" "denovo5672" "denovo11027" "denovo8010" "denovo2895" "denovo4833" "denovo9824" [9] "denovo8313" "denovo10777" "denovo5302" "denovo7892" "denovo2916" "denovo11289" "denovo10781" "denovo4820" [17] "denovo8231" "denovo5848" "denovo11349" "denovo6710"

tt=tax_table(iqdata) tt[1:10] Taxonomy Table: [10 taxa by 2 taxonomic ranks]: Kingdom Rank1 denovo6848 "Bacteria; Firmicutes; Clostridia; Clostridiales; Lachnospiraceae; ; " NA
denovo4863 "Bacteria; Firmicutes; Clostridia; Clostridiales; Lachnospiraceae" NA
denovo5672 "Bacteria; Firmicutes; Clostridia; Clostridiales; Lachnospiraceae" NA
denovo11027 "Bacteria; Firmicutes; Clostridia; Clostridiales; Lachnospiraceae; ; " NA
denovo8010 "Bacteria; Firmicutes; Clostridia; Clostridiales; Lachnospiraceae; ; " NA
denovo2895 "Bacteria; Firmicutes; Clostridia; Clostridiales; Lachnospiraceae; ; " NA
denovo4833 "Bacteria; Firmicutes; Clostridia; Clostridiales; Lachnospiraceae; ; " NA
denovo9824 "Bacteria; Firmicutes; Clostridia; Clostridiales; Lachnospiraceae; [Ruminococcus]; " NA
denovo8313 "Bacteria; Firmicutes; Clostridia; Clostridiales; Lachnospiraceae; ; " NA
denovo10777 "Bacteria; Firmicutes; Clostridia; Clostridiales; Lachnospiraceae; ; " NA

If I run import_qiime with parseFunction=parse_taxonomy_default then I don't get any warnings but the taxa data is not there.

tn=taxa_names(iqdata) tn[1:10] [1] "denovo6848" "denovo4863" "denovo5672" "denovo11027" "denovo8010" "denovo2895" "denovo4833" "denovo9824" "denovo8313" "denovo10777"

What should I try next? Thanks! Lauren

joey711 commented 10 years ago

Hi Lauren!

Thanks for the feedback. I can't tell exactly what the issue is from your post. My best guess is that there is something unexpected in the format of your taxonomy table. I find it strange because the taxonomy strings are usually semicolon delimited, and some of your example taxonomy strings after parsing still have semicolons. Usually the generic rank names, like "Rank_1", show up when greengenes prefixes are absent or not given. This may even be expected with de novo OTU clusters if they are not given any reference classification at all. However, again, I can't tell from your output, alone.

What to try next: Hopefully the ranks are consistent across the classification, even without assigning formal names to each column automatically.

Try if greengenes parsing works (assuming these are greengenes taxonomic classifications):

import_qiime(..., parseFunction=parse_taxonomy_greengenes)

or if not, you can try the most generic:

import_qiime(..., parseFunction=parse_taxonomy_default)

The former assumes that each taxonomy string is already split in the file by its standard delimiter (tab?) and will detect/clip/label greengenes prefixes if present. The latter doesn't assume anything, other than that the taxonomy is also already split. Old "legacy" QIIME output format has taxonomy strings delimited by semi-colons and not separated by a tab like the counts and other fields in the table. This is the main reason for the distinction.

It is always possible to define your own taxonomy parsing function, and you can start with the three available in phyloseq as a starting point. For instance, parse_taxonomy_qiime is just:

parse_taxonomy_qiime <- function (char.vec){
     parse_taxonomy_greengenes(strsplit(char.vec, ";", TRUE)[[1]])
}

That is, all it does is split the semicolon delimited string and pass the results to parse_taxonomy_greengenes. From your output it looks like there is a problem with this splitting step, because you are only ending up with 2 taxonomic ranks instead of 5 or more.

Hope that helps

joey

laurenms commented 10 years ago

Thanks for the reply Joey!

I've tried all three parsing functions with import_qiime and import_biom but no luck. I haven't tried creating my own yet because I'm not sure what I would tell it to parse by. How can I check the formatting of the taxonomy table? Which file from QIIME contains the taxonomy table and is there a way to look at it?

I imported the otufile (labeled by QIIME as otu_table_tabseparated.txt) into excel to get a better look at how the taxa are formatted. The Consensus Lineage column has entries that look like "kBacteria; pFirmicutes; cClostridia; oClostridiales; fLachnospiraceae; gRoseburia." Does this look how you would expect?

I'm concerned I'm trying to import the wrong files. For the treefilename input, I'm using the rep_set.tre file created by pick_de_novo_otus.py in QIIME. Is this right? I'm not sure if I should be using the seqs_rep_set_aligned_pfiltered.fasta file in the pynast_aligned_seqs folder created as part of the pick_de_novo_otus.py? Is this a file that should go in the refseqfilename option?

Thanks again for your help and patience with beginner questions! lauren

On Nov 17, 2013, at 9:52 PM, Paul J. McMurdie wrote:

Hi Lauren!

Thanks for the feedback. I can't tell exactly what the issue is from your post. My best guess is that there is something unexpected in the format of your taxonomy table. I find it strange because the taxonomy strings are usually semicolon delimited, and some of your example taxonomy strings after parsing still have semicolons. Usually the generic rank names, like "Rank_1", show up when greengenes prefixes are absent or not given. This may even be expected with de novo OTU clusters if they are not given any reference classification at all. However, again, I can't tell from your output, alone.

What to try next: Hopefully the ranks are consistent across the classification, even without assigning formal names to each column automatically.

Try if greengenes parsing works (assuming these are greengenes taxonomic classifications):

import_qiime(..., parseFunction=parse_taxonomy_greengenes) or if not, you can try the most generic:

import_qiime(..., parseFunction=parse_taxonomy_default) The former assumes that each taxonomy string is already split in the file by its standard delimiter (tab?) and will detect/clip/label greengenes prefixes if present. The latter doesn't assume anything, other than that the taxonomy is also already split. Old "legacy" QIIME output format has taxonomy strings delimited by semi-colons and not separated by a tab like the counts and other fields in the table. This is the main reason for the distinction.

It is always possible to define your own taxonomy parsing function, and you can start with the three available in phyloseq as a starting point. For instance, parse_taxonomy_qiime is just:

parse_taxonomy_qiime <- function (char.vec){ parse_taxonomy_greengenes(strsplit(char.vec, ";", TRUE)[[1]]) } That is, all it does is split the semicolon delimited string and pass the results to parse_taxonomy_greengenes. From your output it looks like there is a problem with this splitting step, because you are only ending up with 2 taxonomic ranks instead of 5 or more.

Hope that helps

joey

— Reply to this email directly or view it on GitHub.

joey711 commented 10 years ago

If it is a legacy QIIME file than the taxonomy string is usually in the OTU table. Typically the right-most column.

You used the latest version of QIIME, which also gives you the option to write the data in biom-format, which might work better. However, I think there is something else going on. I think you said you looked at the taxonomy string(s) from the OTU table. It seems like there is something strange going on with at least a subset of these for it to only split into two taxonomic ranks. Scan through these in excel and see if you can find entries that don't follow the expected semicolon-delimited format, are empty, or have other issues. In principal this should matter to the parser and import functions in phyloseq, so this is potentially a new issue I haven't seen.

You could also just export the taxonomy column/portion yourself, and have excel split it on the semicolons, then save the split version as a tab-delimited file, and read that into R as a table with OTU IDs as rownames. This table in R would be suitable for tax_table() and to add to the rest of your data as the tax_table component. See examples for tax_table and some related recent Issues about manually importing data components if you want to try this route.

Cheers

joey

laurenms commented 10 years ago

Hi Joey, Thanks for your patience and suggestions. I still can't get everything to load correctly but I think I'm getting closer.

Here is my code (as you suggested I tried using the otu biom):

otubiom="... path to ... otu_table.biom" file <-import_biom(otubiom) map <-import_qiime_sampleData(mapfile) treefile <- read_tree("... path to ... rep_set.tre")

merged_data <-merge_phyloseq(file,map,treefile)

merged_data

phyloseq-class experiment-level object otu_table() OTU Table: [ 9723 taxa and 19 samples ] sample_data() Sample Data: [ 19 samples by 8 sample variables ] tax_table() Taxonomy Table: [ 9723 taxa by 7 taxonomic ranks ] phy_tree() Phylogenetic Tree: [ 9723 tips and 9721 internal nodes ]

rank_names(merged_data) [1] "Rank1" "Rank2" "Rank3" "Rank4" "Rank5" "Rank6" "Rank7"

So now all the ranks are recognized (instead of 1 before) but they are NOT associated with Phylum, Order, etc. even though the taxonomy table has the information (see below).

taxtable=tax_table(merged_data) taxtable[1:10] Taxonomy Table: [10 taxa by 7 taxonomic ranks]: Rank1 Rank2 Rank3 Rank4 Rank5
denovo6848 "kBacteria" "pFirmicutes" "cClostridia" "oClostridiales" "fLachnospiraceae" denovo4863 "kBacteria" "pFirmicutes" "cClostridia" "oClostridiales" "fLachnospiraceae" denovo5672 "kBacteria" "pFirmicutes" "cClostridia" "oClostridiales" "fLachnospiraceae" denovo11027 "kBacteria" "pFirmicutes" "cClostridia" "oClostridiales" "fLachnospiraceae" denovo8010 "kBacteria" "pFirmicutes" "cClostridia" "oClostridiales" "fLachnospiraceae" denovo2895 "kBacteria" "pFirmicutes" "cClostridia" "oClostridiales" "fLachnospiraceae" denovo4833 "kBacteria" "pFirmicutes" "cClostridia" "oClostridiales" "fLachnospiraceae" denovo9824 "kBacteria" "pFirmicutes" "cClostridia" "oClostridiales" "fLachnospiraceae" denovo8313 "kBacteria" "pFirmicutes" "cClostridia" "oClostridiales" "fLachnospiraceae" denovo10777 "kBacteria" "pFirmicutes" "cClostridia" "oClostridiales" "fLachnospiraceae" Rank6 Rank7 denovo6848 "g" "s" denovo4863 NA NA
denovo5672 NA NA
denovo11027 "g" "s" denovo8010 "g" "s" denovo2895 "g" "s" denovo4833 "g" "s" denovo9824 "g[Ruminococcus]" "s" denovo8313 "g" "s" denovo10777 "g" "s"

When I tried to plot: plot_bar(run1, x = "X.SampleID", fill = "Rank3") I got a plot that had a color bar of taxa like c__Gammaproteobacteria. But plot_bar(run1, x = "X.SampleID", fill = "Class") did not work.

Any ideas about what to try next?

I looked through the Consensus Lineage column (the last in the otu table with the taxonomy info). None of them are blank but some of them are "Unclassified" or just "k__Bacteria" with no other taxa designation. Is this a problem for phyloseq functions that load the data? Should I change or remove the label "Unclassified" ? I also tried what you suggested and created a tab separated file in excel that has just the OTU ID and then the taxonomy divided up by taxa (by column). When I tried to use tax_table() I got the following message Error in access(object, "tax_table", errorIfNULL) : object 'errorIfNULL' not found

I'm really excited about phyloseq but I used all the basic commands for processing 454 data in QIIME and can't figure out how to get them into phyloseq in a usable way so there must be something I'm missing :( Thanks, lauren

On Nov 18, 2013, at 11:29 AM, Paul J. McMurdie wrote:

If it is a legacy QIIME file than the taxonomy string is usually in the OTU table. Typically the right-most column.

You used the latest version of QIIME, which also gives you the option to write the data in biom-format, which might work better. However, I think there is something else going on. I think you said you looked at the taxonomy string(s) from the OTU table. It seems like there is something strange going on with at least a subset of these for it to only split into two taxonomic ranks. Scan through these in excel and see if you can find entries that don't follow the expected semicolon-delimited format, are empty, or have other issues. In principal this should matter to the parser and import functions in phyloseq, so this is potentially a new issue I haven't seen.

You could also just export the taxonomy column/portion yourself, and have excel split it on the semicolons, then save the split version as a tab-delimited file, and read that into R as a table with OTU IDs as rownames. This table in R would be suitable for tax_table() and to add to the rest of your data as the tax_table component. See examples for tax_table and some related recent Issues about manually importing data components if you want to try this route.

Cheers

joey

— Reply to this email directly or view it on GitHub.

aaronsaunders commented 10 years ago

Joey wrote a guide to setting the names of the generic taxa levels here:

https://github.com/joey711/phyloseq/issues/162

maybe that helps.

laurenms commented 10 years ago

Thanks Aaron! I'll look at it...

On Nov 19, 2013, at 12:21 AM, Aaron Marc Saunders wrote:

Joey wrote a guide to setting the names of the generic taxa levels here:

162

maybe that helps.

— Reply to this email directly or view it on GitHub.

joey711 commented 10 years ago

Lauren, your merged_data object print to R standard out as a phyloseq-class object with lots of OTUs and samples. Looks like you're done importing :)

Aaron is exactly right about the generic taxa levels (thanks for posting the link, Aaron!).

I can't promise that it's completely consistent in your data, but from the bit of the taxonomy table that you posted above, the following would fix your rank_names to be what you want:

colnames(tax_table(merged_data)) <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")
# Check that it worked
rank_names(merged_data)

Then downstream functions that can take a rank name as optional argument should work as you expected. Note that the function call that you posted as "failed" above would have worked with the appropriate data and rank name (before you changed them with the assignment line above). So it's not surprise that before you renamed the ranks, the following worked

plot_bar(merged_data, x = "X.SampleID", fill = "Rank3")

but this didn't

plot_bar(merged_data, x = "X.SampleID", fill = "Class")

After you rename the ranks, the latter will work, and the former won't. Make sense?

Thanks for the feedback. I think it is helpful/encouraging to others.

joey

laurenms commented 10 years ago

Thanks Joey!!

On Nov 19, 2013, at 12:54 PM, Paul J. McMurdie wrote:

Lauren, your merged_data object print to R standard out as a phyloseq-class object with lots of OTUs and samples. Looks like you're done importing :)

Aaron is exactly right about the generic taxa levels (thanks for posting the link, Aaron!).

I can't promise that it's completely consistent in your data, but from the bit of the taxonomy table that you posted above, the following would fix your rank_names to be what you want:

colnames(tax_table(merged_data)) <- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")

Check that it worked

rank_names(merged_data) Then downstream functions that can take a rank name as optional argument should work as you expected. Note that the function call that you posted as "failed" above would have worked with the appropriate data and rank name (before you changed them with the assignment line above). So it's not surprise that before you renamed the ranks, the following worked

plot_bar(merged_data, x = "X.SampleID", fill = "Rank3") but this didn't

plot_bar(merged_data, x = "X.SampleID", fill = "Class") After you rename the ranks, the latter will work, and the former won't. Make sense?

Thanks for the feedback. I think it is helpful/encouraging to others.

joey

— Reply to this email directly or view it on GitHub.

joey711 commented 10 years ago

Glad it worked :)