Closed kfontanez closed 10 years ago
Hi Kristina,
Can you provide me a copy of the file that causes this error (but can load with the biom package)? The email address listed in the phyloseq package documentation will work fine.
Also, what version of phyloseq are you using?
joey
Great, thanks! I sent you the file.
I currently have a very similar problem to Kristina's, so I figured I'd add it here.
I created a BIOM file using a Perl script, and can successfully read / manipulate it using QIIME code. Also, previous versions of the BIOM file used to load alright in phyloseq. However, after I added sample metadata to the file, phyloseq no longer loads it. This is the error that I get:
Error in validObject(.Object) : invalid class “phyloseq” object:
Component sample names do not match.
Try sample_names()
The sample metadata part of my BIOM file looks like this:
"columns": [
{"id": "1", "metadata":{"Animal":"1",
"Treatment":"XXX",
"Timepoint":"0",
"CombinedGroup":"1.0.XXX"
}},
{"id": "2", "metadata":{"Animal":"2",
"Treatment":"YYY",
"Timepoint":"0",
"CombinedGroup":"2.0.YYY"
}},
…
{"id": "80", "metadata":{"Animal":"32",
"Treatment":"YYY",
"Timepoint":"2",
"CombinedGroup":"32.2.YYY"
}}
]
Thanks in advance for any help / input you can provide :-)
*Fleury
The point is that the sample_names in the data.frame that results from sample_metadata() need to match the sample (column) names in the observation matrix. Non-matching indices are not allowed in phyloseq, but might be tolerated to some extent in the biom format generally (though I could be wrong). For the time being, I think the biom format is a little less restrictive. In any case, see if you can successfully get those tables out of your biom files without loading phyloseq at all. If so, then you can check the indices and build the phyloseq object from its components with a single call to the phyloseq() function. Please report back what you find. I want to make sure there is not actually a bug in phyloseq. I doubt it is a coincidence that you both created these biom-format files through non-standard means. I don't know exactly what is wrong yet, but the inspection I've exemplified below will help us all figure it out.
library("biom")
rich_sparse_file = system.file("extdata", "rich_sparse_char.biom", package = "biom")
rich_sparse_file
## [1] "/Library/Frameworks/R.framework/Versions/3.0/Resources/library/biom/extdata/rich_sparse_char.biom"
biom = read_biom(rich_sparse_file)
biom_shape(biom)
## nrow ncol
## 5 6
observation_metadata(biom)
## taxonomy1 taxonomy2 taxonomy3
## GG_OTU_1 k__Bacteria p__Proteobacteria c__Gammaproteobacteria
## GG_OTU_2 k__Bacteria p__Cyanobacteria c__Nostocophycideae
## GG_OTU_3 k__Archaea p__Euryarchaeota c__Methanomicrobia
## GG_OTU_4 k__Bacteria p__Firmicutes c__Clostridia
## GG_OTU_5 k__Bacteria p__Proteobacteria c__Gammaproteobacteria
## taxonomy4 taxonomy5 taxonomy6
## GG_OTU_1 o__Enterobacteriales f__Enterobacteriaceae g__Escherichia
## GG_OTU_2 o__Nostocales f__Nostocaceae g__Dolichospermum
## GG_OTU_3 o__Methanosarcinales f__Methanosarcinaceae g__Methanosarcina
## GG_OTU_4 o__Halanaerobiales f__Halanaerobiaceae g__Halanaerobium
## GG_OTU_5 o__Enterobacteriales f__Enterobacteriaceae g__Escherichia
## taxonomy7
## GG_OTU_1 s__
## GG_OTU_2 s__
## GG_OTU_3 s__
## GG_OTU_4 s__Halanaerobiumsaccharolyticum
## GG_OTU_5 s__
sample_metadata(biom)
## BarcodeSequence LinkerPrimerSequence BODY_SITE Description
## Sample1 CGCTTATCGAGA CATGCTGCCTCCCGTAGGAGT gut human gut
## Sample2 CATACCAGTAGC CATGCTGCCTCCCGTAGGAGT gut human gut
## Sample3 CTCTCTACCTGT CATGCTGCCTCCCGTAGGAGT gut human gut
## Sample4 CTCTCGGCCTGT CATGCTGCCTCCCGTAGGAGT skin human skin
## Sample5 CTCTCTACCAAT CATGCTGCCTCCCGTAGGAGT skin human skin
## Sample6 CTAACTACCAAT CATGCTGCCTCCCGTAGGAGT skin human skin
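Once both tables are out of the biom file, the name comparison described above can be sketched in base R. This is only an illustration: otumat and sampledf are made-up stand-ins for the observation matrix and the sample metadata data.frame, and compare_sample_names() is a hypothetical helper, not part of phyloseq or the biom package.

```r
# Hypothetical stand-ins for the two tables pulled out of a biom file.
otumat <- matrix(1:6, nrow = 2,
                 dimnames = list(c("OTU1", "OTU2"), c("1", "2", "3")))
sampledf <- data.frame(Treatment = c("XXX", "YYY", "XXX"),
                       row.names = c("1", "2", "Sample3"))

# compare_sample_names() is an illustrative helper, not a phyloseq function.
# It lists the sample names that appear in only one of the two tables.
compare_sample_names <- function(otumat, sampledf) {
  list(
    only_in_otu_table = setdiff(colnames(otumat), rownames(sampledf)),
    only_in_metadata  = setdiff(rownames(sampledf), colnames(otumat))
  )
}

mismatch <- compare_sample_names(otumat, sampledf)
mismatch
```

If both elements of the returned list are empty, the sample names agree as sets. Any name listed there is a candidate cause of the "Component sample names do not match" error.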
So, I've tried different things in the meantime on my BIOM table.
(i) I tried loading it using the biom package in R -> constantly fails with the same error message:
Error in validObject(.Object) :
invalid class "biom" object: type field has unsupported value
…which is rather cryptic to me.
(ii) I modified several things about my BIOM file, mostly regarding the sample metadata entries, and tried re-loading it using the biom R package. Always fails, I have no clue really what else I could modify. I tried:
-> playing around spaces/non-spaces and quotes vs non-quotes in the metadata lines, like so:
"Treatment":"XXX",
vs
"Timepoint": 0,
…but to no avail
-> entering all the metadata fields that QIIME requests by default ("BarcodeSequence" etc). Didn't change anything.
-> changing the "format" field to "1.0.0", because "0.9*" seemed to annoy some QIIME scripts; but again, that didn't do anything.
Even though I used a custom Perl script to generate the BIOM file, I think that the format is OK as QIIME prints library stats without complaints and rightly reports sample metadata fields:
Sample Metadata Categories: Timepoint; Treatment; Animal; CombinedGroup
For the time being, I think I will stick to good-old scripting to analyze my data. I thought I'd give BIOM and phyloseq a go, but I realize that this generates more hassle for my data processing than it makes my life easier. I think you're doing a great thing with the phyloseq package, but for the level of flexibility I require (which includes non-QIIME preprocessing and thus custom scripts to make a BIOM file…), it is currently not yet right for me.
Keep up the good work! Best,
*Fleury
Fleury,
Before you give up on the biom-format, and then on phyloseq as collateral, did you try the main biom-format tools from the biom-format project, which are written in Python? The biom-format is a JSON format, so you might also check whether there are syntactical errors in the file. Those tools might give you a clearer diagnostic about what is wrong with your biom file. The fact that QIIME can read library stats from your file doesn't tell me very much, and doesn't validate the file format.
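Since the biom package error quoted earlier complains about the "type" field, one quick thing to look at before reaching for the Python tools is what that field actually contains. The sketch below uses only base R and a regular expression, so it is not a JSON validator. biom_type_field() is a hypothetical helper, example_json is made up, and the list of accepted values is written from my recollection of the BIOM 1.0 specification, so please check it against the spec. Whether the biom package compares the value case-sensitively is also an assumption.

```r
# Illustrative helper (not part of the biom package): pull the value of the
# first "type" key out of the raw JSON text. Keys such as "matrix_type" are
# not matched, because the pattern requires a quote directly before 'type'.
biom_type_field <- function(json_text) {
  json_text <- paste(json_text, collapse = "\n")
  pattern <- '"type"[[:space:]]*:[[:space:]]*"[^"]*"'
  hit <- regmatches(json_text, regexpr(pattern, json_text))
  if (length(hit) == 0) return(NA_character_)
  sub('^"type"[[:space:]]*:[[:space:]]*"([^"]*)"$', "\\1", hit)
}

# Controlled vocabulary as I recall it from the BIOM 1.0 spec (assumption).
known_types <- c("OTU table", "Pathway table", "Function table",
                 "Ortholog table", "Gene table", "Metabolite table",
                 "Taxon table")

# Made-up header of a hand-written biom file; note the lower-case value.
example_json <- '{"id": null, "format": "Biological Observation Matrix 1.0.0",
  "type": "otu table", "matrix_type": "sparse"}'

found_type <- biom_type_field(example_json)
type_is_known <- found_type %in% known_types
```

For a real file, the text could be read with readLines("yourfile.biom") and passed to the same helper. A value that differs from the list only in capitalisation or spacing would be worth correcting first.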
Finally, and this is really important, phyloseq is not wedded to QIIME or the biom-format. By design. The notion of "I want to be flexible about my sequence processing and downstream analysis" is exactly a reason to use phyloseq, not the other way around. If you're having trouble importing a biom-format file, you can just import your data into R as tables. I spent a fair amount of time creating the data infrastructure necessary to allow a user to relate their data tables as a phyloseq object, and the relevant functions to look at are phyloseq and merge_phyloseq, although phyloseq is probably what you want. If you're into "scripting", you should find R really useful, and importing data tables into R really easy. The only non-table you might want right away is a phylogenetic tree, which isn't yet supported by biom-format anyway. The read_tree function in phyloseq will import that for you.
Hope that helps. Best of luck
joey
Joey-
So, I followed your directions and was able to recreate your output. However, I'm not sure how that helps to solve my issue of importing my OTU table into phyloseq. In your last e-mail to Fleury you mentioned that it is possible to import an OTU table directly into the phyloseq package. Can you post explicit directions for how that is done? If I can avoid the biom format entirely, all the better. The OTU table originally didn't have the otu column, which I added when trying to convert to biom format. It used to look like taxonomy/150L/150D/ etc.
My starting OTU table is of the format:
otu   150L    150D    200D   300L    300D   500L    500D  taxonomy
otu1  468035  1185    330    237111  94     232341  194   Alteromonas
otu2  54696   465193  56075  13513   24703  6713    1446  Vibrio
otu3  327010  2522    1288   99880   1193   10119   934   Pseudoalteromonas
otu4  276939  615     783    145627  106    1829    486   Marinobacter
...
And my metadata file looks like:
150L  150  live
150D  150  dead
200D  200  dead
300L  300  live
300D  300  dead
500L  500  live
500D  500  dead
I have attached them both to this email for your consideration.
Thank you, Kristina
On Oct 30, 2013, at 6:20 PM, Paul J. McMurdie wrote:
Something for you both to try using biom-package: (quoted copy of the read_biom(), biom_shape(), observation_metadata() and sample_metadata() example shown earlier in this thread)
Attached metadata file:
150L  150  live
150D  150  dead
200D  200  dead
300L  300  live
300D  300  dead
500L  500  live
500D  500  dead
Attached OTU table:
otu   150L    150D    200D   300L    300D   500L    500D  taxonomy
otu1  468035  1185    330    237111  94     232341  194   Alteromonas
otu2  54696   465193  56075  13513   24703  6713    1446  Vibrio
otu3  327010  2522    1288   99880   1193   10119   934   Pseudoalteromonas
otu4  276939  615     783    145627  106    1829    486   Marinobacter
otu5  64967   222     170    196424  47     477     218   Methylophaga
otu6  65448   300     181    88196   35     15495   203   Alcanivorax
otu7  2689    1333    2354   21578   553    51619   5377  Candidatus Pelagibacter
otu8  34502   1261    344    22313   151    17892   404   Glaciecola
...
How to use manually-imported tables in R and combine them together in a phyloseq object. We'll create the example vanilla R tables using base R code. No packages required yet.
# pretend OTU table that you read from a file, called otumat
otumat = matrix(sample(1:100, 100, replace = TRUE), nrow = 10, ncol = 10)
otumat
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 4 41 60 84 33 31 70 90 71 30
## [2,] 69 59 14 75 92 69 26 30 90 54
## [3,] 63 66 3 23 17 89 84 95 35 81
## [4,] 4 25 69 88 30 35 14 36 72 18
## [5,] 53 35 3 20 18 53 56 60 84 1
## [6,] 35 97 15 41 44 26 55 55 20 6
## [7,] 35 80 10 33 95 60 17 27 13 2
## [8,] 83 70 89 21 42 49 59 45 35 33
## [9,] 47 55 91 59 16 54 33 61 47 32
## [10,] 2 51 24 19 59 69 24 88 76 98
# It needs sample names and OTU names, the index names of the matrix.
# Your table might already have this.
rownames(otumat) <- paste0("OTU", 1:nrow(otumat))
colnames(otumat) <- paste0("Sample", 1:ncol(otumat))
otumat
## Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8
## OTU1 4 41 60 84 33 31 70 90
## OTU2 69 59 14 75 92 69 26 30
## OTU3 63 66 3 23 17 89 84 95
## OTU4 4 25 69 88 30 35 14 36
## OTU5 53 35 3 20 18 53 56 60
## OTU6 35 97 15 41 44 26 55 55
## OTU7 35 80 10 33 95 60 17 27
## OTU8 83 70 89 21 42 49 59 45
## OTU9 47 55 91 59 16 54 33 61
## OTU10 2 51 24 19 59 69 24 88
## Sample9 Sample10
## OTU1 71 30
## OTU2 90 54
## OTU3 35 81
## OTU4 72 18
## OTU5 84 1
## OTU6 20 6
## OTU7 13 2
## OTU8 35 33
## OTU9 47 32
## OTU10 76 98
# Now we need a pretend taxonomy table
taxmat = matrix(sample(letters, 70, replace = TRUE), nrow = nrow(otumat), ncol = 7)
rownames(taxmat) <- rownames(otumat)
colnames(taxmat) <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus",
"Species")
taxmat
## Domain Phylum Class Order Family Genus Species
## OTU1 "k" "m" "c" "f" "h" "k" "x"
## OTU2 "r" "t" "f" "z" "u" "k" "e"
## OTU3 "g" "e" "o" "l" "f" "s" "x"
## OTU4 "k" "c" "d" "j" "y" "y" "c"
## OTU5 "q" "c" "p" "p" "s" "w" "h"
## OTU6 "i" "r" "v" "t" "z" "x" "n"
## OTU7 "i" "u" "h" "n" "a" "x" "a"
## OTU8 "r" "a" "c" "i" "h" "z" "w"
## OTU9 "a" "e" "q" "o" "f" "q" "b"
## OTU10 "u" "w" "o" "e" "y" "m" "e"
class(otumat)
## [1] "matrix"
class(taxmat)
## [1] "matrix"
Note how these are just vanilla R matrices. Now let's tell phyloseq how to combine them into a phyloseq object.
library("phyloseq")
OTU = otu_table(otumat, taxa_are_rows = TRUE)
TAX = tax_table(taxmat)
OTU
## OTU Table: [10 taxa and 10 samples]
## taxa are rows
## Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8
## OTU1 4 41 60 84 33 31 70 90
## OTU2 69 59 14 75 92 69 26 30
## OTU3 63 66 3 23 17 89 84 95
## OTU4 4 25 69 88 30 35 14 36
## OTU5 53 35 3 20 18 53 56 60
## OTU6 35 97 15 41 44 26 55 55
## OTU7 35 80 10 33 95 60 17 27
## OTU8 83 70 89 21 42 49 59 45
## OTU9 47 55 91 59 16 54 33 61
## OTU10 2 51 24 19 59 69 24 88
## Sample9 Sample10
## OTU1 71 30
## OTU2 90 54
## OTU3 35 81
## OTU4 72 18
## OTU5 84 1
## OTU6 20 6
## OTU7 13 2
## OTU8 35 33
## OTU9 47 32
## OTU10 76 98
TAX
## Taxonomy Table: [10 taxa by 7 taxonomic ranks]:
## Domain Phylum Class Order Family Genus Species
## OTU1 "k" "m" "c" "f" "h" "k" "x"
## OTU2 "r" "t" "f" "z" "u" "k" "e"
## OTU3 "g" "e" "o" "l" "f" "s" "x"
## OTU4 "k" "c" "d" "j" "y" "y" "c"
## OTU5 "q" "c" "p" "p" "s" "w" "h"
## OTU6 "i" "r" "v" "t" "z" "x" "n"
## OTU7 "i" "u" "h" "n" "a" "x" "a"
## OTU8 "r" "a" "c" "i" "h" "z" "w"
## OTU9 "a" "e" "q" "o" "f" "q" "b"
## OTU10 "u" "w" "o" "e" "y" "m" "e"
physeq = phyloseq(OTU, TAX)
physeq
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 10 taxa and 10 samples ]
## tax_table() Taxonomy Table: [ 10 taxa by 7 taxonomic ranks ]
plot_bar(physeq, fill = "Family")
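For a tab-delimited table shaped like the one Kristina posted, the same recipe might look roughly like the sketch below. It is an adaptation, not something from the thread: the first four rows of her table are pasted in as text so the example is self-contained, the metadata column names "depth" and "state" are my own guesses at what those columns mean, and treating the taxonomy column as a single "Genus" rank is likewise an assumption. The phyloseq calls only run if the package is installed.

```r
# First four rows of the posted OTU table, embedded as tab-delimited text.
otu_text <- paste(
  "otu\t150L\t150D\t200D\t300L\t300D\t500L\t500D\ttaxonomy",
  "otu1\t468035\t1185\t330\t237111\t94\t232341\t194\tAlteromonas",
  "otu2\t54696\t465193\t56075\t13513\t24703\t6713\t1446\tVibrio",
  "otu3\t327010\t2522\t1288\t99880\t1193\t10119\t934\tPseudoalteromonas",
  "otu4\t276939\t615\t783\t145627\t106\t1829\t486\tMarinobacter",
  sep = "\n")

# check.names = FALSE keeps sample names such as "150L" unmodified.
tab <- read.table(text = otu_text, header = TRUE, sep = "\t", row.names = 1,
                  check.names = FALSE, stringsAsFactors = FALSE)

# Counts go into one matrix, the taxonomy column into another.
otumat <- as.matrix(tab[, setdiff(colnames(tab), "taxonomy")])
taxmat <- matrix(tab$taxonomy, ncol = 1,
                 dimnames = list(rownames(tab), "Genus"))

# Sample metadata; the column names "depth" and "state" are assumed.
meta_text <- paste("150L\t150\tlive", "150D\t150\tdead", "200D\t200\tdead",
                   "300L\t300\tlive", "300D\t300\tdead", "500L\t500\tlive",
                   "500D\t500\tdead", sep = "\n")
sampledf <- read.table(text = meta_text, header = FALSE, sep = "\t",
                       row.names = 1, stringsAsFactors = FALSE,
                       col.names = c("sample", "depth", "state"))

# The sample names must agree between the two tables.
stopifnot(setequal(colnames(otumat), rownames(sampledf)))

# Combine the pieces, but only when phyloseq is available.
if (requireNamespace("phyloseq", quietly = TRUE)) {
  physeq <- phyloseq::phyloseq(
    phyloseq::otu_table(otumat, taxa_are_rows = TRUE),
    phyloseq::tax_table(taxmat),
    phyloseq::sample_data(sampledf)
  )
  print(physeq)
}
```

With real files, read.table(file = ...) would replace the text = arguments. The rest should carry over as long as the first column holds the OTU identifiers.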
Joey-
I was able to follow your directions to create the phyloseq object, thank you! I have been plotting ordinations using the plot_ordination function and I noticed that the text labels produced are really tiny. I tried changing them using theme_update to ggplot2 but was unable to find the correct element to change. How does one do this?
Functions to make plot and attached plot example with tiny labels below. I’d like to make these bigger. If I use geom_point(size=3) then the size of the circles completely overlaps the text label (which is too tiny to read anyway).
plot_ordination(Bacteria, AllBacteriacca, "samples", color="TREATMENT", label="DEPTH")
Thanks, Kristina
On Oct 31, 2013, at 4:36 PM, Paul J. McMurdie notifications@github.com wrote:
(quoted copy of the phyloseq() example above)
Joey,
sorry for replying with delay; I've had an offline weekend.
Thanks a lot for the step-by-step tutorial on creating a phyloseq object from vanilla R. I've reformatted my data accordingly and after some tinkering I've managed to load everything, and it seems to work fine. For me, this way of handling the data was much more efficient, as soon as I got a grasp of some phyloseq subtleties.
Also, I hope that you didn't get the impression that I was proposing to "give up on" phyloseq for good in my earlier post. I honestly appreciate the work you do here, and even more so the effort you spend on documenting and on answering issues such as this. It's just that I have working R scripts in place already, but wanted to give phyloseq a go as it seemed nice and convenient for several functions that are a pain to implement (again, thanks for the good work!), but felt that the biom import issues I encountered presented a significant obstacle. Anyway, the workaround you pointed out works perfectly for me :-)
Thanks again for the patient answers and nice tutorial above. And sorry @Kristina for entering your thread; but as I see, you've successfully overcome your problems, too!
Best,
Fleury
@kfontanez
I'm really glad that my suggestions solved your problem. That's great! ... But now you're hijacking your own issue post. Don't be shy about posting a new, unrelated issue. There are some interesting ways of dealing with text sizes in ggplot2 objects, and some that are specific to one or another kind of plot produced by plot_ordination. Feel free to mostly copy-and-paste what you've started above, but maybe use an example dataset in phyloseq for a reproducible example of your issue. This will make it a lot faster and easier for me to help.
@defleury
No worries! Yes, I did have the impression you were giving up on phyloseq for ever and ever, which I naturally felt was a bit hasty. However, I completely understand your frustration when fighting file format issues, in particular because I've had to wage many of these fights myself to create the supporting wrappers in phyloseq. I have not done a good job emphasizing the details that I showed above, and so the fact that this wasn't an obvious option to try is my fault. Please accept my apology.
On the one hand, I wanted to make it clear to R-newbie users that they could try some of the phyloseq examples on their own data without knowing much about R. On the other hand, this did a disservice to the very real support for doing things manually/interactively prior to handing the pieces to phyloseq to wrap them up in one consistent object and use further phyloseq goodies.
I'll leave this post open until I've created and linked a tutorial just for demonstrating this a little better. There are some hints in the general phyloseq demo, but when I reviewed it to see if I could point to that link, I realized it was quite inadequate.
Thank you both for your interest and feedback, without which phyloseq is not likely to do as well.
I added a new section on the data import tutorial
http://joey711.github.io/phyloseq/import-data.html#manual
It has a more detailed explanation, and includes randomly simulated tree and sample covariate data.
I think this settles the missing documentation for now. I'll probably migrate this as a vignette to include within the package itself as well.
Thanks again for all the useful feedback.
joey
I tried creating a biom file using the Python package but ended up creating the file manually. I started from a simple taxa count table where the columns are samples and the rows are taxonomic identifications (genus level). The values represent counts of those genera in each sample. Unfortunately, the biom-format Python package failed to create a properly formatted biom file from that basic input.
So, I manually created the biom file in TextWrangler (Unix line breaks) but keep encountering an error when trying to import it into phyloseq. I believe the format is correct because when I check it with the R biom package, it is recognized as a biom object containing a sparse OTU table.
However, when I open phyloseq and try to import the file, I get the following error:
This seems to me some type of bug in the code rather than a problem with the file. Has anyone seen this type of error before?
My metadata, taxonomy section looks like this:
thanks for your help! Kristina