joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/
579 stars 188 forks source link

Issue with transforming data to relative abundance #285

Closed ClaireA closed 10 years ago

ClaireA commented 10 years ago

After following these commands:

transform_sample_counts(oneyear, function(OTU) OTU/sum(OTU))

oneyear.firmi = filter_taxa(oneyear, function(x) sum(x > 3) > (0.2 * length(x)), TRUE)

I assumed that the data would have been trasnformed to relative abundances. Yet, when I plot the data using:

p = plot_bar(oneyear.firmi, "Phenotype", fill = "Genus", title = title, facet_grid = ~Family)

p + geom_bar(aes(color = Genus, fill = Genus), stat = "identity")

Has anyone reported issues while plotting fractional abundances??

Thanks,

Claire

joey711 commented 10 years ago

Your tranformation call didn't get saved anywhere. The phyloseq class isn't a reference class. Function outputs must be explicitly stored to be available later.

Here is the revised code that should work. The first line is the change to note most carefully.

oneyearRelabund = transform_sample_counts(oneyear, function(OTU) OTU/sum(OTU))
oneyear.firmi = filter_taxa(oneyearRelabund, function(x) sum(x > 3) > (0.2 * length(x)), TRUE)
p = plot_bar(oneyear.firmi, "Phenotype", fill = "Genus", title = title, facet_grid = ~Family)
p + geom_bar(aes(color = Genus, fill = Genus), stat = "identity")

Thanks for posting your issue. Someone else may get stuck on the same thing.

ClaireA commented 10 years ago

Hi and thansk for your quick response,

I failed to include in teh command that I did create an object while using the transform_sample_counts command. Nevertheless, I just did it again and it is still plotting without normalizing the data.

ANy idea why this is?I can always send you the files...

Claire

Marie-Claire Arrieta. M.Sc. Ph.D. Postdoctoral fellow Finlay Lab Michael Smith Laboratories

301 - 2185 East Mall

University of British Columbia Vancouver, BC. Canada V6T 1Z4

Telephone: (604)-822-1643 Fax: (604)-827-5202


From: Paul J. McMurdie [notifications@github.com] Sent: January-08-14 3:53 PM To: joey711/phyloseq Cc: Arrieta, Marie Claire Subject: Re: [phyloseq] Issue with transforming data to relative abundance (#285)

Your tranformation call didn't get saved anywhere. The phyloseq class isn't a reference class. Function outputs must be explicitly stored to be available later.

Here is the revised code that should work. The first line is the change to note most carefully.

oneyearRelabund = transform_sample_counts(oneyear, function(OTU) OTU/sum(OTU)) oneyear.firmi = filter_taxa(oneyearRelabund, function(x) sum(x > 3) > (0.2 * length(x)), TRUE) p = plot_bar(oneyear.firmi, "Phenotype", fill = "Genus", title = title, facet_grid = ~Family) p + geom_bar(aes(color = Genus, fill = Genus), stat = "identity")

Thanks for posting your issue. Someone else may get stuck on the same thing.

— Reply to this email directly or view it on GitHubhttps://github.com/joey711/phyloseq/issues/285#issuecomment-31889498.

joey711 commented 10 years ago

@ClaireA

If I can reproduce this error, then I can update it and re-open it. Is there some example code you can post (using one of the datasets in phyloseq, like GlobalPatterns) that reproduces this problem completely?

Thanks!

joey

ClaireA commented 10 years ago

Hello and thanks for replying,

I was away on vacation and I am just getting back to this.

Here is the list of commands I am using for my dataset and following that is the commands I used for the GlobalPatterns dataset.

setwd("~/Documents/Sequencing/CHILD")

library("phyloseq")

Loading required package: ade4

Attaching package: ‘ade4’

The following object is masked from ‘package:base’:

within

Loading required package: picante

Loading required package: ape

Loading required package: vegan

Loading required package: permute

Loading required package: lattice

This is vegan 2.0-9

Attaching package: ‘vegan’

The following object is masked from ‘package:ade4’:

cca

Loading required package: nlme

Warning message:

package ‘phyloseq’ was built under R version 3.0.2

file = "childfile3months.txt"

map = "childmap3months.txt"

parse_taxonomy_simple = function(char.vec) {

  • ranks = c("Domain", "Phylum", "Class", "Order", "Family", "Genus",
  • "Species", "Strain")
  • rv = strsplit(char.vec, ";")[[1]]
  • rv = c(rv, rep(NA, length(ranks) - length(rv)))
  • names(rv) = ranks
  • rv
  • }

child3months.table = import_qiime(otufilename = "childfile3months.txt", mapfilename = "childmap3months.txt", parseFunction = parse_taxonomy_simple)

Processing map file...

Processing otu/tax file...

Reading and parsing file in chunks ... Could take some time. Please be patient...

Building OTU Table in chunks. Each chunk is one dot.

....Building Taxonomy Table...

C3F = filter_taxa(child3months.table, function(x) sum(x > 3) > (0.2 * length(x)), TRUE)

TopNOTUs <- names(sort(taxa_sums(C3F), TRUE)[1:100])

C3FT100 <- prune_species(TopNOTUs, C3F)

C3FT = transform_sample_counts(C3FT100, function(OTU) OTU/sum(OTU))

plot_taxa_bar(C3FT, "Family", x = "Phenotype", fill = "TaxaGroup")

p = plot_bar(C3FT, "Phenotype", fill = "Genus", facet_grid = ~Family)

p + geom_bar(aes(color = Genus, fill = Genus), stat = "identity")

Error: could not find function "geom_bar"

library("ggplot2", lib.loc="/Library/Frameworks/R.framework/Versions/3.0/Resources/library")

p = plot_bar(C3FT, "Phenotype", fill = "Genus", facet_grid = ~Family)

p + geom_bar(aes(color = Genus, fill = Genus), stat = "identity")

data(GlobalPatterns)

GPF = filter_taxa(GlobalPatterns, function(x) sum(x > 3) > (0.2 * length(x)), TRUE)

TopNOTUs <- names(sort(taxa_sums(GPF), TRUE)[1:100])

GPF100 <- prune_species(TopNOTUs, GPF)

GPFT = transform_sample_counts(GPF100, function(OTU) OTU/sum(OTU))

plot_taxa_bar(GPFT, "Phylum", x = "SampleType", fill = "TaxaGroup")

Attached are 3 graphs. Two graphs are on my dataset and the third one of the GlobalPattern's one. The graph 'familyabundance3months' shows relative abundances of different taxa BUT the relative abundance proportions are not correct. It is not only impossible that one particular family (Bifidobacteriaceae) has a relative abundance of 1, but also there are other families present in the same phenotype. I also saw the same problem in the graph from GlobalPatterns 'GPexample.png' where the sum of the relative abundances of different taxa within the same SampleType are well over 1.

Someone suggested that I used plot_bar instead of plot_taxa_bar and I tried that for my data set and created the graph 'plot_barexamplearrieta'. As you can see, the relative abundances appear correct, however, the data is no longer normalized and the height of the bars are a reflection of the number of subjects in each Phenotype. In other words, the small bars always correspond to the "AW" group which has the least amount of samples in the whole data set, and the NAW group, with the highest number of samples, also has the tallest bars.

Any idea how to fix it? I am presenting in a meeting next week and would love to show some of my data by I just can't trust these graphs. If you'd like, I can also send you my 'map' and 'file' files.

Thanks!

Claire

Marie-Claire Arrieta. M.Sc. Ph.D. Postdoctoral fellow Finlay Lab Michael Smith Laboratories

301 - 2185 East Mall

University of British Columbia Vancouver, BC. Canada V6T 1Z4

Telephone: (604)-822-1643 Fax: (604)-827-5202


From: Paul J. McMurdie [notifications@github.com] Sent: January-22-14 10:01 AM To: joey711/phyloseq Cc: Arrieta, Marie Claire Subject: Re: [phyloseq] Issue with transforming data to relative abundance (#285)

@ClaireAhttps://github.com/ClaireA

If I can reproduce this error, then I can update it and re-open it. Is there some example code you can post (using one of the datasets in phyloseq, like GlobalPatterns) that reproduces this problem completely?

Thanks!

joey

— Reply to this email directly or view it on GitHubhttps://github.com/joey711/phyloseq/issues/285#issuecomment-33049286.

joey711 commented 10 years ago

Claire,

plot_taxa_bar is an officially deprecated function, and would have given you a clear warning as much if you were using the most recent version of phyloseq. It has been removed from phyloseq tutorials and documentation for more than a year, and I don't recommend using it.

I still cannot reproduce the other errors. One thing to keep in mind is that plot_bar will stack observations from multiple samples corresponding to the same group. If you need a special organization of the data you have to be careful how those stacks end up appearing, as they will not represent comparable totals across sample groups if one group has more samples in it, or if the library sizes have not been normalized.

See the Restroom Biogeography tutorial in the phyloseq demo repository. It shows how to perform additional adjustments to the data to mimic the (not necessarily recommended) common graphic organization of the stacked bars.

Also, if you could give me the precise code to reproduce your problem but with GlobalPatterns dataset, then I can give you a precise solution. I would like to make sure you feel comfortable presenting your results using phyloseq. Please keep in mind that this is package uses unit tests and rebuilds tutorials from multiple datasets on every new version. While bugs do come up from time to time, it is very unlikely that there is some massive misrepresentation of your data originating from functions like plot_bar. Most of the time when I help a user with an issue like this, there is something missing, or mis-organized/mislabeled in their data... or subsequent ggplot2 commands are doing something they did not expect. I can probably help clarify any of those issues if I have access to the data, or a reproducible example using included data.

Thanks for the feedback as always, and for giving phyloseq a chance

joey

ClaireA commented 10 years ago

Hello Joey and thank you for your quick and detailed reply. Here is a dropbox link to my 'map' and 'file' files.

REDACTED

And here is the code using plot_bar that is giving me an error:

child3months.table = import_qiime(otufilename = "childfile3months.txt", mapfilename = "childmap3months.txt", parseFunction = parse_taxonomy_simple) Processing map file... Processing otu/tax file...

Reading and parsing file in chunks ... Could take some time. Please be patient...

Building OTU Table in chunks. Each chunk is one dot. ....Building Taxonomy Table...

C3F = filter_taxa(child3months.table, function(x) sum(x > 3) > (0.2 * length(x)), TRUE)

TopNOTUs <- names(sort(taxa_sums(C3F), TRUE)[1:100]) C3FT100 <- prune_species(TopNOTUs, C3F) C3FT = transform_sample_counts(C3FT100, function(OTU) OTU/sum(OTU)) p = plot_bar(C3FT, "Phenotype", fill = "Genus", facet_grid = ~Family) p + geom_bar(aes(color = Genus, fill = Genus), stat = "identity") Error: could not find function "geom_bar"

library("ggplot2", lib.loc="/Library/Frameworks/R.framework/Versions/3.0/Resources/library") p = plot_bar(C3FT, "Phenotype", fill = "Genus", facet_grid = ~Family) p + geom_bar(aes(color = Genus, fill = Genus), stat = "identity")

Again, the problem that I have is that although it would appear as the group "AW" has a very low abundance of many different taxa, this group in particular has much less samples than the other 3 groups. To make it even more suspicious that the data is no longer normalized, the height of the bars in almost all of the taxa shown in teh graph reflect the number of samples in each group. The number of samples in each phenotype are: Phenotypes:

ANW: 86 AW: 22 NANW: 74 NAW: 134

As you can see, it appears as if the bars are adding up the abundances of the taxa, instead of averaging them.

Thanks again,

Claire

joey711 commented 10 years ago

Claire,

I redacted your link in case you were not aware that when you reply through github it is posted on a completely public forum page:

https://github.com/joey711/phyloseq/issues/285

I still have the link in my own email and can use it if necessary, as I am the intended recipient. If you don't mind, or want, the link to be shared with others, then feel free to add it back.

Okay, from your response I can tell that my suspicion was correct. The example I recommended on the Restroom Biogeography reanalysis vignette should work fine for you, and even reproduces the same issue you are having here and explains why. Please keep in mind this is the expected behavior of both plot_bar and ggplot2: when the variable mapped to the horizontal axis is discrete and has multiple samples per class, the elements in that bar plot are stacked on top of one another. There are ways to tell ggplot2 to have the bars "dodged" next to each other, or separated into panels. Both of which could still work if you wanted to display the intra-class variation in your data.

In your case, however, it sounds as though you just want to compare across sample classes (you've labeled them Phenotypes, it appears), using a simple average. You will need to perform this averaging with separate commands before plotting, and I personally like having an important data transformation like that separated (in my code) from the graphics commands. The main function for performing what you want is merge_samples. Its documentation should be adequate, but the Restroom Biogeography tutorial also likely has some useful steps as well.

Excerpt from Restroom Biogeography tutorial Figure 1 section:

... Hey that was easy! But the abundance values get over 500. How come? Oh, each of these categories includes more than one sample, so we expect them to exceed 500. In fact, you can see that the totals are multiples of 500. This is easy to fix by re-doing our earlier transformation on restroomR after mergeing by category. The phyloseq package includes a function specifically for doing this kind of categorical merge, called merge_samples.

(This is what you want to use, Claire)

restroomRm = merge_samples(restroomR, "SURFACE")

Repair the merged values associated with each surface after merge.

sample_data(restroomRm)$SURFACE <- levels(sample_data(restroomR)$SURFACE)

Transform to percentages of total available.

restroomRm = transform_sample_counts(restroomRm, function(x) 100 * x/sum(x))

Now we can plot Figure 1A with each category on equivalent footing.

title = "Figure 1 Part A (remake), attempt 2"
plot_bar(restroomRm, "SURFACE", fill = "family19", title = title) + coord_flip() + 
    ylab("Percentage of Sequences")

See the page itself for more details, and a rendering of the output:

http://joey711.github.io/phyloseq-demo/Restroom-Biogeography

joey711 commented 10 years ago

As there does not appear to be a bug, and the previous should solve your issue, I will re-close this. I can always re-open the issue if there are further problems. Please respond either way so that it stays clear for future users whether this solved the problem.

Thanks as always for the feedback!

joey

ClaireA commented 10 years ago

Thank you, it completely corrected the issue!

Claire

Marie-Claire Arrieta. M.Sc. Ph.D. Postdoctoral fellow Finlay Lab Michael Smith Laboratories

301 - 2185 East Mall

University of British Columbia Vancouver, BC. Canada V6T 1Z4

Telephone: (604)-822-1643 Fax: (604)-827-5202


From: Paul J. McMurdie [notifications@github.com] Sent: February-03-14 1:40 PM To: joey711/phyloseq Cc: Arrieta, Marie Claire Subject: Re: [phyloseq] Issue with transforming data to relative abundance (#285)

As there does not appear to be a bug, and the previous should solve your issue, I will re-close this. I can always re-open the issue if there are further problems. Please respond either way so that it stays clear for future users whether this solved the problem.

Thanks as always for the feedback!

joey

— Reply to this email directly or view it on GitHubhttps://github.com/joey711/phyloseq/issues/285#issuecomment-34003378.