joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/
567 stars 187 forks source link

using the same sample for two treatment categories #278

Closed laurenms closed 10 years ago

laurenms commented 10 years ago

hi, for our experiments, we made two initial microcosm mixtures (filtered and unfiltered), divided each into two, treatment (exposed to light) and control (kept in dark), and sampled them over time. so for every time point we sequenced 4 samples, one treatment (light) and one control (dark) for each of two conditions (filtered and unfiltered). but we only sequenced the initial microcosm mixtures (filtered and unfiltered) once at time 0 so there is an NA in the map file for the treatment (light or dark).

a problem arises using plot_bar. here is the code and attached is the plot. plot_bar(dataR_nosea_taxglom_relab, "Time", fill = "Phylum", facet_grid = Treatment ~ Filtered_Unfiltered)

as you can see in the figure, the initial samples plot under NA. I would like them to plot under both treatments (light and dark). I'm guessing this means I need to create a phyloseq object with the samples labeled differently and in there twice but I haven't been able to figure out how to do this. any suggestions or examples? thank you!!! lauren

plot_bar_phylum

joey711 commented 10 years ago

Are the initial samples closer to the "light" or the "dark"?

It is possible to duplicate the initial values for both Light and Dark, but this risks being misleading to the reader because it will appear in the graphic that you made an extra pair of observations with high precision.

It should only take 3 lines or so, using subset_samples, merge_samples, and a little hacking to re-write the sample names for the duplicated subset of the data before you merge it back with the full dataset. I can try to sketch out an example in a later post.

laurenms commented 10 years ago

The initial samples are what the "light" and "dark" samples are made from so I wouldn't say the initial samples were closer to either.

I will be careful to make it clear in the caption that the initials are plotted twice.

If you had a minute to sketch an example that would be great. I'll look at "subset_samples" and "merge_samples" to see if I can figure it out too.

Thanks!!

----- Original Message ----- From: "Paul J. McMurdie" notifications@github.com To: "joey711/phyloseq" phyloseq@noreply.github.com Cc: "laurenms" lms6@stanford.edu Sent: Monday, December 16, 2013 2:16:44 PM Subject: Re: [phyloseq] using the same sample for two treatment categories (#278)

Are the initial samples closer to the "light" or the "dark"?

It is possible to duplicate the initial values for both Light and Dark, but this risks being misleading to the reader because it will appear in the graphic that you made an extra pair of observations with high precision.

It should only take 3 lines or so, using subset_samples, merge_samples, and a little hacking to re-write the sample names for the duplicated subset of the data before you merge it back with the full dataset. I can try to sketch out an example in a later post.


Reply to this email directly or view it on GitHub: https://github.com/joey711/phyloseq/issues/278#issuecomment-30706780

joey711 commented 10 years ago

Issue 278

library("phyloseq")
data("soilrep")
soilrep
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 16825 taxa and 56 samples ]
## sample_data() Sample Data:       [ 56 samples by 4 sample variables ]
# For speed, lets keep only the most-abundant 100 OTUs
keepTaxa = names(sort(taxa_sums(soilrep), decreasing = TRUE)[1:20])
soilrep = prune_taxa(keepTaxa, soilrep)
soilrep
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 20 taxa and 56 samples ]
## sample_data() Sample Data:       [ 56 samples by 4 sample variables ]
# What are the variables listed?
sample_variables(soilrep)
## [1] "Treatment" "warmed"    "clipped"   "Sample"
get_variable(soilrep, "Treatment")
##  [1] UC UU WU UU WC WU UU UC UC UU WC UC UC WC WC WC UU UU WU WC WU UU WU
## [24] WU WC UC UC WU WU WU WC UU WC UC WC WC UC UU UU WU WU WC UU WU UU UC
## [47] WC WU UC UC UC WU UU WC UC UU
## Levels: UC UU WC WU
# Create a Time variable
sample_data(soilrep)$Time <- substr(sample_data(soilrep)$Sample, 1, 1)
# Rename entries for `warmed` and `clipped`
sample_data(soilrep)$warmed <- ifelse(sample_data(soilrep)$warmed == "yes", 
    "warmed", "unwarmed")
sample_data(soilrep)$clipped <- ifelse(sample_data(soilrep)$clipped == "yes", 
    "clipped", "unclipped")

This dataset has its own control samples, and so can help illustrate the steps needed to duplicate "control" samples and merge them with the rest of the data. We will treat the UU samples at Time-1 as equivalent to the time-zero controls in your study. First I need to add a taxonomyTable to this example dataset because it does not have one. I will adapt the example from the data-import tutorial

taxmat = matrix(sample(letters[1:6], 70, replace = TRUE), nrow = ntaxa(soilrep), 
    ncol = 7)
rownames(taxmat) <- taxa_names(soilrep)
colnames(taxmat) <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", 
    "Species")
soilrep = merge_phyloseq(soilrep, tax_table(taxmat))
soilrep
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 20 taxa and 56 samples ]
## sample_data() Sample Data:       [ 56 samples by 5 sample variables ]
## tax_table()   Taxonomy Table:    [ 20 taxa by 7 taxonomic ranks ]

Of course, the tax_table in this case is arbitrary and meaningless, and wouldn't reflect any consistency with a phylogenetic tree, were one available, but that's okay for this illustration.

I want this to match the example's NA control values, so I'll re-assign Time-1 variables entries for warmed as NA, rather than the control ("unwarmed") entries they have now.

controlSamples = sample_names(subset_samples(soilrep, warmed == "unwarmed" & 
    Time == "1"))
sample_data(soilrep)[controlSamples, ]$warmed <- NA
sample_data(soilrep)
## Sample Data:        [56 samples by 5 sample variables]:
##        Treatment   warmed   clipped Sample Time
## a_C026        UC unwarmed   clipped    6CC    6
## a_C066        UU unwarmed unclipped    3UC    3
## a_C070        WU   warmed unclipped    5UW    5
## a_C074        UU unwarmed unclipped    2UC    2
## a_C075        WC   warmed   clipped    5CW    5
...

1. Copy a subset of the control samples you want to duplicate

In this case, warmed is NA.

controlUU = subset_samples(soilrep, is.na(warmed) & Time == "1")
controlUU
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 20 taxa and 5 samples ]
## sample_data() Sample Data:       [ 5 samples by 5 sample variables ]
## tax_table()   Taxonomy Table:    [ 20 taxa by 7 taxonomic ranks ]

2. Hacking part: rename each sample in its component object

# sample data
sd = sample_data(controlUU)
rownames(sd) <- paste0(rownames(sd), "_dup")
# otu_table
OTU = otu_table(controlUU)
sample_names(OTU) <- paste0(sample_names(OTU), "_dup")
# Avoid conflict between factor and character
sd$clipped <- as.character(sd$clipped)
# Pretend these came from the warmed case, even though they didn't They are
# already in the main dataset as NA
sd$warmed <- "warmed"

3. Add the copied/sample-renamed components to the original dataset

data1 = merge_phyloseq(soilrep, sd, OTU)
data1
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 20 taxa and 61 samples ]
## sample_data() Sample Data:       [ 61 samples by 5 sample variables ]
## tax_table()   Taxonomy Table:    [ 20 taxa by 7 taxonomic ranks ]

Also need to set the original control copies as "unwarmed". Would be "Light" or "Dark" in your example.

sample_data(data1)$warmed[is.na(sample_data(data1)$warmed)] <- "unwarmed"

4. Compare both plots, with and without duplicated samples

plot_bar(soilrep, "Time", fill = "Phylum", facet_grid = warmed ~ clipped)

without-fix

plot_bar(data1, "Time", fill = "Phylum", facet_grid = warmed ~ clipped)

with-fix-and-duplicated-samples

joey711 commented 10 years ago

Note that I fixed a minor internal issue in phyloseq to allow the hacking part to work. Relates to character variables being coerced to factors automatically when sample_data was called. This is now fixed in version 1.7.11, and so you'll need to install this version to get the exact same results as above. I'm testing the version now before I roll it into the master branch on GitHub.

laurenms commented 10 years ago

Thanks so much Joey!!!

To update phyloseq to 1.7.11, I did the following:

install.packages("devtools") library("devtools") install_github("phyloseq", "joey711")

Here is the output. There were 18 warnings but it seemed to work. Does that seem right to you???

laurenms commented 10 years ago

Joey, I followed your example with my data and it worked perfectly! Thank you!!!

I had a quick question about the order of things. Should I first make the object with the duplicate initial condition, then rarefy, then use tax_glom or topp/topf and plot_bar? Or should I rarefy first, then duplicate the initials samples, then tax_glom and plot_bar? If I wanted to plot relative abundance would I transform_samples before or after tax_glom or topp/topf?

Thanks again! lauren

joey711 commented 10 years ago

Lauren,

Sorry, I haven't finished testing version 1.7.11, that is why when you re-installed you still ended up with 1.7.10. I'm working on it and will have it done very soon.

As for pre-processing your counts, you probably want to do any agglomeration (tax_glom) first, then filtering, then transformation... but order can depend on what your filters/transformations are going to be. It is probably best to duplicate samples at the end, especially if your filter(s) depend on the number of samples that pass certain criteria.

As for rarefying: You shouldn't rarefy. Ever.

See this thread on the topic:

https://github.com/joey711/phyloseq/issues/229

And for completeness, Susan and I explain in detail why you shouldn't rarefy in our pre-publication draft on Q-Bio-arXiv.

http://arxiv.org/abs/1310.0424

Make sure to look at revision 2, its figures are nicer :-) We're hoping to get the complete article out in an open-access journal very soon.

Hope that helps. Version 1.7.10 may have worked fine for you. If I can get 1.7.11 soon enough for you then you should update and rerun it with that.

Cheers

joey

p.s. I'll close this once 1.7.11 is officially in the master branch...

joey711 commented 10 years ago

1.7.11 is now available. This should complete what you need for this issue.

Hope that works!

joey