joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/

plot_bar() silently drops samples, for a certain sample name scheme #873

Open mikemc opened 6 years ago

mikemc commented 6 years ago

If sample names are strings of numbers and the name of a sample starts with a zero, then that sample will be dropped without any warning or error when calling plot_bar(). Here is a minimal working example:

sample.names <- c("01", "10", "12", "13")

sam.df <- data.frame(Var.1 = c(rep("A", 2), rep("B", 2)))
rownames(sam.df) <- sample.names

seq.tab <- matrix(rpois(16, 10), nrow=4)
rownames(seq.tab) <- sample.names
colnames(seq.tab) <- paste0("taxon", 1:4)

ps <- merge_phyloseq(sample_data(sam.df), otu_table(seq.tab, taxa_are_rows = FALSE))
plot_bar(ps)

jeffkimbrel commented 6 years ago

This appears to happen in psmelt(), which plot_bar() calls, specifically at this line: mdf = reshape2::melt(as(otutab, "matrix")). melt() strips leading zeros from the sample names, so sample "01" becomes 1. When mdf <- merge(mdf, sdf, by.x = "Sample") is called later, the names no longer match, so that sample cannot be merged and is dropped.
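The mechanism described above can be reproduced in base R without phyloseq. This is a minimal sketch, assuming that melt() converts the matrix dimnames with type.convert() (which is what the reshape2 documentation describes for the default as.is = FALSE). The data frames mdf and sdf here are small stand-ins for the ones built inside psmelt(), not the real objects.

```r
# Sample names as in the original example
sample.names <- c("01", "10", "12", "13")

# Assumed equivalent of what melt() does to the dimnames:
# all-numeric strings are converted to integers, losing the leading zero
converted <- type.convert(sample.names, as.is = TRUE)

# Stand-in for the melted OTU table (names already converted)
mdf <- data.frame(Sample = converted, Abundance = c(5, 6, 7, 8))

# Stand-in for the sample data (names still the original strings)
sdf <- data.frame(Sample = sample.names,
                  Var.1 = c("A", "A", "B", "B"),
                  stringsAsFactors = FALSE)

# merge() matches 10, 12 and 13, but 1 has no partner "01",
# so that row is dropped without a warning
merged <- merge(mdf, sdf, by = "Sample")
print(merged)
```

If that assumption holds, passing as.is = TRUE to melt() would keep the names as strings, but that would require a change inside psmelt() itself.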

I think the best solution is simply not to use numeric sample names. Depending on how you import your data, keeping check.names = TRUE will prepend an X to your sample names, which also fixes the problem by forcing them to be character strings.
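For illustration, here is how check.names behaves on import. This is a small sketch that assumes the count table is read with read.csv() and has samples as columns; the inline text stands in for a real file.

```r
# A tiny count table with numeric-looking sample names as column headers
csv_text <- "taxon,01,10\ntaxon1,5,6\ntaxon2,7,8"

# check.names = TRUE (the default) makes the headers syntactically valid
# names, which prepends "X" to names that start with a digit
checked <- read.csv(text = csv_text, check.names = TRUE)
print(names(checked))

# check.names = FALSE keeps the headers exactly as written
unchecked <- read.csv(text = csv_text, check.names = FALSE)
print(names(unchecked))
```

If samples are rows rather than columns, check.names does not touch them, and the names have to be prefixed another way.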

mikemc commented 6 years ago

Thanks for your observation and suggestion, Jeff. I'm using data from the Microbiome Quality Control project, which used randomized numeric identifiers of length 10, sometimes with a leading zero. Prepending an "X" to the names is a good enough solution for me. Still, it would be good for a warning to be printed when samples are dropped, which could easily be checked in the plot_bar function.
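Such a check could look roughly like the sketch below. names_at_risk() is a hypothetical helper, not part of phyloseq, and it assumes the names are altered by a type.convert()-style conversion of the whole vector.

```r
# Hypothetical helper: return the names that would not survive a
# round trip through type.convert() (e.g. "01" -> 1 -> "1")
names_at_risk <- function(x) {
  converted <- as.character(type.convert(x, as.is = TRUE))
  x[is.na(converted) | converted != x]
}

# Only "01" changes when the whole vector is converted to integers
risky <- names_at_risk(c("01", "10", "12", "13"))
if (length(risky) > 0) {
  warning("Sample names may be altered: ", paste(risky, collapse = ", "))
}

# Names containing letters are left as strings, so nothing is at risk
safe <- names_at_risk(c("CL3", "CC1", "SV1"))
```

With a phyloseq object, the same check could be run on sample_names(ps) before calling plot_bar().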

TamusT commented 6 years ago

Hey guys. I am quite new to programming and ran into the same problem when I named my samples as listed below. Where should I put this check.names argument in my code?

sample_name <- c("07", "08", "09", "10")
sample <- data.frame(sample_name, row.names=sample_name)

jeffkimbrel commented 6 years ago

Honestly, I think the best way (other than not using numeric sample names to begin with) is to just edit the sample names of the phyloseq object. Here's how, using the GlobalPatterns dataset.

> library("phyloseq")
> data("GlobalPatterns")

> dput(sample_names(GlobalPatterns))

c("CL3", "CC1", "SV1", "M31Fcsw", "M11Fcsw", "M31Plmr", "M11Plmr", 
"F21Plmr", "M31Tong", "M11Tong", "LMEpi24M", "SLEpi20M", "AQC1cm", 
"AQC4cm", "AQC7cm", "NP2", "NP3", "NP5", "TRRsed1", "TRRsed2", 
"TRRsed3", "TS28", "TS29", "Even1", "Even2", "Even3")

dput() prints the sample names vector as R code that recreates it. Just copy that output, edit the names however you see fit, and save the result in a vector named samples. You can see below that I changed the names of a couple of them.

> samples = c("CL3-edit", "CC1", "SV1", "M31Fcsw", "M11Fcsw", "M31Plmr", "M11Plmr", 
            "F21Plmr", "M31Tong", "M11Tong-edit", "LMEpi24M", "SLEpi20M", "AQC1cm", 
            "AQC4cm", "AQC7cm", "NP2", "NP3", "NP5", "TRRsed1", "TRRsed2", 
            "TRRsed3", "TS28", "TS29", "Even1", "Even2", "Even3")

# copy the GlobalPatterns object and update the names
> GlobalPatterns2 = GlobalPatterns
> sample_names(GlobalPatterns2) = samples

# check that the names changed
> sample_names(GlobalPatterns2)
 [1] "CL3-edit"     "CC1"          "SV1"          "M31Fcsw"      "M11Fcsw"      "M31Plmr"      "M11Plmr"      "F21Plmr"     
 [9] "M31Tong"      "M11Tong-edit" "LMEpi24M"     "SLEpi20M"     "AQC1cm"       "AQC4cm"       "AQC7cm"       "NP2"         
[17] "NP3"          "NP5"          "TRRsed1"      "TRRsed2"      "TRRsed3"      "TS28"         "TS29"         "Even1"       
[25] "Even2"        "Even3" 

These sample name changes should be preserved everywhere, including the otu_table. Just be sure you don't change the order of your samples in the vector.
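To avoid accidentally reordering names while editing by hand, the renaming can also be done with a lookup, so that each new name stays in the position of the old one. This is a base R sketch; the short old_names vector stands in for sample_names(GlobalPatterns), and renames is an illustrative mapping.

```r
# Stand-in for sample_names(GlobalPatterns), shortened for the example
old_names <- c("CL3", "CC1", "SV1", "M11Tong")

# Illustrative mapping: old name = new name
renames <- c("CL3" = "CL3-edit", "M11Tong" = "M11Tong-edit")

# Replace only the names found in the mapping, keeping the order intact
new_names <- old_names
hit <- old_names %in% names(renames)
new_names[hit] <- unname(renames[old_names[hit]])

# Sanity checks before assigning back: same length, no duplicates
stopifnot(length(new_names) == length(old_names),
          !anyDuplicated(new_names))
print(new_names)
```

The result can then be assigned with sample_names(GlobalPatterns2) = new_names, as in the transcript above.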

I should add that if you just want to add "X" or "Sample" to the beginning of all samples, try this before adding to the new phyloseq object:

# use one of these, not both
samples = paste0("X", samples)          # prefix with "X"
# samples = paste0("Sample", samples)   # or prefix with "Sample"

Or, even easier:

GlobalPatterns2 = GlobalPatterns
sample_names(GlobalPatterns2) = paste0("X", sample_names(GlobalPatterns2))

> sample_names(GlobalPatterns2)
 [1] "XCL3"      "XCC1"      "XSV1"      "XM31Fcsw"  "XM11Fcsw"  "XM31Plmr"  "XM11Plmr"
 [8] "XF21Plmr"  "XM31Tong"  "XM11Tong"  "XLMEpi24M" "XSLEpi20M" "XAQC1cm"   "XAQC4cm"
[15] "XAQC7cm"   "XNP2"      "XNP3"      "XNP5"      "XTRRsed1"  "XTRRsed2"  "XTRRsed3"
[22] "XTS28"     "XTS29"     "XEven1"    "XEven2"    "XEven3"

TamusT commented 6 years ago

thanks for everything

apascualgarcia commented 6 years ago

I had the same problem. When merge_samples() is used to prepare bar plots, samples are often merged by a factor, and factor levels are frequently labelled with numbers. As a result, it is not unusual for some levels to be missing from the plot (in my case those labelled "1.0", "2.0", "3.0", etc.). Of course one can rename the levels, but I think this should be fixed (or should at least trigger a warning to the user), because when the factor has many levels one may not notice that some are missing.
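Until that is addressed, the levels can be relabelled before merging. The sketch below assumes the same type.convert()-style conversion is responsible in this case; group is an illustrative factor and the "G" prefix is arbitrary.

```r
# Illustrative grouping factor with numeric-looking levels
group <- factor(c("1.0", "2.0", "1.0", "3.0"))

# Levels that would change under conversion ("1.0" -> 1 -> "1")
converted <- as.character(type.convert(levels(group), as.is = TRUE))
at_risk <- levels(group)[converted != levels(group)]
print(at_risk)

# Prefix every level with a letter so it can no longer be read as a number
levels(group) <- paste0("G", levels(group))
print(levels(group))
```

In a phyloseq object, the same relabelling would be applied to the relevant column of the sample data before calling merge_samples().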