Open Ecotone23 opened 1 year ago
Hi @Ecotone23
What you saw in your original plot with x=Site
is the summed relative abundance from all the samples from each site, hence it can get greater than 100% if you have more than one sample in one site and also if they have quite high relative abundance.
To create your desired plot (I think that's what you meant), you will merge samples from the same site together, using merge_samples()
, much like what we do with tax_glom()
where taxa from the same phylum are merged (summed).
library(phyloseq)
library(ggplot2)
library(magrittr)
data(GlobalPatterns)
# To simplify the GP data
ps = subset_taxa(GlobalPatterns, Kingdom == "Bacteria")
ps = prune_taxa(taxa_sums(ps) > 1000, ps)
total = median(sample_sums(ps))
standf = function(x, t=total) round(t * (x / sum(x)))
normalized_ps = transform_sample_counts(ps, standf)
# Merge sample by their sampling site
# Here we have 9 "samples" (SampleType)
merged_ps_phylum = merge_samples(normalized_ps, "SampleType") # summed
merged_ps_phylum
#> phyloseq-class experiment-level object
#> otu_table() OTU Table: [ 1973 taxa and 9 samples ]
#> sample_data() Sample Data: [ 9 samples by 7 sample variables ]
#> tax_table() Taxonomy Table: [ 1973 taxa by 7 taxonomic ranks ]
#> phy_tree() Phylogenetic Tree: [ 1973 tips and 1972 internal nodes ]
# Sample Data has Sample ID per row
normalized_ps_phylum1 = tax_glom(normalized_ps, "Phylum", NArm = TRUE)
normalized_ps_phylum1
#> phyloseq-class experiment-level object
#> otu_table() OTU Table: [ 26 taxa and 26 samples ]
#> sample_data() Sample Data: [ 26 samples by 7 sample variables ]
#> tax_table() Taxonomy Table: [ 26 taxa by 7 taxonomic ranks ]
#> phy_tree() Phylogenetic Tree: [ 26 tips and 25 internal nodes ]
# Sample Data has Sample Type per row
normalized_ps_phylum2 = tax_glom(merged_ps_phylum, "Phylum", NArm = TRUE)
normalized_ps_phylum2
#> phyloseq-class experiment-level object
#> otu_table() OTU Table: [ 26 taxa and 9 samples ]
#> sample_data() Sample Data: [ 9 samples by 7 sample variables ]
#> tax_table() Taxonomy Table: [ 26 taxa by 7 taxonomic ranks ]
#> phy_tree() Phylogenetic Tree: [ 26 tips and 25 internal nodes ]
normalized_ps_phylum1_relabun = transform_sample_counts(normalized_ps_phylum1, function(OTU) OTU/sum(OTU) * 100)
normalized_ps_phylum2_relabun = transform_sample_counts(normalized_ps_phylum2, function(OTU) OTU/sum(OTU) * 100)
psmelt(normalized_ps_phylum1_relabun) %>%
ggplot(aes(x = Sample, y = Abundance, fill = Phylum)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
labs(y = "Relative Abundance", title = "Phylum Relative Abundance")
psmelt(normalized_ps_phylum2_relabun) %>%
ggplot(aes(x = Sample, y = Abundance, fill = Phylum)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
labs(y = "Relative Abundance", title = "Phylum Relative Abundance")
Created on 2023-09-29 with reprex v2.0.2
Absolutely @ycl6, it's working like a charm! I truly appreciate your troubleshooting assistance – you're the real expert here. I have a follow-up question:
For clarity, I'm beginning with my phyloseq object (ps) and opting not to perform normalization. My objective is to group phyla with relative abundances less than 1% into a new category named "Other." I've employed the following code to achieve this:
ps_phylum <- tax_glom(ps, "Phylum", NArm = TRUE)
ps_phylum_relabun <- transform_sample_counts(ps_phylum, function(x) x/sum(x) )
taxa_abundance_table_phylum <- psmelt(ps_phylum_relabun)
taxa_abundance_table_phylum$Phylum <- as.character(taxa_abundance_table_phylum$Phylum)
medians <- ddply(taxa_abundance_table_phylum, ~Phylum, function(x) c(median=median(x$Abundance)))
remainder <- medians[medians$median <= 0.01,]$Phylum
remainder #here, only one phylum, Proteobacteria,shows a relative abundance greater than 1%
taxa_abundance_table_phylum1[taxa_abundance_table_phylum1$Phylum %in% remainder,]$Phylum <- "Other"
All phyla, except Proteobacteria, have been listed by "remainder" since their relative abundances are less than 1%. However, when I employ the codes you provided in comment #1521 to generate a relative abundance table, it results in the following output:
write.table(ps %>% tax_glom(taxrank = "Phylum") %>% transform_sample_counts(function(x) {x/sum(x)}) %>% psmelt() %>% select(Phylum, Sample, Abundance) %>% spread(Sample, Abundance), file = "relative_abundance.phylum.tsv", sep = "\t", quote = F, row.names = F, col.names = T)
The column "relative abundance(%)" was added manually and calculated as the sum of the values in each row divided by number of samples and times 100: sum(X1:Xn)/N*100
There are six phyla with relative abundances exceeding 1%, and the total of these relative abundances for each phylum adds up to 100%. Therefore, Is it possible that the previous calculation using "medians" and "remainder" was incorrect?
"Therefore, Is it possible that the previous calculation using "medians" and "remainder" was incorrect?" --The issue appears to be related to the use of the median function, which is returning the median value for each phylum instead of the mean value. I'm not entirely certain whether the median function considers the relative abundance of the median-ranking OTU among all OTUs belonging to a particular phylum.
I also tried the following R codes to group the phyla with relative abundance less than 1% and generated the stacked barplot:
df <- as.data.frame(lapply(sample_data(ps),function (y) if(class(y)!="factor" ) as.factor(y) else y),stringsAsFactors=T) row.names(df) <- sample_names(ps) sample_data(ps) <- sample_data(df)
y1 <- merge_samples(ps, "Site") # merge samples on sample variable of interest y2 <- tax_glom(y1, taxrank = 'Phylum', NArm=TRUE) # agglomerate taxa y3 <- transform_sample_counts(y2, function(x) x/sum(x)*100) #get abundance in % y4 <- psmelt(y3) # create dataframe from phyloseq object y4$Phylum <- as.character(y4$Phylum) #convert to character y4$Phylum[y4$Abundance <= 1] <- "Other" #rename genera with < 1% abundance
StackedBarPlot_phylum <- y4 %>%
ggplot(aes(x =Site, y = Abundance, fill = Phylum)) +
geom_bar(stat = "identity") +
labs(x = "Sampling sites",
y = "Relative abundance (%)",
title = "") +
theme(
axis.text.x = element_text(size = 12, angle = 90, vjust = 0.5, hjust = 1),axis.title.x=element_text(size = 15),
axis.text.y = element_text(size = 12), axis.title.y=element_text(size = 15),
legend.text = element_text(size = 10), legend.position = "right",
strip.text = element_text(size = 12)
)+guides(fill=guide_legend(ncol =1))
StackedBarPlot_phylum
I successfully generated the desired barplot with relative abundance on the Y-axis and sampling sites on the X-axis. However, two new issues have emerged:
Instead of displaying the names of the sampling sites as characters, only numerical values are appearing. The mean relative abundance for the seven phyla displayed in the barplot is expected to be greater than 1%. However, in the provided TSV table, only six phyla have values exceeding 1%. "Epsilonbacteraeota" and "Acidobacteria," which are retained in the barplot, are supposed to have abundances of 0.83% and 0.44%, respectively, according to the TSV table. Is this command (y3 <- transform_sample_counts(y2, function(x) x/sum(x)*100) ) still not getting the mean relative abundance?
@ycl6 Can you help solve the problems? Thank you in advance.
Hi @Ecotone23
I think if the goal is to group the lowly abundant phyla, one way is to simply select to show top N phyla in your plot. There's a similar discussion here #1487 and I have provided some sample code.
library(phyloseq)
library(tidyverse)
data(GlobalPatterns)
# To simplify the GP data
ps = subset_taxa(GlobalPatterns, Kingdom == "Bacteria")
ps = prune_taxa(taxa_sums(ps) > 1000, ps)
# Merge sample by their sampling site
merged_ps = merge_samples(ps, "SampleType") # summed
# Merge taxa by 'Phylum' rank
normalized_ps_phylum = tax_glom(merged_ps, "Phylum", NArm = TRUE)
# Calculate relative abundance
normalized_ps_phylum_relabun = transform_sample_counts(normalized_ps_phylum, function(OTU) OTU/sum(OTU))
# Show taxa_sums
taxaSums = data.frame(tax_table(normalized_ps_phylum_relabun)[,"Phylum"],
taxa_sums = taxa_sums(normalized_ps_phylum_relabun)) %>%
arrange(desc(taxa_sums)) # reverse sort
taxaSums
#> Phylum taxa_sums
#> 360229 Proteobacteria 3.0133407264
#> 331820 Bacteroidetes 1.5118082866
#> 189047 Firmicutes 1.3832461922
#> 549656 Cyanobacteria 1.3818194592
#> 329744 Actinobacteria 0.8098169731
#> 319505 Verrucomicrobia 0.2735084825
#> 36155 Acidobacteria 0.2347197201
#> 64396 Fusobacteria 0.1185165676
#> 513590 Planctomycetes 0.0788220097
#> 357591 Tenericutes 0.0640553136
#> 284052 Chloroflexi 0.0476642440
#> 593891 Gemmatimonadetes 0.0127665437
#> 591182 SAR406 0.0119666800
#> 337082 Chlorobi 0.0111134726
#> 306921 SPAM 0.0102539984
#> 114339 AD3 0.0084536592
#> 144091 WS3 0.0057790605
#> 523166 GN04 0.0051596590
#> 252035 Nitrospirae 0.0040368527
#> 535000 OP8 0.0034604943
#> 184899 SC4 0.0032217200
#> 212910 Spirochaetes 0.0016595890
#> 112742 WPS-2 0.0015506052
#> 312378 Elusimicrobia 0.0011932614
#> 368928 Lentisphaerae 0.0011906972
#> 589390 Armatimonadetes 0.0008757319
# Select top10 phyla
top10 = head(rownames(taxaSums), 10)
top10
#> [1] "360229" "331820" "189047" "549656" "329744" "319505" "36155" "64396"
#> [9] "513590" "357591"
# Remove unselected phyla
y = prune_taxa(top10, normalized_ps_phylum_relabun)
y
#> phyloseq-class experiment-level object
#> otu_table() OTU Table: [ 10 taxa and 9 samples ]
#> sample_data() Sample Data: [ 9 samples by 7 sample variables ]
#> tax_table() Taxonomy Table: [ 10 taxa by 7 taxonomic ranks ]
#> phy_tree() Phylogenetic Tree: [ 10 tips and 9 internal nodes ]
# Build data.frame
df1 = data.frame(ID = c(taxa_names(y), "Other"), Phylum = c(tax_table(y)[,"Phylum"], "Other"))
df1
#> ID Phylum
#> 1 329744 Actinobacteria
#> 2 64396 Fusobacteria
#> 3 357591 Tenericutes
#> 4 549656 Cyanobacteria
#> 5 360229 Proteobacteria
#> 6 331820 Bacteroidetes
#> 7 36155 Acidobacteria
#> 8 319505 Verrucomicrobia
#> 9 513590 Planctomycetes
#> 10 189047 Firmicutes
#> 11 Other Other
df2 = t(cbind(otu_table(y), data.frame(Other = 1 - sample_sums(y))))
df2
#> Feces Freshwater Freshwater (creek) Mock Ocean
#> 329744 2.827556e-02 3.323764e-01 1.483344e-02 8.481134e-02 5.920611e-02
#> 64396 2.173577e-05 8.554674e-05 4.024297e-05 3.546782e-05 1.088135e-04
#> 357591 1.269868e-02 8.585336e-06 5.317821e-05 4.875208e-02 1.591687e-05
#> 549656 3.832212e-02 3.692038e-01 6.742414e-01 1.766858e-03 1.927512e-01
#> 360229 2.007726e-02 1.536416e-01 1.756737e-01 2.008655e-01 3.532960e-01
#> 331820 4.302774e-01 1.097908e-01 4.799919e-02 2.517705e-01 3.564371e-01
#> 36155 6.146590e-05 9.106588e-05 8.539106e-03 5.861524e-04 7.813734e-05
#> 319505 3.558252e-03 2.153601e-02 2.589040e-02 5.600182e-04 1.929356e-02
#> 513590 3.082203e-05 1.117504e-02 4.325093e-03 1.536939e-04 6.351987e-03
#> 189047 4.654547e-01 1.367828e-03 4.281359e-03 4.105043e-01 5.955802e-04
#> Other 1.222013e-03 7.233146e-04 4.412293e-02 1.941397e-04 1.186559e-02
#> Sediment (estuary) Skin Soil Tongue
#> 329744 0.0050018788 0.1538367399 0.0693361486 6.213938e-02
#> 64396 0.0001294127 0.0579821390 0.0027240102 5.738920e-02
#> 357591 0.0016044056 0.0004512294 0.0001322885 3.389552e-04
#> 549656 0.0413762027 0.0455025671 0.0092756357 9.379683e-03
#> 360229 0.8239644255 0.3036931707 0.3275282541 6.546008e-01
#> 331820 0.0470968677 0.0749382057 0.0946892096 9.880904e-02
#> 36155 0.0068744650 0.0012088284 0.2172078998 7.259891e-05
#> 319505 0.0045902529 0.0008608324 0.1966244304 5.947341e-04
#> 513590 0.0170466151 0.0002422652 0.0393710069 1.254855e-04
#> 189047 0.0209929228 0.3610409245 0.0026781982 1.163304e-01
#> Other 0.0313225510 0.0002430977 0.0404329179 2.197199e-04
df = cbind(df1, df2) %>%
pivot_longer(-c(ID, Phylum), names_to = "SampleType", values_to = "Abundance") %>%
as.data.frame
head(df)
#> ID Phylum SampleType Abundance
#> 1 329744 Actinobacteria Feces 0.028275563
#> 2 329744 Actinobacteria Freshwater 0.332376375
#> 3 329744 Actinobacteria Freshwater (creek) 0.014833435
#> 4 329744 Actinobacteria Mock 0.084811341
#> 5 329744 Actinobacteria Ocean 0.059206113
#> 6 329744 Actinobacteria Sediment (estuary) 0.005001879
# Re-level Phylum by abundance, "Other" placed at the last
df$Phylum = as.factor(df$Phylum)
levels(df$Phylum) # Before ordering
#> [1] "Acidobacteria" "Actinobacteria" "Bacteroidetes" "Cyanobacteria"
#> [5] "Firmicutes" "Fusobacteria" "Other" "Planctomycetes"
#> [9] "Proteobacteria" "Tenericutes" "Verrucomicrobia"
df$Phylum = factor(df$Phylum, levels = c(taxaSums[top10,"Phylum"], "Other"))
levels(df$Phylum) # After ordering
#> [1] "Proteobacteria" "Bacteroidetes" "Firmicutes" "Cyanobacteria"
#> [5] "Actinobacteria" "Verrucomicrobia" "Acidobacteria" "Fusobacteria"
#> [9] "Planctomycetes" "Tenericutes" "Other"
ggplot(df, aes(SampleType, Abundance, fill = Phylum)) +
geom_col(color = "black") + scale_fill_brewer(palette = "Set3") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
labs(x = "Sample type", y = "Average relative abundance")
If you do want to calculate the median (or mean) and use that as a threshold, you can do that from the OTU table, for example:
# Calculate the median relative abundance of each phylum from OTU table
median_abd = apply(otu_table(normalized_ps_phylum_relabun) * 100, 2, median)
median_abd
#> 329744 212910 64396 357591 549656 360229
#> 6.213938e+00 3.088447e-04 1.088135e-02 3.389552e-02 4.137620e+00 3.036932e+01
#> 337082 331820 591182 593891 535000 312378
#> 1.132431e-03 9.880904e+00 8.400274e-04 9.490802e-03 3.846300e-04 0.000000e+00
#> 306921 36155 252035 144091 523166 319505
#> 1.165537e-03 5.861524e-02 5.769450e-04 1.041831e-03 1.226477e-04 4.590253e-01
#> 368928 513590 112742 589390 284052 184899
#> 5.787951e-05 4.325093e-01 1.442362e-04 6.222425e-05 1.183636e-02 4.807875e-05
#> 114339 189047
#> 1.665053e-04 2.099292e+00
median_abd[median_abd < 1]
#> 212910 64396 357591 337082 591182 593891
#> 3.088447e-04 1.088135e-02 3.389552e-02 1.132431e-03 8.400274e-04 9.490802e-03
#> 535000 312378 306921 36155 252035 144091
#> 3.846300e-04 0.000000e+00 1.165537e-03 5.861524e-02 5.769450e-04 1.041831e-03
#> 523166 319505 368928 513590 112742 589390
#> 1.226477e-04 4.590253e-01 5.787951e-05 4.325093e-01 1.442362e-04 6.222425e-05
#> 284052 184899 114339
#> 1.183636e-02 4.807875e-05 1.665053e-04
Created on 2023-10-04 with reprex v2.0.2
Thank you @ycl6 for the nice codes! I think it will solve most of the problems. While running the merge_samples(ps, "SampleType"), all my categorical variables(including Site) from samples_df were coerced to be "NAs" with a warning: In asMethod(object) : NAs introduced by coercion. I tried to convert the Site name as factor using the following codes before merging samples:
df <- as.data.frame(lapply(sample_data(ps),function (y) if(class(y)!="factor" ) as.factor(y) else y),stringsAsFactors=T) row.names(df) <- sample_names(ps) sample_data(ps) <- sample_data(df)
and it didn't show the warning above when I ran the merge_sample() function again, but I ended up getting a plot that has absolute abundance(not relative abundance) as the y axis, which means the OTUs were not merged by sites.
I tried to give each sampling site a number to replace the string in the original sample file(i.e., sample_df) before merging as a phyloseq object. There was an error of invalid class “phyloseq” object: Component sample names do not match, when I ran the merge_sample() and also the warning of NAs coercion. I also tried to define the numeric Site as a factor, but ended up getting the same error and warning.
I wonder if this is an issue related to my R version? I tried both 4.3.1 and 4.2.3, both ended up the same. Is there another way to merge sample OTUs by sample type or other variables from sample file(i.e., samples_df)?
Hi @Ecotone23 #663 seems to address the same issue you had.
I didn't had this myself, but I guess (like you said), might be the data type of the variable you are trying to merge that's not quite compatible with the merge function.
What do you get when you do str(sample_data(ps))
before merging?
Hi @ycl6 ,
Just tried to do str(sample_data(ps)) before merging and NAs are still introduced by coercion. I found a useful trick that can partially solve the problem:
after merging sample type using "ps2 <- merge_samples(ps, "SampleType") " and introducing NAs, running "sample_data(ps2)$SampleType<-sample_names(ps2)" can recover the original strings for the merged variable SampleType.
Although this cannot recover all other variables from the sample_df that have been coerced to NAs, it is fair enough for generating a relative abundance of phylum(y) against SampleType(x) plot.
Yes I also found this related issue #663 and followed the codes to convert all my categorial variables to factors before merging. This can stop the NA coercion warning but the merging was not successful and I still got numbers instead of character names of my SampleType as the X axis in the plot.
I have yet to discover an ideal resolution for handling the frequently encountered NA coercion issue when executing merge_samples().
I truly appreciate all your comments and suggestions. Thanks a million!
Hi @Ecotone23 Sorry for not making it clear. The reason to see your str(sample_data(ps))
is so that we can figure out how your sample_data
structure is different from the demo data, which works fine.
So let's do a little investigation into what works and what doesn't.
As you can see from my little test, merge_samples
absolutely does not like integers or integers of a factor type. It tolerates characters, but the resulting object's SampleType
becomes NA
. However, it's alright because the sample names are still correctly transferred over.
If you want to avoid the warning message, you can follow the last step where I change it back from character type to factor.
_And yes, you can treat sample_data(ps)
like a data.frame_ ;)
library(phyloseq)
data(GlobalPatterns)
# To simplify the GP data
ps = subset_taxa(GlobalPatterns, Kingdom == "Bacteria")
ps = prune_taxa(taxa_sums(ps) > 1000, ps)
str(sample_data(ps))
#> 'data.frame': 26 obs. of 7 variables:
#> Formal class 'sample_data' [package "phyloseq"] with 4 slots
#> ..@ .Data :List of 7
#> .. ..$ : Factor w/ 26 levels "AQC1cm","AQC4cm",..: 5 4 21 14 11 15 12 9 16 13 ...
#> .. ..$ : Factor w/ 26 levels "ILBC_01","ILBC_02",..: 1 2 3 4 5 6 7 8 9 10 ...
#> .. ..$ : Factor w/ 26 levels "AACGCA","AACTCG",..: 1 2 3 4 5 6 7 8 9 10 ...
#> .. ..$ : Factor w/ 26 levels "AACTGT","ACAGGT",..: 24 13 3 20 9 5 17 7 19 11 ...
#> .. ..$ : Factor w/ 26 levels "AGAGAGACAGG",..: 10 6 18 21 8 9 16 19 26 20 ...
#> .. ..$ : Factor w/ 9 levels "Feces","Freshwater",..: 8 8 8 1 1 7 7 7 9 9 ...
#> .. ..$ : Factor w/ 25 levels "Allequash Creek, 0-1cm depth",..: 4 5 20 14 11 15 12 9 16 13 ...
#> ..@ names : chr "X.SampleID" "Primer" "Final_Barcode" "Barcode_truncated_plus_T" ...
#> ..@ row.names: chr "CL3" "CC1" "SV1" "M31Fcsw" ...
#> ..@ .S3Class : chr "data.frame"
str(sample_data(ps)$SampleType)
#> Factor w/ 9 levels "Feces","Freshwater",..: 8 8 8 1 1 7 7 7 9 9 ...
levels(sample_data(ps)$SampleType)
#> [1] "Feces" "Freshwater" "Freshwater (creek)"
#> [4] "Mock" "Ocean" "Sediment (estuary)"
#> [7] "Skin" "Soil" "Tongue"
ps1 = ps2 = ps3 = ps4 = ps
# ps1, SampleType is factor type with characters
# No error
str(sample_data(ps1)$SampleType)
#> Factor w/ 9 levels "Feces","Freshwater",..: 8 8 8 1 1 7 7 7 9 9 ...
ps1m = merge_samples(ps1, "SampleType")
sample_data(ps1m)
#> X.SampleID Primer Final_Barcode Barcode_truncated_plus_T
#> Feces 19.0 13.5 13.5 16.500000
#> Freshwater 15.0 11.5 11.5 12.000000
#> Freshwater (creek) 2.0 14.0 14.0 13.000000
#> Mock 7.0 25.0 25.0 12.333333
#> Ocean 18.0 17.0 17.0 13.666667
#> Sediment (estuary) 23.0 20.0 20.0 15.000000
#> Skin 12.0 7.0 7.0 9.666667
#> Soil 10.0 2.0 2.0 13.333333
#> Tongue 14.5 9.5 9.5 15.000000
#> Barcode_full_length SampleType Description
#> Feces 13.750000 1 18.500000
#> Freshwater 4.500000 2 15.500000
#> Freshwater (creek) 6.666667 3 2.000000
#> Mock 16.000000 4 7.000000
#> Ocean 17.000000 5 18.000000
#> Sediment (estuary) 14.666667 6 22.666667
#> Skin 14.666667 7 12.000000
#> Soil 11.333333 8 9.666667
#> Tongue 23.000000 9 14.500000
# ps2, SampleType is character type
# Warning: NAs introduced by coercion
sample_data(ps2)$SampleType = as.character(sample_data(ps2)$SampleType)
str(sample_data(ps2)$SampleType)
#> chr [1:26] "Soil" "Soil" "Soil" "Feces" "Feces" "Skin" "Skin" "Skin" ...
ps2m = merge_samples(ps2, "SampleType")
#> Warning in asMethod(object): NAs introduced by coercion
sample_data(ps2m)
#> X.SampleID Primer Final_Barcode Barcode_truncated_plus_T
#> Feces 19.0 13.5 13.5 16.500000
#> Freshwater 15.0 11.5 11.5 12.000000
#> Freshwater (creek) 2.0 14.0 14.0 13.000000
#> Mock 7.0 25.0 25.0 12.333333
#> Ocean 18.0 17.0 17.0 13.666667
#> Sediment (estuary) 23.0 20.0 20.0 15.000000
#> Skin 12.0 7.0 7.0 9.666667
#> Soil 10.0 2.0 2.0 13.333333
#> Tongue 14.5 9.5 9.5 15.000000
#> Barcode_full_length SampleType Description
#> Feces 13.750000 NA 18.500000
#> Freshwater 4.500000 NA 15.500000
#> Freshwater (creek) 6.666667 NA 2.000000
#> Mock 16.000000 NA 7.000000
#> Ocean 17.000000 NA 18.000000
#> Sediment (estuary) 14.666667 NA 22.666667
#> Skin 14.666667 NA 12.000000
#> Soil 11.333333 NA 9.666667
#> Tongue 23.000000 NA 14.500000
# ps3, SampleType is factor type with integers
# Error: invalid class “phyloseq” object
levels(sample_data(ps3)$SampleType) = 1:nlevels(sample_data(ps3)$SampleType)
str(sample_data(ps3)$SampleType)
#> Factor w/ 9 levels "1","2","3","4",..: 8 8 8 1 1 7 7 7 9 9 ...
ps3m = merge_samples(ps3, "SampleType")
#> Error in validObject(.Object): invalid class "phyloseq" object:
#> Component sample names do not match.
#> Try sample_names()
#sample_data(ps3m)
# ps3, SampleType is integer type
# Error: invalid class “phyloseq” object
sample_data(ps4)$SampleType = as.integer(sample_data(ps4)$SampleType)
str(sample_data(ps4)$SampleType)
#> int [1:26] 8 8 8 1 1 7 7 7 9 9 ...
ps4m = merge_samples(ps4, "SampleType")
#> Error in validObject(.Object): invalid class "phyloseq" object:
#> Component sample names do not match.
#> Try sample_names()
#sample_data(ps4m)
# Change ps2 SampleType to factor
sample_data(ps2)$SampleType = as.factor(sample_data(ps2)$SampleType)
str(sample_data(ps2)$SampleType)
#> Factor w/ 9 levels "Feces","Freshwater",..: 8 8 8 1 1 7 7 7 9 9 ...
ps2m = merge_samples(ps2, "SampleType")
sample_data(ps2m)
#> X.SampleID Primer Final_Barcode Barcode_truncated_plus_T
#> Feces 19.0 13.5 13.5 16.500000
#> Freshwater 15.0 11.5 11.5 12.000000
#> Freshwater (creek) 2.0 14.0 14.0 13.000000
#> Mock 7.0 25.0 25.0 12.333333
#> Ocean 18.0 17.0 17.0 13.666667
#> Sediment (estuary) 23.0 20.0 20.0 15.000000
#> Skin 12.0 7.0 7.0 9.666667
#> Soil 10.0 2.0 2.0 13.333333
#> Tongue 14.5 9.5 9.5 15.000000
#> Barcode_full_length SampleType Description
#> Feces 13.750000 1 18.500000
#> Freshwater 4.500000 2 15.500000
#> Freshwater (creek) 6.666667 3 2.000000
#> Mock 16.000000 4 7.000000
#> Ocean 17.000000 5 18.000000
#> Sediment (estuary) 14.666667 6 22.666667
#> Skin 14.666667 7 12.000000
#> Soil 11.333333 8 9.666667
#> Tongue 23.000000 9 14.500000
# Replace the `SampleType` with actual name
sample_data(ps2m)$SampleType = rownames(sample_data(ps2m))
sample_data(ps2m)
#> X.SampleID Primer Final_Barcode Barcode_truncated_plus_T
#> Feces 19.0 13.5 13.5 16.500000
#> Freshwater 15.0 11.5 11.5 12.000000
#> Freshwater (creek) 2.0 14.0 14.0 13.000000
#> Mock 7.0 25.0 25.0 12.333333
#> Ocean 18.0 17.0 17.0 13.666667
#> Sediment (estuary) 23.0 20.0 20.0 15.000000
#> Skin 12.0 7.0 7.0 9.666667
#> Soil 10.0 2.0 2.0 13.333333
#> Tongue 14.5 9.5 9.5 15.000000
#> Barcode_full_length SampleType Description
#> Feces 13.750000 Feces 18.500000
#> Freshwater 4.500000 Freshwater 15.500000
#> Freshwater (creek) 6.666667 Freshwater (creek) 2.000000
#> Mock 16.000000 Mock 7.000000
#> Ocean 17.000000 Ocean 18.000000
#> Sediment (estuary) 14.666667 Sediment (estuary) 22.666667
#> Skin 14.666667 Skin 12.000000
#> Soil 11.333333 Soil 9.666667
#> Tongue 23.000000 Tongue 14.500000
Created on 2023-10-06 with reprex v2.0.2
Hi @ycl6 , I greatly appreciate your comprehensive explanation and the clear, step-by-step guidance you provided. It has not only helped me grasp the entire process but has also enriched my understanding beyond the initial queries. Thank you so much!
You are right that all my categorial variables from sample data are recognized as character rather than factor.
using the following codes you just provided, I converted all categorical variables to factor before merging:
sample_data(ps)$Location = as.factor(sample_data(ps)$Location)
The rest of the steps run smoothly without any error or warning. I noticed that in the final data frame y4, all the factors have been converted to numbers. So in the ggplot(aes()) I will have to use as x= Sample instead of x= Location to generate a plot with Location as the X axis, right?
`> y1 <- tax_glom(decatur, taxrank = 'Phylum', NArm=TRUE) # agglomerate taxa
y2 <- merge_samples(y1, "Location") # merge samples on sample variable of interest y3 <- transform_sample_counts(y2, function(x) x/sum(x)*100) #get abundance in % y4 <- psmelt(y3) # create dataframe from phyloseq object y4$Phylum <- as.character(y4$Phylum) #convert to character mean<- ddply(y4, ~Phylum, function(x) c(mean=mean(x$Abundance))) # group dataframe by Phylum, calculate mean rel. abundance Other <- mean[mean$mean <= 1,]$Phylum # find Phyla whose rel. abund. is less than 1% y4[y4$Phylum %in% Other,]$Phylum <- 'Other' # change their name to "Other" y4`
`#plot a stacked barplot StackedBarPlot_phylum <- y4 %>% ggplot(aes(x =Sample, y = Abundance, fill = Phylum)) + geom_bar(stat = "identity") + labs(x = "Sampling sites", y = "Relative abundance (%)", title = "") + scale_fill_manual(values=c('khaki4','seagreen','sienna1', 'slateblue', 'skyblue'))+
theme( axis.text.x = element_text(size = 12, angle = 45, vjust = 1, hjust = 1),axis.title.x=element_text(size = 15), axis.text.y = element_text(size = 12), axis.title.y=element_text(size = 15), legend.text = element_text(size = 10), legend.position = "right", strip.text = element_text(size = 12) )+guides(fill=guide_legend(ncol =1))
StackedBarPlot_phylum ` I got a nice stacked barplot, but would like to make a little change as depicted below:
My follow-up question is how to add my two County names as two facets, and sort the position according to County as shown in the attached pic?
Hi all,
I have a phyloseq object ps containing ID for over 200 samples and 10 sampling sties. Before plotting anything, I normalized the sampling depth by following the DEseq approach as shown below:
median(sample_sums(ps)) total = median(sample_sums(ps)) standf = function(x, t=total) round(t * (x / sum(x))) normalized_ps = transform_sample_counts(ps, standf)
Then I tried to plot the relative phylum abundance by sampling site:
Merge everything to the phylum level
normalized_ps_phylum <- tax_glom(normalized_ps, "Phylum", NArm = TRUE)
Transform Taxa counts to relative abundance
normalized_ps_phylum_relabun <- transform_sample_counts(normalized_ps_phylum, function(OTU) OTU/sum(OTU) * 100)
Then extract the data from the phyloseq object:
taxa_abundance_table_phylum <- psmelt(normalized_ps_phylum_relabun)
plot a stacked bar plot
StackedBarPlot_phylum <- taxa_abundance_table_phylum %>% ggplot(aes(x =Site, y = Abundance, fill = Phylum)) + geom_bar(stat = "identity") + labs(x = "", y = "Relative Abundance", title = "Phylum Relative Abundance")
StackedBarPlot_phylum
I got the stacked barplot for phylum abundance, but what I want is relative abundance of phylum. When I changed the "x=Site" to "x=Sample" within ggplot(aes()), it worked, but the X axis label will be sample ID rather than the desired sampling sites.
Anyone knows how to fix it? Thank you in advance.
Best, Gabriel