joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/
586 stars 186 forks source link

How I can plot relative abundance by sampling site instead of by sample ID? #1701

Open Ecotone23 opened 1 year ago

Ecotone23 commented 1 year ago

Hi all,

I have a phyloseq object ps containing ID for over 200 samples and 10 sampling sties. Before plotting anything, I normalized the sampling depth by following the DEseq approach as shown below:

median(sample_sums(ps)) total = median(sample_sums(ps)) standf = function(x, t=total) round(t * (x / sum(x))) normalized_ps = transform_sample_counts(ps, standf)

Then I tried to plot the relative phylum abundance by sampling site:

Merge everything to the phylum level

normalized_ps_phylum <- tax_glom(normalized_ps, "Phylum", NArm = TRUE)

Transform Taxa counts to relative abundance

normalized_ps_phylum_relabun <- transform_sample_counts(normalized_ps_phylum, function(OTU) OTU/sum(OTU) * 100)

Then extract the data from the phyloseq object:

taxa_abundance_table_phylum <- psmelt(normalized_ps_phylum_relabun)

plot a stacked bar plot

StackedBarPlot_phylum <- taxa_abundance_table_phylum %>% ggplot(aes(x =Site, y = Abundance, fill = Phylum)) + geom_bar(stat = "identity") + labs(x = "", y = "Relative Abundance", title = "Phylum Relative Abundance")

StackedBarPlot_phylum

I got the stacked barplot for phylum abundance, but what I want is relative abundance of phylum. When I changed the "x=Site" to "x=Sample" within ggplot(aes()), it worked, but the X axis label will be sample ID rather than the desired sampling sites.

Anyone knows how to fix it? Thank you in advance.

Best, Gabriel

ycl6 commented 1 year ago

Hi @Ecotone23

What you saw in your original plot with x=Site is the summed relative abundance from all the samples from each site, hence it can get greater than 100% if you have more than one sample in one site and also if they have quite high relative abundance.

To create your desired plot (I think that's what you meant), you will merge samples from the same site together, using merge_samples(), much like what we do with tax_glom() where taxa from the same phylum are merged (summed).

    library(phyloseq)
    library(ggplot2)
    library(magrittr)
    data(GlobalPatterns)

    # To simplify the GP data
    ps = subset_taxa(GlobalPatterns, Kingdom == "Bacteria")
    ps = prune_taxa(taxa_sums(ps) > 1000, ps)

    total = median(sample_sums(ps))
    standf = function(x, t=total) round(t * (x / sum(x)))
    normalized_ps = transform_sample_counts(ps, standf)

    # Merge sample by their sampling site
    # Here we have 9 "samples" (SampleType)
    merged_ps_phylum = merge_samples(normalized_ps, "SampleType") # summed
    merged_ps_phylum
    #> phyloseq-class experiment-level object
    #> otu_table()   OTU Table:         [ 1973 taxa and 9 samples ]
    #> sample_data() Sample Data:       [ 9 samples by 7 sample variables ]
    #> tax_table()   Taxonomy Table:    [ 1973 taxa by 7 taxonomic ranks ]
    #> phy_tree()    Phylogenetic Tree: [ 1973 tips and 1972 internal nodes ]

    # Sample Data has Sample ID per row
    normalized_ps_phylum1 = tax_glom(normalized_ps, "Phylum", NArm = TRUE)
    normalized_ps_phylum1
    #> phyloseq-class experiment-level object
    #> otu_table()   OTU Table:         [ 26 taxa and 26 samples ]
    #> sample_data() Sample Data:       [ 26 samples by 7 sample variables ]
    #> tax_table()   Taxonomy Table:    [ 26 taxa by 7 taxonomic ranks ]
    #> phy_tree()    Phylogenetic Tree: [ 26 tips and 25 internal nodes ]

    # Sample Data has Sample Type per row
    normalized_ps_phylum2 = tax_glom(merged_ps_phylum, "Phylum", NArm = TRUE)
    normalized_ps_phylum2
    #> phyloseq-class experiment-level object
    #> otu_table()   OTU Table:         [ 26 taxa and 9 samples ]
    #> sample_data() Sample Data:       [ 9 samples by 7 sample variables ]
    #> tax_table()   Taxonomy Table:    [ 26 taxa by 7 taxonomic ranks ]
    #> phy_tree()    Phylogenetic Tree: [ 26 tips and 25 internal nodes ]

    normalized_ps_phylum1_relabun = transform_sample_counts(normalized_ps_phylum1, function(OTU) OTU/sum(OTU) * 100)
    normalized_ps_phylum2_relabun = transform_sample_counts(normalized_ps_phylum2, function(OTU) OTU/sum(OTU) * 100)

    psmelt(normalized_ps_phylum1_relabun) %>%
            ggplot(aes(x = Sample, y = Abundance, fill = Phylum)) +
            geom_bar(stat = "identity") +
            theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
            labs(y = "Relative Abundance", title = "Phylum Relative Abundance")

    psmelt(normalized_ps_phylum2_relabun) %>%
            ggplot(aes(x = Sample, y = Abundance, fill = Phylum)) +
            geom_bar(stat = "identity") +
            theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
            labs(y = "Relative Abundance", title = "Phylum Relative Abundance")

Created on 2023-09-29 with reprex v2.0.2

Ecotone23 commented 1 year ago

Absolutely @ycl6, it's working like a charm! I truly appreciate your troubleshooting assistance – you're the real expert here. I have a follow-up question:

For clarity, I'm beginning with my phyloseq object (ps) and opting not to perform normalization. My objective is to group phyla with relative abundances less than 1% into a new category named "Other." I've employed the following code to achieve this:

merge OTUs by phylum level

ps_phylum <- tax_glom(ps, "Phylum", NArm = TRUE)

Transform Taxa counts to relative abundance

ps_phylum_relabun <- transform_sample_counts(ps_phylum, function(x) x/sum(x) )

Then extract the data from the phyloseq object:

taxa_abundance_table_phylum <- psmelt(ps_phylum_relabun)

convert it to a character

taxa_abundance_table_phylum$Phylum <- as.character(taxa_abundance_table_phylum$Phylum)

group dataframe by Phylum, calculate median relative abundance

medians <- ddply(taxa_abundance_table_phylum, ~Phylum, function(x) c(median=median(x$Abundance)))

find Phyla whose relative abundance is less than 1%

remainder <- medians[medians$median <= 0.01,]$Phylum

remainder #here, only one phylum, Proteobacteria,shows a relative abundance greater than 1%

change their name to "Other"

taxa_abundance_table_phylum1[taxa_abundance_table_phylum1$Phylum %in% remainder,]$Phylum <- "Other"

All phyla, except Proteobacteria, have been listed by "remainder" since their relative abundances are less than 1%. However, when I employ the codes you provided in comment #1521 to generate a relative abundance table, it results in the following output:

write.table(ps %>% tax_glom(taxrank = "Phylum") %>% transform_sample_counts(function(x) {x/sum(x)}) %>% psmelt() %>% select(Phylum, Sample, Abundance) %>% spread(Sample, Abundance), file = "relative_abundance.phylum.tsv", sep = "\t", quote = F, row.names = F, col.names = T)

Screenshot 2023-09-29 at 5 57 26 PM

The column "relative abundance(%)" was added manually and calculated as the sum of the values in each row divided by number of samples and times 100: sum(X1:Xn)/N*100

There are six phyla with relative abundances exceeding 1%, and the total of these relative abundances for each phylum adds up to 100%. Therefore, Is it possible that the previous calculation using "medians" and "remainder" was incorrect?

Ecotone23 commented 1 year ago

"Therefore, Is it possible that the previous calculation using "medians" and "remainder" was incorrect?" --The issue appears to be related to the use of the median function, which is returning the median value for each phylum instead of the mean value. I'm not entirely certain whether the median function considers the relative abundance of the median-ranking OTU among all OTUs belonging to a particular phylum.

I also tried the following R codes to group the phyla with relative abundance less than 1% and generated the stacked barplot:

Convert all variables in the sample_data to factors; otherwise, they will be coerced into NA values when executing the merge_samples() function.

df <- as.data.frame(lapply(sample_data(ps),function (y) if(class(y)!="factor" ) as.factor(y) else y),stringsAsFactors=T) row.names(df) <- sample_names(ps) sample_data(ps) <- sample_data(df)

group less abundant phyla as "Other"

y1 <- merge_samples(ps, "Site") # merge samples on sample variable of interest y2 <- tax_glom(y1, taxrank = 'Phylum', NArm=TRUE) # agglomerate taxa y3 <- transform_sample_counts(y2, function(x) x/sum(x)*100) #get abundance in % y4 <- psmelt(y3) # create dataframe from phyloseq object y4$Phylum <- as.character(y4$Phylum) #convert to character y4$Phylum[y4$Abundance <= 1] <- "Other" #rename genera with < 1% abundance

plot a stacked barplot

StackedBarPlot_phylum <- y4 %>% ggplot(aes(x =Site, y = Abundance, fill = Phylum)) + geom_bar(stat = "identity") + labs(x = "Sampling sites", y = "Relative abundance (%)", title = "") +
theme( axis.text.x = element_text(size = 12, angle = 90, vjust = 0.5, hjust = 1),axis.title.x=element_text(size = 15), axis.text.y = element_text(size = 12), axis.title.y=element_text(size = 15), legend.text = element_text(size = 10), legend.position = "right", strip.text = element_text(size = 12) )+guides(fill=guide_legend(ncol =1))

StackedBarPlot_phylum

I successfully generated the desired barplot with relative abundance on the Y-axis and sampling sites on the X-axis. However, two new issues have emerged:

Instead of displaying the names of the sampling sites as characters, only numerical values are appearing. The mean relative abundance for the seven phyla displayed in the barplot is expected to be greater than 1%. However, in the provided TSV table, only six phyla have values exceeding 1%. "Epsilonbacteraeota" and "Acidobacteria," which are retained in the barplot, are supposed to have abundances of 0.83% and 0.44%, respectively, according to the TSV table. Is this command (y3 <- transform_sample_counts(y2, function(x) x/sum(x)*100) ) still not getting the mean relative abundance?

Screenshot 2023-10-02 at 5 52 20 PM

@ycl6 Can you help solve the problems? Thank you in advance.

ycl6 commented 1 year ago

Hi @Ecotone23

I think if the goal is to group the lowly abundant phyla, one way is to simply select to show top N phyla in your plot. There's a similar discussion here #1487 and I have provided some sample code.

    library(phyloseq)
    library(tidyverse)
    data(GlobalPatterns)

    # To simplify the GP data
    ps = subset_taxa(GlobalPatterns, Kingdom == "Bacteria")
    ps = prune_taxa(taxa_sums(ps) > 1000, ps)

    # Merge sample by their sampling site
    merged_ps = merge_samples(ps, "SampleType") # summed

    # Merge taxa by 'Phylum' rank
    normalized_ps_phylum = tax_glom(merged_ps, "Phylum", NArm = TRUE)

    # Calculate relative abundance
    normalized_ps_phylum_relabun = transform_sample_counts(normalized_ps_phylum, function(OTU) OTU/sum(OTU))

    # Show taxa_sums
    taxaSums = data.frame(tax_table(normalized_ps_phylum_relabun)[,"Phylum"],
                          taxa_sums = taxa_sums(normalized_ps_phylum_relabun)) %>%
            arrange(desc(taxa_sums)) # reverse sort
    taxaSums
    #>                  Phylum    taxa_sums
    #> 360229   Proteobacteria 3.0133407264
    #> 331820    Bacteroidetes 1.5118082866
    #> 189047       Firmicutes 1.3832461922
    #> 549656    Cyanobacteria 1.3818194592
    #> 329744   Actinobacteria 0.8098169731
    #> 319505  Verrucomicrobia 0.2735084825
    #> 36155     Acidobacteria 0.2347197201
    #> 64396      Fusobacteria 0.1185165676
    #> 513590   Planctomycetes 0.0788220097
    #> 357591      Tenericutes 0.0640553136
    #> 284052      Chloroflexi 0.0476642440
    #> 593891 Gemmatimonadetes 0.0127665437
    #> 591182           SAR406 0.0119666800
    #> 337082         Chlorobi 0.0111134726
    #> 306921             SPAM 0.0102539984
    #> 114339              AD3 0.0084536592
    #> 144091              WS3 0.0057790605
    #> 523166             GN04 0.0051596590
    #> 252035      Nitrospirae 0.0040368527
    #> 535000              OP8 0.0034604943
    #> 184899              SC4 0.0032217200
    #> 212910     Spirochaetes 0.0016595890
    #> 112742            WPS-2 0.0015506052
    #> 312378    Elusimicrobia 0.0011932614
    #> 368928    Lentisphaerae 0.0011906972
    #> 589390  Armatimonadetes 0.0008757319

    # Select top10 phyla
    top10 = head(rownames(taxaSums), 10)
    top10
    #>  [1] "360229" "331820" "189047" "549656" "329744" "319505" "36155"  "64396" 
    #>  [9] "513590" "357591"

    # Remove unselected phyla
    y = prune_taxa(top10, normalized_ps_phylum_relabun)
    y
    #> phyloseq-class experiment-level object
    #> otu_table()   OTU Table:         [ 10 taxa and 9 samples ]
    #> sample_data() Sample Data:       [ 9 samples by 7 sample variables ]
    #> tax_table()   Taxonomy Table:    [ 10 taxa by 7 taxonomic ranks ]
    #> phy_tree()    Phylogenetic Tree: [ 10 tips and 9 internal nodes ]

    # Build data.frame
    df1 = data.frame(ID = c(taxa_names(y), "Other"), Phylum = c(tax_table(y)[,"Phylum"], "Other"))
    df1
    #>        ID          Phylum
    #> 1  329744  Actinobacteria
    #> 2   64396    Fusobacteria
    #> 3  357591     Tenericutes
    #> 4  549656   Cyanobacteria
    #> 5  360229  Proteobacteria
    #> 6  331820   Bacteroidetes
    #> 7   36155   Acidobacteria
    #> 8  319505 Verrucomicrobia
    #> 9  513590  Planctomycetes
    #> 10 189047      Firmicutes
    #> 11  Other           Other

    df2 = t(cbind(otu_table(y), data.frame(Other = 1 - sample_sums(y))))
    df2
    #>               Feces   Freshwater Freshwater (creek)         Mock        Ocean
    #> 329744 2.827556e-02 3.323764e-01       1.483344e-02 8.481134e-02 5.920611e-02
    #> 64396  2.173577e-05 8.554674e-05       4.024297e-05 3.546782e-05 1.088135e-04
    #> 357591 1.269868e-02 8.585336e-06       5.317821e-05 4.875208e-02 1.591687e-05
    #> 549656 3.832212e-02 3.692038e-01       6.742414e-01 1.766858e-03 1.927512e-01
    #> 360229 2.007726e-02 1.536416e-01       1.756737e-01 2.008655e-01 3.532960e-01
    #> 331820 4.302774e-01 1.097908e-01       4.799919e-02 2.517705e-01 3.564371e-01
    #> 36155  6.146590e-05 9.106588e-05       8.539106e-03 5.861524e-04 7.813734e-05
    #> 319505 3.558252e-03 2.153601e-02       2.589040e-02 5.600182e-04 1.929356e-02
    #> 513590 3.082203e-05 1.117504e-02       4.325093e-03 1.536939e-04 6.351987e-03
    #> 189047 4.654547e-01 1.367828e-03       4.281359e-03 4.105043e-01 5.955802e-04
    #> Other  1.222013e-03 7.233146e-04       4.412293e-02 1.941397e-04 1.186559e-02
    #>        Sediment (estuary)         Skin         Soil       Tongue
    #> 329744       0.0050018788 0.1538367399 0.0693361486 6.213938e-02
    #> 64396        0.0001294127 0.0579821390 0.0027240102 5.738920e-02
    #> 357591       0.0016044056 0.0004512294 0.0001322885 3.389552e-04
    #> 549656       0.0413762027 0.0455025671 0.0092756357 9.379683e-03
    #> 360229       0.8239644255 0.3036931707 0.3275282541 6.546008e-01
    #> 331820       0.0470968677 0.0749382057 0.0946892096 9.880904e-02
    #> 36155        0.0068744650 0.0012088284 0.2172078998 7.259891e-05
    #> 319505       0.0045902529 0.0008608324 0.1966244304 5.947341e-04
    #> 513590       0.0170466151 0.0002422652 0.0393710069 1.254855e-04
    #> 189047       0.0209929228 0.3610409245 0.0026781982 1.163304e-01
    #> Other        0.0313225510 0.0002430977 0.0404329179 2.197199e-04

    df = cbind(df1, df2) %>%
            pivot_longer(-c(ID, Phylum), names_to = "SampleType", values_to = "Abundance") %>%
            as.data.frame
    head(df)
    #>       ID         Phylum         SampleType   Abundance
    #> 1 329744 Actinobacteria              Feces 0.028275563
    #> 2 329744 Actinobacteria         Freshwater 0.332376375
    #> 3 329744 Actinobacteria Freshwater (creek) 0.014833435
    #> 4 329744 Actinobacteria               Mock 0.084811341
    #> 5 329744 Actinobacteria              Ocean 0.059206113
    #> 6 329744 Actinobacteria Sediment (estuary) 0.005001879

    # Re-level Phylum by abundance, "Other" placed at the last
    df$Phylum = as.factor(df$Phylum)
    levels(df$Phylum) # Before ordering
    #>  [1] "Acidobacteria"   "Actinobacteria"  "Bacteroidetes"   "Cyanobacteria"  
    #>  [5] "Firmicutes"      "Fusobacteria"    "Other"           "Planctomycetes" 
    #>  [9] "Proteobacteria"  "Tenericutes"     "Verrucomicrobia"
    df$Phylum = factor(df$Phylum, levels = c(taxaSums[top10,"Phylum"], "Other"))
    levels(df$Phylum) # After ordering
    #>  [1] "Proteobacteria"  "Bacteroidetes"   "Firmicutes"      "Cyanobacteria"  
    #>  [5] "Actinobacteria"  "Verrucomicrobia" "Acidobacteria"   "Fusobacteria"   
    #>  [9] "Planctomycetes"  "Tenericutes"     "Other"

    ggplot(df, aes(SampleType, Abundance, fill = Phylum)) +
            geom_col(color = "black") + scale_fill_brewer(palette = "Set3") +
            scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
            theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
            labs(x = "Sample type", y = "Average relative abundance")

If you do want to calculate the median (or mean) and use that as a threshold, you can do that from the OTU table, for example:

    # Calculate the median relative abundance of each phylum from OTU table
    median_abd = apply(otu_table(normalized_ps_phylum_relabun) * 100, 2, median)
    median_abd 
    #>       329744       212910        64396       357591       549656       360229 
    #> 6.213938e+00 3.088447e-04 1.088135e-02 3.389552e-02 4.137620e+00 3.036932e+01 
    #>       337082       331820       591182       593891       535000       312378 
    #> 1.132431e-03 9.880904e+00 8.400274e-04 9.490802e-03 3.846300e-04 0.000000e+00 
    #>       306921        36155       252035       144091       523166       319505 
    #> 1.165537e-03 5.861524e-02 5.769450e-04 1.041831e-03 1.226477e-04 4.590253e-01 
    #>       368928       513590       112742       589390       284052       184899 
    #> 5.787951e-05 4.325093e-01 1.442362e-04 6.222425e-05 1.183636e-02 4.807875e-05 
    #>       114339       189047 
    #> 1.665053e-04 2.099292e+00

    median_abd[median_abd < 1]
    #>       212910        64396       357591       337082       591182       593891 
    #> 3.088447e-04 1.088135e-02 3.389552e-02 1.132431e-03 8.400274e-04 9.490802e-03 
    #>       535000       312378       306921        36155       252035       144091 
    #> 3.846300e-04 0.000000e+00 1.165537e-03 5.861524e-02 5.769450e-04 1.041831e-03 
    #>       523166       319505       368928       513590       112742       589390 
    #> 1.226477e-04 4.590253e-01 5.787951e-05 4.325093e-01 1.442362e-04 6.222425e-05 
    #>       284052       184899       114339 
    #> 1.183636e-02 4.807875e-05 1.665053e-04

Created on 2023-10-04 with reprex v2.0.2

Ecotone23 commented 1 year ago

Thank you @ycl6 for the nice codes! I think it will solve most of the problems. While running the merge_samples(ps, "SampleType"), all my categorical variables(including Site) from samples_df were coerced to be "NAs" with a warning: In asMethod(object) : NAs introduced by coercion. I tried to convert the Site name as factor using the following codes before merging samples:

df <- as.data.frame(lapply(sample_data(ps),function (y) if(class(y)!="factor" ) as.factor(y) else y),stringsAsFactors=T) row.names(df) <- sample_names(ps) sample_data(ps) <- sample_data(df)

and it didn't show the warning above when I ran the merge_sample() function again, but I ended up getting a plot that has absolute abundance(not relative abundance) as the y axis, which means the OTUs were not merged by sites.

I tried to give each sampling site a number to replace the string in the original sample file(i.e., sample_df) before merging as a phyloseq object. There was an error of invalid class “phyloseq” object: Component sample names do not match, when I ran the merge_sample() and also the warning of NAs coercion. I also tried to define the numeric Site as a factor, but ended up getting the same error and warning.

I wonder if this is an issue related to my R version? I tried both 4.3.1 and 4.2.3, both ended up the same. Is there another way to merge sample OTUs by sample type or other variables from sample file(i.e., samples_df)?

ycl6 commented 1 year ago

Hi @Ecotone23 #663 seems to address the same issue you had.

I didn't had this myself, but I guess (like you said), might be the data type of the variable you are trying to merge that's not quite compatible with the merge function.

What do you get when you do str(sample_data(ps)) before merging?

Ecotone23 commented 1 year ago

Hi @ycl6 ,

Just tried to do str(sample_data(ps)) before merging and NAs are still introduced by coercion. I found a useful trick that can partially solve the problem:

after merging sample type using "ps2 <- merge_samples(ps, "SampleType") " and introducing NAs, running "sample_data(ps2)$SampleType<-sample_names(ps2)" can recover the original strings for the merged variable SampleType.

Although this cannot recover all other variables from the sample_df that have been coerced to NAs, it is fair enough for generating a relative abundance of phylum(y) against SampleType(x) plot.

Yes I also found this related issue #663 and followed the codes to convert all my categorial variables to factors before merging. This can stop the NA coercion warning but the merging was not successful and I still got numbers instead of character names of my SampleType as the X axis in the plot.

I have yet to discover an ideal resolution for handling the frequently encountered NA coercion issue when executing merge_samples().

I truly appreciate all your comments and suggestions. Thanks a million!

ycl6 commented 1 year ago

Hi @Ecotone23 Sorry for not making it clear. The reason to see your str(sample_data(ps)) is so that we can figure out how your sample_data structure is different from the demo data, which works fine.

So let's do a little investigation into what works and what doesn't.

As you can see from my little test, merge_samples absolutely does not like integers or integers of a factor type. It tolerates characters, but the resulting object's SampleType becomes NA. However, it's alright because the sample names are still correctly transferred over.

If you want to avoid the warning message, you can follow the last step where I change it back from character type to factor. _And yes, you can treat sample_data(ps) like a data.frame_ ;)

    library(phyloseq)
    data(GlobalPatterns)

    # To simplify the GP data
    ps = subset_taxa(GlobalPatterns, Kingdom == "Bacteria")
    ps = prune_taxa(taxa_sums(ps) > 1000, ps)

    str(sample_data(ps))
    #> 'data.frame':    26 obs. of  7 variables:
    #> Formal class 'sample_data' [package "phyloseq"] with 4 slots
    #>   ..@ .Data    :List of 7
    #>   .. ..$ : Factor w/ 26 levels "AQC1cm","AQC4cm",..: 5 4 21 14 11 15 12 9 16 13 ...
    #>   .. ..$ : Factor w/ 26 levels "ILBC_01","ILBC_02",..: 1 2 3 4 5 6 7 8 9 10 ...
    #>   .. ..$ : Factor w/ 26 levels "AACGCA","AACTCG",..: 1 2 3 4 5 6 7 8 9 10 ...
    #>   .. ..$ : Factor w/ 26 levels "AACTGT","ACAGGT",..: 24 13 3 20 9 5 17 7 19 11 ...
    #>   .. ..$ : Factor w/ 26 levels "AGAGAGACAGG",..: 10 6 18 21 8 9 16 19 26 20 ...
    #>   .. ..$ : Factor w/ 9 levels "Feces","Freshwater",..: 8 8 8 1 1 7 7 7 9 9 ...
    #>   .. ..$ : Factor w/ 25 levels "Allequash Creek, 0-1cm depth",..: 4 5 20 14 11 15 12 9 16 13 ...
    #>   ..@ names    : chr  "X.SampleID" "Primer" "Final_Barcode" "Barcode_truncated_plus_T" ...
    #>   ..@ row.names: chr  "CL3" "CC1" "SV1" "M31Fcsw" ...
    #>   ..@ .S3Class : chr "data.frame"

    str(sample_data(ps)$SampleType)
    #>  Factor w/ 9 levels "Feces","Freshwater",..: 8 8 8 1 1 7 7 7 9 9 ...

    levels(sample_data(ps)$SampleType)
    #> [1] "Feces"              "Freshwater"         "Freshwater (creek)"
    #> [4] "Mock"               "Ocean"              "Sediment (estuary)"
    #> [7] "Skin"               "Soil"               "Tongue"

    ps1 = ps2 = ps3 = ps4 = ps

    # ps1, SampleType is factor type with characters
    # No error
    str(sample_data(ps1)$SampleType)
    #>  Factor w/ 9 levels "Feces","Freshwater",..: 8 8 8 1 1 7 7 7 9 9 ...
    ps1m = merge_samples(ps1, "SampleType")
    sample_data(ps1m)
    #>                    X.SampleID Primer Final_Barcode Barcode_truncated_plus_T
    #> Feces                    19.0   13.5          13.5                16.500000
    #> Freshwater               15.0   11.5          11.5                12.000000
    #> Freshwater (creek)        2.0   14.0          14.0                13.000000
    #> Mock                      7.0   25.0          25.0                12.333333
    #> Ocean                    18.0   17.0          17.0                13.666667
    #> Sediment (estuary)       23.0   20.0          20.0                15.000000
    #> Skin                     12.0    7.0           7.0                 9.666667
    #> Soil                     10.0    2.0           2.0                13.333333
    #> Tongue                   14.5    9.5           9.5                15.000000
    #>                    Barcode_full_length SampleType Description
    #> Feces                        13.750000          1   18.500000
    #> Freshwater                    4.500000          2   15.500000
    #> Freshwater (creek)            6.666667          3    2.000000
    #> Mock                         16.000000          4    7.000000
    #> Ocean                        17.000000          5   18.000000
    #> Sediment (estuary)           14.666667          6   22.666667
    #> Skin                         14.666667          7   12.000000
    #> Soil                         11.333333          8    9.666667
    #> Tongue                       23.000000          9   14.500000

    # ps2, SampleType is character type
    # Warning: NAs introduced by coercion
    sample_data(ps2)$SampleType = as.character(sample_data(ps2)$SampleType)
    str(sample_data(ps2)$SampleType)
    #>  chr [1:26] "Soil" "Soil" "Soil" "Feces" "Feces" "Skin" "Skin" "Skin" ...
    ps2m = merge_samples(ps2, "SampleType")
    #> Warning in asMethod(object): NAs introduced by coercion
    sample_data(ps2m)
    #>                    X.SampleID Primer Final_Barcode Barcode_truncated_plus_T
    #> Feces                    19.0   13.5          13.5                16.500000
    #> Freshwater               15.0   11.5          11.5                12.000000
    #> Freshwater (creek)        2.0   14.0          14.0                13.000000
    #> Mock                      7.0   25.0          25.0                12.333333
    #> Ocean                    18.0   17.0          17.0                13.666667
    #> Sediment (estuary)       23.0   20.0          20.0                15.000000
    #> Skin                     12.0    7.0           7.0                 9.666667
    #> Soil                     10.0    2.0           2.0                13.333333
    #> Tongue                   14.5    9.5           9.5                15.000000
    #>                    Barcode_full_length SampleType Description
    #> Feces                        13.750000         NA   18.500000
    #> Freshwater                    4.500000         NA   15.500000
    #> Freshwater (creek)            6.666667         NA    2.000000
    #> Mock                         16.000000         NA    7.000000
    #> Ocean                        17.000000         NA   18.000000
    #> Sediment (estuary)           14.666667         NA   22.666667
    #> Skin                         14.666667         NA   12.000000
    #> Soil                         11.333333         NA    9.666667
    #> Tongue                       23.000000         NA   14.500000

    # ps3, SampleType is factor type with integers
    # Error: invalid class “phyloseq” object
    levels(sample_data(ps3)$SampleType) = 1:nlevels(sample_data(ps3)$SampleType)
    str(sample_data(ps3)$SampleType)
    #>  Factor w/ 9 levels "1","2","3","4",..: 8 8 8 1 1 7 7 7 9 9 ...
    ps3m = merge_samples(ps3, "SampleType")
    #> Error in validObject(.Object): invalid class "phyloseq" object: 
    #>  Component sample names do not match.
    #>  Try sample_names()
    #sample_data(ps3m)

    # ps3, SampleType is integer type
    # Error: invalid class “phyloseq” object
    sample_data(ps4)$SampleType = as.integer(sample_data(ps4)$SampleType)
    str(sample_data(ps4)$SampleType)
    #>  int [1:26] 8 8 8 1 1 7 7 7 9 9 ...
    ps4m = merge_samples(ps4, "SampleType")
    #> Error in validObject(.Object): invalid class "phyloseq" object: 
    #>  Component sample names do not match.
    #>  Try sample_names()
    #sample_data(ps4m)

    # Change ps2 SampleType to factor
    sample_data(ps2)$SampleType = as.factor(sample_data(ps2)$SampleType)
    str(sample_data(ps2)$SampleType)
    #>  Factor w/ 9 levels "Feces","Freshwater",..: 8 8 8 1 1 7 7 7 9 9 ...
    ps2m = merge_samples(ps2, "SampleType")
    sample_data(ps2m)
    #>                    X.SampleID Primer Final_Barcode Barcode_truncated_plus_T
    #> Feces                    19.0   13.5          13.5                16.500000
    #> Freshwater               15.0   11.5          11.5                12.000000
    #> Freshwater (creek)        2.0   14.0          14.0                13.000000
    #> Mock                      7.0   25.0          25.0                12.333333
    #> Ocean                    18.0   17.0          17.0                13.666667
    #> Sediment (estuary)       23.0   20.0          20.0                15.000000
    #> Skin                     12.0    7.0           7.0                 9.666667
    #> Soil                     10.0    2.0           2.0                13.333333
    #> Tongue                   14.5    9.5           9.5                15.000000
    #>                    Barcode_full_length SampleType Description
    #> Feces                        13.750000          1   18.500000
    #> Freshwater                    4.500000          2   15.500000
    #> Freshwater (creek)            6.666667          3    2.000000
    #> Mock                         16.000000          4    7.000000
    #> Ocean                        17.000000          5   18.000000
    #> Sediment (estuary)           14.666667          6   22.666667
    #> Skin                         14.666667          7   12.000000
    #> Soil                         11.333333          8    9.666667
    #> Tongue                       23.000000          9   14.500000

    # Replace the `SampleType` with actual name
    sample_data(ps2m)$SampleType = rownames(sample_data(ps2m))
    sample_data(ps2m)
    #>                    X.SampleID Primer Final_Barcode Barcode_truncated_plus_T
    #> Feces                    19.0   13.5          13.5                16.500000
    #> Freshwater               15.0   11.5          11.5                12.000000
    #> Freshwater (creek)        2.0   14.0          14.0                13.000000
    #> Mock                      7.0   25.0          25.0                12.333333
    #> Ocean                    18.0   17.0          17.0                13.666667
    #> Sediment (estuary)       23.0   20.0          20.0                15.000000
    #> Skin                     12.0    7.0           7.0                 9.666667
    #> Soil                     10.0    2.0           2.0                13.333333
    #> Tongue                   14.5    9.5           9.5                15.000000
    #>                    Barcode_full_length         SampleType Description
    #> Feces                        13.750000              Feces   18.500000
    #> Freshwater                    4.500000         Freshwater   15.500000
    #> Freshwater (creek)            6.666667 Freshwater (creek)    2.000000
    #> Mock                         16.000000               Mock    7.000000
    #> Ocean                        17.000000              Ocean   18.000000
    #> Sediment (estuary)           14.666667 Sediment (estuary)   22.666667
    #> Skin                         14.666667               Skin   12.000000
    #> Soil                         11.333333               Soil    9.666667
    #> Tongue                       23.000000             Tongue   14.500000

Created on 2023-10-06 with reprex v2.0.2

Ecotone23 commented 1 year ago

Hi @ycl6 , I greatly appreciate your comprehensive explanation and the clear, step-by-step guidance you provided. It has not only helped me grasp the entire process but has also enriched my understanding beyond the initial queries. Thank you so much!

You are right that all my categorial variables from sample data are recognized as character rather than factor.

using the following codes you just provided, I converted all categorical variables to factor before merging: sample_data(ps)$Location = as.factor(sample_data(ps)$Location)

The rest of the steps run smoothly without any error or warning. I noticed that in the final data frame y4, all the factors have been converted to numbers. So in the ggplot(aes()) I will have to use as x= Sample instead of x= Location to generate a plot with Location as the X axis, right?
`> y1 <- tax_glom(decatur, taxrank = 'Phylum', NArm=TRUE) # agglomerate taxa

y2 <- merge_samples(y1, "Location") # merge samples on sample variable of interest y3 <- transform_sample_counts(y2, function(x) x/sum(x)*100) #get abundance in % y4 <- psmelt(y3) # create dataframe from phyloseq object y4$Phylum <- as.character(y4$Phylum) #convert to character mean<- ddply(y4, ~Phylum, function(x) c(mean=mean(x$Abundance))) # group dataframe by Phylum, calculate mean rel. abundance Other <- mean[mean$mean <= 1,]$Phylum # find Phyla whose rel. abund. is less than 1% y4[y4$Phylum %in% Other,]$Phylum <- 'Other' # change their name to "Other" y4`

Screenshot 2023-10-06 at 6 08 57 PM

`#plot a stacked barplot StackedBarPlot_phylum <- y4 %>% ggplot(aes(x =Sample, y = Abundance, fill = Phylum)) + geom_bar(stat = "identity") + labs(x = "Sampling sites", y = "Relative abundance (%)", title = "") + scale_fill_manual(values=c('khaki4','seagreen','sienna1', 'slateblue', 'skyblue'))+

facet_grid(.~County, scales = "free") +

theme( axis.text.x = element_text(size = 12, angle = 45, vjust = 1, hjust = 1),axis.title.x=element_text(size = 15), axis.text.y = element_text(size = 12), axis.title.y=element_text(size = 15), legend.text = element_text(size = 10), legend.position = "right", strip.text = element_text(size = 12) )+guides(fill=guide_legend(ncol =1))

StackedBarPlot_phylum ` I got a nice stacked barplot, but would like to make a little change as depicted below:

Screenshot 2023-10-06 at 5 33 04 PM

My follow-up question is how to add my two County names as two facets, and sort the position according to County as shown in the attached pic?