Loop for Subsetting Phyloseq by Samples?

phorve commented 4 years ago

Hi all,

I am hoping to create a compositional barplot for each individual sample in a study. I have my data in a phyloseq, and I was able to create the compositional barplot for the top25 taxa across all samples using the code below:

view_final_vst is my normalized phyloseq object.

## get top 25 most abundant Families, merge remaining Families into "Other," and glom by Family
# Make a compositional bar plot
sort.tax <- sort(tapply(taxa_sums(view_final_vst), tax_table(view_final_vst)[, "Genus"], sum), TRUE)
length(sort.tax)
top.tax <- sort.tax[1:25] #what are the top 25 most abundant Families across the whole study?
write.csv(top.tax, "top25Taxa.csv")

bottom.tax <- sort.tax[26:length(sort.tax)] 
top.sub <- subset_taxa(view_final_vst, Genus %in% names(top.tax)) #get top 25 most abundant Genus
bottom.sub <- subset_taxa(view_final_vst, Genus %in% names(bottom.tax)) #get all other taxa
bottom.sub.m <- merge_taxa(view_final_vst, taxa_names(bottom.sub), archetype=1) #merge all other taxa into Genus "Other"
tax_table(bottom.sub.m)[,6][is.na(tax_table(bottom.sub.m)[,6])] <- "Other"
bottom.sub.m <- tax_glom(bottom.sub.m, taxrank="Genus")
match()
# Replace row name with genus 
## stacked barplots to compare proportional composition 
bottom.sub.rel <- transform_sample_counts(bottom.sub.m, function(x) 100 * x/sum(x))
df_long <- psmelt(bottom.sub.rel) # first change to long format
old.lvl <- levels(df_long$Genus)
df_long$Genus <- factor(df_long$Genus, levels=c("Other", sort(old.lvl[old.lvl!="Other"], decreasing=F)))

## Plot
ggplot(df_long, aes(x=Sample_name, y=Abundance, fill = Genus)) + 
  geom_bar(stat="identity", position="stack") + 
  labs(x="Sample", y="Percentage of Sequences") +
  scale_fill_manual(values=genusPal, na.value="darkgrey") +
  scale_color_manual(values=genusPal, na.value="darkgrey") + 
  theme(axis.title=element_text(size=10), axis.text.x=element_text(angle=90, size=6, hjust=1, vjust=0.5),
        legend.text=element_text(size=8), legend.key.size = unit(0.16,"in"), legend.position = "bottom",
        legend.title=element_blank(),
        plot.title = element_text(hjust = 0.5)) +
  labs(title = "Top 25 Taxa")

However, this just finds the top 25 taxa in the phyloseq object that I created. I was hoping to be able to make a loop that will run through each sample name in the phyloseq object and create an individual phyloseq object for each sample, then plot that phyloseq using the code above. However, I have been having difficulty figuring out how this loop would work, especially with phyloseq.

I did some searching through the previous issues and found a way to view the top 25 in each sample ( #847 ), but this does not allow me (at least as to my knowledge) to convert into a phyloseq object with abundance data.

I would really appreciate any help in trying pt put this loop together. I would also imagine that this could be helpful for individuals in the future that want to find the top however many taxa in individual samples.

Thanks in advance for any help!

jeaninebr commented 4 years ago

Hi phorve Probably you found a solution by now, however, I have a somewhat similar issue, but for different reasons.
In my case, I have 4 replicates (which are 1 sample each in my phyloseq object) and I want to merge them, but only taking OTUs that are present in at least 2 of the 4 replicates. With intersect and uncion, I manage to get the taxa_names of these OTUs, and I can subset samples in a loop, but I don't know how to reassemble these cropped samples into a new object. :/

What I did find is that you can create a 'for' loop and store the phyloseq object in a list.

for (i in 1:length(samples)) {

Choose one sample

site_i <- subset_samples(phylo, sample == sample[i]) ....

top25 is a list with the names of top25 otus in that sample, keep only those

Top25 <- prune_taxa(top25, site_i)

you can store the sample with only the top 25 OTUs in a "list"

pruned_data[[i]] <- Top25 }

However, unlike with normal lists, where you could simply use rbind to unlist this object, it does not work with phyloseq objects. I tried to write a loop for merging the objects, like the following:

merge each of the created phyloseq object with the previous one and overwrite

for (i in 1:length(samples)) {

take the first and second sample and merge them, assign to new phyloseq object

p_new <- phyloseq(pruned_data[[i]], pruned_data[[i+1]])

assign this new object as the first one in the next iteration

pruned_data[[i+1]] <- p_new }

However, it tells me that the tree cannot be merged, because of course, individual samples do not contain all tips... I really don't know how to work around. I think about dropping the phyloseq framework and work with normal tables instead... ?! I'd be glad to hear about your ideas.

cyklee commented 2 years ago

@phorve, subset_samples() and subset_taxa() use non-standard evaluation (see issue #752), which means you should only use it for interactive (i.e. not in loop) use. My solution to that is to first generate a logical vector based on whatever critieria you have, in this case, we are subsetting by SampleName. There is probably a much better way than

names <- sample_names(ps)

for (name in names){
    single_sample <- prune_samples(name, ps, ) # This is your single sample

    # Here you can do the top 25 ASV code, but avoid using subset_taxa()
    # Replace it with prune_taxa()
}

joey711 / phyloseq