joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/
584 stars 187 forks source link

Find the Most abundant Taxa in individual samples #847

Open smajor1 opened 6 years ago

smajor1 commented 6 years ago

Hello, First, my obligatory brown-nosing: I love phyloseq and have been getting pretty intimate with it the past week. I've been trying to transition over to using R for my informatics work because I find using QIIME on the virtual machine kind of clunky and this package is everything I needed! Anyway, I have a phyloseq object with some merged samples and I'm trying to find the most abundant taxa for each sample. I have 28 samples and 5000 otus. I generated the code...

top.pdy.phyl <- tax_glom(top.pdy, "Phylum")
abund<- names(sort(taxa_sums(top.pdy.phyl),T, function(x) which(max==x)))
top <-  prune_taxa(abund,top.pdy.phyl)

...thinking this would look at the taxa sums, go through each sample (1-28), identify which agglomerated OTU representative is the most abundant in each sample, then make a new object with only those most abundant OTUs. The result was a little different than I expected; it returned an object with 33 taxa (33 different phyla). I inspected the otu_table and compared the OTU id's to the corresponding ID's in the tax_table and found that most of these identified OTUs are nowhere near the most abundant, though the highly abundant OTUs do appear to be there. Does any one have any ideas how to reorganize this and then subsequently explicitly show which taxa is most abundant in which sample?

Thanks a bunch!

smajor1 commented 6 years ago

I created my first function to get this done! I also shared it on the "issues" board for other people to use.

# function to find the most abundant taxa
# Goes through a phyloseq object, picks out the most abundant taxa and gives the abundance for each
# and identifies which taxa is most abundant for which sample
# 
find.top.taxa <- function(x,taxa){
  require(phyloseq)
  top.taxa <- tax_glom(x, taxa)
  otu <- otu_table(t(top.taxa)) # remove the transformation if using a merge_sample object
  tax <- tax_table(top.taxa)
  j<-apply(otu,1,which.max)
  k <- j[!duplicated(j)]
  l <- data.frame(tax[k,])
  m <- data.frame(otu[,k])
  s <- as.name(taxa)
  colnames(m) = l[,taxa]
  n <- colnames(m)[apply(m,1,which.max)]
  m[,taxa] <- n
  return(m)
}
find.top.taxa(top.pdy,"Phylum")
sugll commented 6 years ago

how to find the relative abundant Taxa of top 25 in individual samples?

cyklee commented 5 years ago

@smajor1, thanks for the handy code!

For those who stumbled upon this topic. It appeared phyloseq has changed its object structure slightly (or how data.frame interacts with it), and the data matrix has to be accessed via @.Data as per #983 :

find.top.taxa <- function(x,taxa){
  require(phyloseq)
  top.taxa <- tax_glom(x, taxa)
  otu <- otu_table(top.taxa) # remove the transformation if using a merge_sample object
  tax <- tax_table(top.taxa)
  j<-apply(otu,1,which.max)
  k <- j[!duplicated(j)]
  l <- data.frame(tax@.Data[k,]) # note the modification here and the line below
  m <- data.frame(otu@.Data[,k])
  s <- as.name(taxa)
  colnames(m) = l[,taxa]
  n <- colnames(m)[apply(m,1,which.max)]
  m[,taxa] <- n
  return(m)
}

@sugll - perhaps this may offer some ideas in place of which.max.

cyklee commented 5 years ago

@sugll and others who may be interested in finding more than the top taxon per sample, here's my (unverified) contribution:

find.top.taxa2 <- function(x,taxa,num){
  require(phyloseq)
  require(magrittr)

  top.taxa <- tax_glom(x, taxa)
  otu <- otu_table(top.taxa) # remove the transformation if using a merge_sample object
  tax <- tax_table(top.taxa)
  j1 <- apply(otu,1,sort,index.return=T, decreasing=T) # modifying which.max to return a list of sorted index
  j2 <- lapply(j1,'[[',"ix") # select for index

  #k <- j[!duplicated(j)] # Replaced with unique() below
  l <- data.frame(unique(tax@.Data[unlist(j2),]))
  m <- data.frame(otu@.Data[,unique(unlist(j2))])
  #s <- as.name(taxa) # This isn't necessary!
  colnames(m) = l[,taxa]
  n <- apply(m,1,sort,index.return=T, decreasing=T) %>%
    lapply('[[',"ix") %>%  # Extract index
    lapply(head,n=num) # This to returns the top x tax
  # I want to apply a list of list of index to obtains a list of list of translated names

  # https://stackoverflow.com/questions/31561238/lapply-function-loops-on-list-of-lists-r
  p <- list()
  for(i in 1:length(n)){
    p[[i]]<- colnames(m)[n[[i]]]
  }
  m$taxa <- p # replacing [,taxa], but now the new column is called "taxa" instead of the inputted taxonomic rank
  return(m)
}

# Test
find.top.taxa2(ps,"Genus",25)$taxa # Here are the top 25 most abundant genera per sample

Relative abundance transformation isn't really necessary since the comparison is per sample. With thee development of ASV methods (e.g. DADA2) perhaps it would be useful to expand this code to output at ASV/OTU level, particularly after subsetting the dataset for a particular genus. I may update the above code to support that as well as tidying up the variable names.

cyklee commented 5 years ago
# Function to find the top most abundant ASV/OTU per sample, number of ASV/OTU per sample depends on user input. This is particularly relevant for finding strain level community structure when a few genera dominates the communities, for  example "is it a single variant of Pseudomonas dominating all the samples?"

find.top.asv <- function(x,num){
  require(phyloseq)
  require(magrittr)

  otu <- otu_table(x)
  tax <- tax_table(x)
  j1 <- apply(otu,1,sort,index.return=T, decreasing=T) # modifying which.max to return a list of sorted index
  j2 <- lapply(j1,'[[',"ix") # select for index

  l <- data.frame(unique(tax@.Data[unlist(j2),]))
  m <- data.frame(otu@.Data[,unique(unlist(j2))])
  n <- apply(m,1,sort,index.return=T, decreasing=T) %>%
    lapply('[[',"ix") %>%  # Extract index
    lapply(head,n=num) # This to returns the top x tax

  p <- list()
  for(i in 1:length(n)){
    p[[i]]<- colnames(m)[n[[i]]]
  }
  m$taxa <- p
  return(m)
}

# psedudo <- subset_taxa(asv.subset, Genus=="Pseudomonas") # Subset taxa to only Pseudomonas
# test <- find.top.asv(pseudo, 5) # Top 5 Pseudomonas ASVs per sample
# sort(test["sample-foo",-ncol(test)], decreasing = T) # Confirmation

# You should do some filtering to remove nonsensical results:
# test[rowSums(test[,-ncol(test)]) != 0,]   # remove samples without the taxa
# test[rowSums(test[,-ncol(test)]) > 1000,] # remove samples with less than 1000 total reads
pauGuas commented 2 years ago

Hi, I know this is from a long time ago but I am trying to figure out how I can know which sample is which when I get the output from this code.

pauGuas commented 2 years ago

Okay, I asked the great Dr. John Quensen for help with this code. Sorry my formatting is obnoxious, I'm not great at these things. The "find.top.taxa2" above doesn't add sample names to the output, so he edited the code to read as such:

`p <- ps

find.top.taxa2 <- function(x,taxa,num){ require(phyloseq) require(magrittr)

top.taxa <- tax_glom(x, taxa) otu <- otu_table(top.taxa) # remove the transformation if using a merge_sample object tax <- tax_table(top.taxa) j1 <- apply(otu,1,sort,index.return=T, decreasing=T) # modifying which.max to return a list of sorted index j2 <- lapply(j1,'[[',"ix") # select for index

k <- j[!duplicated(j)] # Replaced with unique() below

l <- data.frame(unique(tax@.Data[unlist(j2),])) m <- data.frame(otu@.Data[,unique(unlist(j2))])

s <- as.name(taxa) # This isn't necessary!

colnames(m) = l[,taxa] n <- apply(m,1,sort,index.return=T, decreasing=T) %>% lapply('[[',"ix") %>% # Extract index lapply(head,n=num) # This to returns the top x tax

I want to apply a list of list of index to obtains a list of list of translated names

https://stackoverflow.com/questions/31561238/lapply-function-loops-on-list-of-lists-r

p <- list() for(i in 1:length(n)){ p[[i]]<- colnames(m)[n[[i]]] } m$taxa <- p # replacing [,taxa], but now the new column is called "taxa" instead of the inputted taxonomic rank return(m) }

top6 <- find.top.taxa2(x=p, taxa="Genus", n=6)

rslt <- top6[, "taxa"] dd <- matrix(unlist(rslt), nrow=6) # n=6 because you wanted the top 6 taxa per sample. colnames(dd) <- rownames(top6) top6.a <- t(dd) write.csv(top6.a, "D:\Sequencing data\top6.a.csv")`