poyuliu / KTU

KTU: K-mer-based taxonomic clustering algorithm improves biological relevance in microbiome associated study
5 stars 0 forks source link

memory usage #3

Open csmiguel opened 2 years ago

csmiguel commented 2 years ago

Hi, KTU:klustering is using a tremendous amount of memory in my machine. Even dedicating around 30 GB of swap memory and 8Gb of RAM would not be enough for around 30k input sequences (20Mb of a plain fasta). Do you know how memory usage scales with the number of input sequences (or around 300nt)?

poyuliu commented 2 years ago

Hi,

Yes, that's a problem that we are also struggling with while inputting >20k ASVs (but the read length doesn't affect it). The PAM clustering algorithm may not be good at dealing with ultra-high dimension datasets (256 tetranucleotide frequencies * N ASVs, N > 20k). We are trying to modify our iterative searching method or find an alternative algorithm for the ktu clustering. In addition, memory usage will increase when using more cores. (Sometimes, using one core to run KTU::klustering would be available. But that is time-consuming)

Suppose your dataset contains quite distinct sample types (e.g., gut contents + seawater), which may increase sample-type-specific or group-specific ASV numbers. In that case, they may not share too many sample-type-specific ASVs among sample types. Based on that observation and assumption, a currently available option is subsetting the representative sequences by sample types to eliminate the effect of trivial sample-type. Then merging each klustering set.

Here is an example of re-analysing data from Rossbach et al. (2019) Tissue-Specific Microbiomes of the Red Sea Giant Clam Tridacna maxima Highlight Differential Abundance of Endozoicomonadaceae. This paper sequenced the digestive system, gill, and mantle tissues of giant clam (Tridacna maxima) and seawaters. There were 28777 ASVs picked from 58 samples (4 sample types). Initially, the KTU::klustering detected 9k-11k KTUs in the whole dataset. Then we implemented KTU::klustering separately for samples of clam tissues (16103 ASVs) respectively: digestive system (3664 ASVs -> 3065 KTUs), gill (4512 ASVs -> 2822 KTUs), and mantle tissues (9893 ASVs -> 5941 KTUs). After merging three subsets, 10896 KTUs were yielded. However, do not randomly subsetting. It would cause an over-clustering.

Hope this method will be available for you. If you have any other problems with running functions, I'm happy to assist.

Cheers, Po-Yu

csmiguel commented 2 years ago

Hi! thanks a lot for your insights. Happy to see there could be a way to circumvent this problem. The strategy you mention could be good but my samples are all related so they need to be analyzed altogether. I just killed the "klustering" job in a cluster that had been running for 20h and was using 83 Gb of memory. Instead, I am going to try a different strategy. Since I already have the ASVs with assigned taxonomy I will remove all of which have NO PHYLUM assigned as well as eukaryote ASVs (my dataset is bacterial 16S). That will reduce the number of ASVs in around a 40%, implying a loss of around 10% of the reads. I guess this minimal pre-filtering already reduces a lot the sparsity of the data matrix and it is ok since anyways, it is done in a later step of the bioinformatic pipeline. Cheers and looking forwards to see a KTU v2 with improved memory usage :)

poyuliu commented 2 years ago

Hi, thank you for sharing your strategy and your feedback! We'll do our best to improve the computing efficiency in the further version ASAP ;)

Best, Po-Yu

csmiguel commented 2 years ago

Hi, I came out with a workaround that works for me. After assigning taxonomic ranks to ASVs I run minimum filtering, removing ASVs with no phylum assigned or assigned to Eukaryotic organelles. Then, I run klustering per phylum using lapply. To avoid the klustering function from returning error for potential phyla with low number of ASVs, I set a threshold to rename low frequency phyla to "other". In such way, the max. memory usage is limited to the phylum with the largest number of ASVs (7000 for Proteobacteria in my dataset). So I am able to reduce by around x5 memory usage. I share the code with the functions (first chunk) and execution (second chunk):

# filter ASVs with no phylum assigned
ps_filter_phylum_is_NA <- function(ps = NULL) {
  require(phyloseq)
  #filter out ASVs with no Phylum assigned
  h <- subset_taxa(ps, !is.na(phylum))
  no_reads <- mean(rowSums(otu_table(h)) / rowSums(otu_table(ps)))
  no_taxa <- ntaxa(h) / ntaxa(ps)
  #report some results from the filtering
  cat("\nAfter removing ASVs with no phylum assigned, a",
      no_reads, "of the reads and a",
      no_taxa, "of the ASVs were retained.\n")
  return(h)
}

# function to filter eukaryotic ASVs from 16S metabarcoding.
ps_filter_organelles <- function(ps = NULL) {
  cloro <- phyloseq::subset_taxa(ps, order %in% "Chloroplast")
  mito <- phyloseq::subset_taxa(ps, family %in% "Mitochondria")
  h <- phyloseq::subset_taxa(
    ps, !order %in% "Chloroplast" & !family %in% "Mitochondria")
  cat("\nA total of", ntaxa(cloro), "ASVs from Chloroplast and", ntaxa(mito),
      "ASVs from Mitochondria were removed from the phyloseq object\n")
  assertthat::assert_that(
    sum(
      grepl(
        x = tax_table(h),
        pattern = "[Cc]hloroplast|[Mm]itochondria|[Ee]ukary")
    ) == 0,
    msg = "Not all organelle DNA has been filtered.")
  cat("\nAny taxonomic rank matching of",
      "[Cc]hloroplast|[Mm]itochondria|[Ee]ukary has been removed\n")
  no_reads <- mean(rowSums(otu_table(h)) / rowSums(otu_table(ps)))
  no_taxa <- ntaxa(h) / ntaxa(ps)
  #report some results from the filtering
  cat("\nAfter removing ASVs from organelles, a",
      no_reads, "of the reads and a",
      no_taxa, "of the ASVs were retained.\n")
  return(h)
}

# return feature table from KTU:klustering with clustered OTUs and with DNA sequences as rownames
klustering2otutable <- function(kobject = NULL) {
  otut <- kobject$KTU.table
  seqshash <- kobject$ReqSeq
  stopifnot(all(names(seqshash) == rownames(otut)))
  rownames(otut) <- seqshash
  return(otut)
}

# otu table to data frame from phyloseq object
otu_table2df <- function(ps = NULL) {
  require(phyloseq)
  # Extract abundance matrix from the phyloseq object
  otu1 <- as(otu_table(ps), "matrix")
  # transpose if necessary
  if(!taxa_are_rows(ps)){otu1 <- t(otu1)}
  # Coerce to data.frame
  as.data.frame(otu1)
}
# create feature table from phyloseq with rownames as DNA sequences
ps2otutable <- function(ps = NULL) {
  otut <-
    otu_table(ps) %>%
    {if(!taxa_are_rows(.)) {
      t(.) 
    } else . }
  rownames(otut) <- as.character(refseq(ps))
  return(otut)
}

# run klustering from dada or phyloseq object
dada2KTU <- function(data = NULL, subset = NULL, path2fasta = NULL, ncores = 1) {
  # data, phyloseq or seqtab
  # subset, a numberic for debugging. Limit klustering to first n ASVs in subset.
  # path2fasta, path to store temporary fasta with all seqs, previous to clustering.
  # ncores, number of cores to be used by klustering
  require(Biostrings); require(dplyr); require(phyloseq)
  #feature table (ie OTU table) from phyloseq
  ft <-
    if(class(data) == "phyloseq") {
      otu_table2df(data)
    } else if(class(data)[1] =="matrix") {
      data.frame(t(data))
    }
  ft <- tibble::rownames_to_column(ft, "asv")
  # write fasta from data
  seqs <-
    if(is.null(refseq(data, errorIfNULL = F))) {
      hh <- Biostrings::DNAStringSet(dimnames(data)[[2]])
      names(hh) <- ft$asv
      hh
    } else if(!is.null(refseq(data, errorIfNULL = F))) {
      phyloseq::refseq(data)
    }
  if(is.numeric(subset)){
    seqs <- seqs[1:subset]
    ft <- ft[1:subset, ]
  }
  # check seqs is a DNA string set
  stopifnot(class(seqs) == "DNAStringSet")
  # write seqs to fasta. Alternatively, the "KTU::klustering" function could be edited to accept directly the DNAStringSet to the "scaffold" object inside the function instead of having to write it. 
  Biostrings::writeXStringSet(seqs,
                              format = "fasta",
                              filepath = path2fasta,
                              width = max(seqs@ranges@width))
  # run clustering
  KTU::klustering(repseq = path2fasta,
                  feature.table = ft,
                  write.fasta = TRUE,
                  cores = ncores)
}

KTU computation:


library(dplyr)
library(KTU)
library(phyloseq)
library(Biostrings)
library(digest)

#read load phyloseq
# ps format: rank "phylum" must be spelled extactly "phylum"; refseq slot must be present.
ps <- readRDS("ps.rds")

#load functions for filtering phyloseq objects
source("otu_table2df.r")
source("dada2KTU.R")
source("klustering2otutable.r")
source("ps2otutable.r")

# minimum filtering of phyloseq
ps_filt <-
  ps %>%
  ps_filter_phylum_is_NA() %>%
  ps_filter_organelles()

# low frequency phyla: all for which their cummulated sum  of ASVs is below 1000 in increasing order.
lowfreq_phyl <-
  as.data.frame(tax_table(ps_filt)) %>%
  count(phylum) %>%
  arrange(n) %>%
  mutate(cums = cumsum(n)) %>%
  filter(cums < 1000) %>% 
  pull(phylum)
# ps with low freq phyla renamed to "other"
#   vector with renamed phyla
glomphyl <-
  as.data.frame(tax_table(ps_filt)) %>%
  mutate(phylum = replace(phylum,
                          which(phylum %in% lowfreq_phyl),
                          "other")) %>%
  pull(phylum)
#   replace phyla
tax_table(ps_filt)@.Data[,2] <- glomphyl

# create table with all identified phyla and their corresponding md5 checksum
phylumMD5 <-
  as.data.frame(tax_table(ps_filt)) %>%
  count(phylum) %>%
  arrange(n) %>%
  mutate(md5 = sapply(phylum, digest, algo = "md5"),
         cums = cumsum(n))
# print to screen
cat("\nPhlumMD5 created")
# compute KTUs
kotus <-
  phylumMD5$md5 %>%
  lapply(function(md5x) {
    #phylum name for given md5
    phylumx <<- phylumMD5[phylumMD5$md5 == md5x, 1]
    # subset given phylumx from the phyloseq objecy
    psx <-
      phyloseq::subset_taxa(physeq = ps_filt,
                            phylum == phylumx)
    dada2KTU(data = psx,
             path2fasta = paste0(md5x, ".fna"),
             ncores = 1) %>%
      klustering2otutable()
  })  %>%
  setNames(NULL) %>% # to avoid renaming rownames
  do.call(what = "rbind")

saveRDS(kotus, "ktus.rds")
poyuliu commented 2 years ago

Great!! @csmiguel, Many thanks for sharing this excellent approach! It seems sensible and biological-meaningful to deal with large datasets and memory usage issues. My colleagues and I will try this approach with our datasets. I'm also trying to modify the algorithm based on this idea without a previous taxonomy assignment.

Cheers, Po-Yu

csmiguel commented 2 years ago

Thanks! cool! yes, this approach is a bit too complex but it works. Otherwise, I guess a rough clustering with no taxonomic info before determining KTUs could work as well.