Exporting data as bigwigs

bapoorva commented 5 years ago

Hi,

Great package for analysing scATAC-seq. I was wondering, is there is a way to export the per cluster coverage as wiggle tracks or bigwig file ? While the CoveragePlot function does the job, this is if we need to view it in UCSC genome browser.

Thanks

timoast commented 5 years ago

Hi, it’s not currently possible to create a bigwig for different groups of cells in Signac. I’d suggest writing the cell names to a file and then splitting the bam file by cell using the sinto package (https://github.com/timoast/sinto), and then creating bigwig tracks using deeptools (https://deeptools.readthedocs.io/en/develop/content/tools/bamCoverage.html)

enrllo commented 4 years ago

Hi, Thanks for the great package. I agree that it would be very useful to be able to export the per cluster coverage tracks. Are there plans to support that in future releases? Alternatively, would there be a way to export a file containing the peaks present in a given cluster (with certain reasonable cutoffs)? Thanks

timoast commented 4 years ago

@enrllo I don't have plans to support that in the near future

Here is an example showing how to find peaks open in >20% of a given group of cells:

library(Seurat); library(Signac)
cells.use <- WhichCells(pbmc, ident = 'pDC')
counts <- GetAssayData(pbmc, assay = 'peaks', slot = 'counts')[, cells.use]
percent.open <- rowSums(BinarizeCounts(object = counts)) / length(cells.use)
peaks <- names(which(percent.open > 0.2))

enrllo commented 4 years ago

Thanks!

michael-kotliar commented 2 years ago

Faced the same problem, so wrote this CWL file for that https://github.com/Barski-lab/workflows-datirium/blob/master/workflows/sc-genome-coverage.cwl The main scripts are https://github.com/Barski-lab/workflows/blob/master/tools/dockerfiles/scripts/extract_seurat_metadata.R and https://github.com/Barski-lab/workflows/blob/master/tools/dockerfiles/scripts/sc_split_atac.py Then use bedtools genomecov to generate bigWigs.

Baboon61 commented 1 year ago

Exporting bigwig files from a seurat object for multiples clusters and/or condition would be a really nice feature.

I rearranged the getGroupBW from ArchR.

The splitGroupFragments is extracting the fragments for each cell from each group and generate a new fragments file per group. This step is computationaly heavy and can take a while if the fragments file is huge. In order not to rerun the function many times to check if it is working, I made a test function that will take the first entries of the fragments file as a test set.

The exportGroupBW, on the other hand is quite fast to run and generates normalized bigwig files for each group from the split fragments file.

I am making some system calls here and there to manage, sort and compress the files.

The temporary directory is not deleted at the end of the process, if one wants to double check the split fragments files, or rerun from a specific group if the function crashed.

I am using the future package to split the fragments file, run the calculation in parallel and generate bigwig files.

I have not tried it on pbmc, if you do, let me know.

library(Signac)
library(Seurat)
library(GenomicRanges)
library(rtracklayer)
library(dplyr)
library(data.table)
library(purrr)
library(future)
library(future.apply)

Maybe some packages are missing

splitGroupFragments(
  object = object,
  assay = "ATAC",
  genome = "mm10",
  groupBy = c("cluster", "timepoint"),
  downGroupBy = list(cluster=c("c1", "c2"), timepoint=c("t1", "t2")), # If you are interested in only some clusters or timepoints. It speeds up the calculation. "all" will all combinaison of arguments
  minCells = 5, #remove group with less than minCells
  maxCells = 1000, #cut off to maxCells per group
  nbFrags = 100000000, # A broad idea of how many fragments are in the fragment file. Better put more than less. You can "wc -l" on the fragments file
  threads=8,
  test=TRUE, #Try the script with the first "test.nbFrags" fragments of the fragments file
  test.nbFrags=10000,
  outdir = "~/bigwig",
  tmpdir = "~/bigwig/tmp
)

exportGroupBW(
  object = object,
  assay = "ATAC",
  genome = "mm10",
  groupBy = c("cluster", "timepoint"),
  downGroupBy = list(cluster=c("c1", "c2"), timepoint=c("t1", "t2")), # If you are interested in only some clusters or timepoints. It speeds up the calculation. "all" allows to keep all combinaison of arguments
  normMethod = "TSS.enrichment", #'ncells', 'none' or any quantitative values from @meta.data
  tileSize = 100,
  cutoff = 4, #Maximum number of fragment in a tile
  chromosome = "primary", #"primary" or "all"
  threads = 8, 
  outdir = "~/bigwig",
  tmpdir = "~/bigwig/tmp
)

splitGroupFragments <- function(
  object = NULL,
  assay = "ATAC",
  genome = "mm10",
  groupBy = "Sample",
  downGroupBy = "all",
  minCells = 5,
  maxCells = 1000,
  nbFrags = 10000000,
  threads=NULL,
  test=FALSE,
  test.nbFrags=NULL,
  outdir = NULL,
  tmpdir = NULL
  ){

    dir.create(file.path(outdir), showWarnings = FALSE)
    dir.create(tmpdir, showWarnings = FALSE)

    DefaultAssay(object) <- assay
    genome(object) <- genome

    #Get fragments file path
    if (length(Fragments(object)) > 1){
        print("The object contains a list of fragments files")
        frag_file <- unique(unlist(lapply(Fragments(object), function(x){GetFragmentData(x)})))
        if (frag_file > 1){
            #Could also try to run on each specific fragments files without splitting
            print("The list of fragments files are pointing to different fragment files")
            print("Please, aggregate the fragments files before submitting")
        }else{
            bed_input_path <- frag_file
        }
    }else{
        bed_input_path <- GetFragmentData(Fragments(object))
    }

    if(test){
        dir.create(paste0(tmpdir, "/test"), showWarnings = FALSE)
        nbFrags <- test.nbFrags
        print(paste0("Testing on a small chunk of the fragments file, selecting : ",nbFrags, " fragments"))
        system(sprintf("zcat %s | head -%d > %s/test/test_%s_atac_fragments.tsv", bed_input_path, nbFrags, tmpdir, as.character(nbFrags)))
        system(sprintf("gzip %s/test/test_%s_atac_fragments.tsv", tmpdir, as.character(nbFrags)))
        bed_input_path <- paste0(tmpdir, "/test/test_", nbFrags, "_atac_fragments.tsv.gz")
        print(paste0("Testing file located at : ",bed_input_path))
    }

    annot <- object@meta.data

    #Column to group the cells by
    Groups <- annot[, groupBy, drop=FALSE]

    if(unlist(downGroupBy)[1] != 'all'){
        Groups <- Groups[which(Reduce(`&`, Map(`%in%`, Groups[intersect(names(Groups), names(downGroupBy))], downGroupBy))),]
    }

    #Split cell names by group
    cellGroups <- lapply(split(rownames(Groups), factor(apply(Groups, 1, paste, collapse="_"))), unique)

    #Sample each cell group to maxCells
    if(!is.null(maxCells)){
        cellGroups <- lapply(cellGroups, function(x){if(length(x) <= maxCells){x}else{sample(x, maxCells)}})
    }

    #Remove group with less than maxCells
    if(!is.null(minCells)){
        cellGroups <- Filter(function(x) length(x) >= minCells, cellGroups)
    }

    #Create a vector of selected cells with the group as name
    cellGroups_rev_list <- unlist(cellGroups)
    names(cellGroups_rev_list) <- rep(names(cellGroups), sapply(cellGroups, length))

    ##If there is 1B line in the fragment file
    ##We can split the fragments file by the number of available threads (minus 1 thread not to overload)

    if (threads == 1){
        avai_threads <- 1
    }else{
        avai_threads <- threads - 1
    }
    theorical_nbfile_per_thread <- nbFrags/avai_threads
    suitable_nbfile_per_thread <- ceiling(theorical_nbfile_per_thread)
    print(paste0("Each sub-fragments files will contain : ", suitable_nbfile_per_thread," fragments"))
    system(sprintf("gunzip -c %s | split --lines %d --additional-suffix=.tsv - %s/sub_atac_fragments", bed_input_path, suitable_nbfile_per_thread, tmpdir))

    #Set number of thread in future
    plan("multicore", workers = avai_threads)

    files <- list.files(tmpdir)[list.files(tmpdir) %like% "sub_atac_fragments"]    
    for(group in unique(names(cellGroups_rev_list))){
        cells <- cellGroups_rev_list[names(cellGroups_rev_list)==group]
        future_lapply(files,.groupfragments, cells, group, tmpdir, future.stdout=FALSE)
        system(sprintf("cat %s/%s* | sort -k1,1 -k2,2n -k3,3n - > %s/%s.tsv", tmpdir, group, tmpdir, group))
        system(sprintf("rm %s/%s_sub*", tmpdir, group))
        system(sprintf("gzip %s/%s.tsv", tmpdir, group))
    }

    #Removing sub-fragments files
    system(sprintf("rm %s/sub_atac_fragments*", tmpdir))

    plan("multicore", workers = 1)

    if(test){
        system(sprintf("rm -r %s/test", tmpdir))
    }
  }

#Function to generate fragments files
.groupfragments <- function(x, cellname, cellgroup, tmpdir) {
  bed_input <- gzfile(paste0(tmpdir,"/",x), "r")
  bed_output <- file(paste0(tmpdir, "/", cellgroup, "_", strsplit(x,"\\.")[[1]][1], ".tsv"), "w")

  while(length(line <- readLines(bed_input, n = 1)) > 0) {
    cell_name_bed <- strsplit(line, "\t")[[1]][4]
    if (cell_name_bed %in% cellname) {
      writeLines(line, bed_output)
    }
  }
  close(bed_input)
  close(bed_output)
}

exportGroupBW  <- function(
  object = NULL,
  assay = "ATAC",
  genome = "mm10",
  groupBy = "Sample",
  downGroupBy = "all",
  normMethod = "TSS.enrichment", #'ncells', 'none' or any quantitative values from @meta.data
  tileSize = 100,
  cutoff = 4,
  chromosome = "primary", #primary or all
  threads = NULL,
  outdir = NULL,
  tmpdir = NULL
  ){

    DefaultAssay(object) <- assay
    genome(object) <- genome
    annot <- object@meta.data

    #Column to group the cells by
    Groups <- annot[, groupBy, drop=FALSE]

    GroupsNames <- unique(apply(annot[, groupBy], 1, paste, collapse = "_"))

    if(unlist(downGroupBy)[1] != 'all'){
        downGroupsNames <- apply(do.call(expand.grid, downGroupBy),1, paste, collapse = "_")
        GroupsNames <- GroupsNames[GroupsNames %in% downGroupsNames]
    }

    #Column to normalized by
    if (normMethod == 'ncells'){
        normBy <- normMethod
    } else{
        normBy <- annot[, normMethod, drop=FALSE]
    }

    #Get chromosome information
    if(chromosome=="primary"){
        prim_chr <- names(seqlengths(object)[!grepl("_alt|_fix|_random|chrUn", names(seqlengths(object)))])
        seqlevels(object) <- prim_chr
    }
    availableChr <- names(seqlengths(object))
    chromLengths <- seqlengths(object)
    chromSizes <- GRanges(seqnames = availableChr, IRanges(start = rep(1, length(availableChr)), end = as.numeric(chromLengths)))

    #Create tiles for each chromosome, from GenomicRanges
    tiles <- unlist(slidingWindows(chromSizes, width = tileSize, step = tileSize))

    if (threads == 1){
        avai_threads <- 1
    }else{
        avai_threads <- threads - 1
    }

    #Set number of thread in future
    plan("multicore", workers = avai_threads)

    #Run the creation of bigwig for each cellgroups
    covFiles <- future_lapply(GroupsNames, .createBWGroup, availableChr, tiles, normBy, tileSize, normMethod, cutoff, outdir, tmpdir)

    plan("multicore", workers = 1)

    covFiles

  }

.createBWGroup <- function(groupNamei, availableChr, tiles, normBy, tileSize, normMethod, cutoff, outdir, tmpdir){

    #Read the fragments file associated to the group
    fragi <- rtracklayer::import(paste0(tmpdir, "/", groupNamei, ".tsv.gz"),format = "bed")

    cellGroupi <- unique(fragi$name)

    #Open the writting bigwig file
    covFile <- file.path(outdir, paste0(groupNamei, "-TileSize-",tileSize,"-normMethod-",normMethod,".bw"))

    covList <- lapply(seq_along(availableChr), function(k){

        fragik <- fragi[seqnames(fragi) == availableChr[k],]
        tilesk <- tiles[BiocGenerics::which(seqnames(tiles) %bcin% availableChr[k])]

        if(length(fragik) == 0){
            tilesk$reads <- 0
        #If fragments
        }else{

          #N Tiles
          nTiles <- chromLengths[availableChr[k]] / tileSize
          #Add one tile if there is extra bases
          if (nTiles%%1 != 0) {
              nTiles <- trunc(nTiles) + 1
          }

          #Create Sparse Matrix
          matchID <- S4Vectors::match(mcols(fragik)$name, cellGroupi)

          #For each tiles of this chromosome, create start tile and end tile row, set the associated counts matching with the fragments
          mat <- Matrix::sparseMatrix(
              i = c(trunc(start(fragik) / tileSize), trunc(end(fragik) / tileSize)) + 1,
              j = as.vector(c(matchID, matchID)),
              x = rep(1,  2*length(fragik)),
              dims = c(nTiles, length(cellGroupi)))

          #Max count for a cells in a tile is set to ceiling (4)
          if(!is.null(cutoff)){
            mat@x[mat@x > cutoff] <- cutoff
          }

          #Sums the cells
          mat <- Matrix::rowSums(mat)

          tilesk$reads <- mat

          #Normalization of counts by the sum of readsintss for each cells in group
          if(normMethod == "ncells"){
              tilesk$reads <- tilesk$reads / length(cellGroupi)
          }else if(tolower(normMethod) %in% c("none")){
          }else{
              tilesk$reads <- tilesk$reads * 10^4 / sum(normBy[cellGroupi, 1])
          }
        }

        tilesk <- coverage(tilesk, weight = tilesk$reads)[[availableChr[k]]]

        tilesk

    })

    names(covList) <- availableChr

    covList <- as(covList, "RleList")

    rtracklayer::export.bw(object = covList, con = covFile)

    covFile
}

timoast commented 1 year ago

Thanks @Baboon61 for your work on this. Just want to point out the SplitFragments() function in Signac that may be a faster solution to splitGroupFragments here and is written in C++.

I'd be happy to consider a PR implementing bigwig export

stuart-lab / signac

Exporting data as bigwigs #28