Closed bapoorva closed 5 years ago
Hi, it’s not currently possible to create a bigwig for different groups of cells in Signac. I’d suggest writing the cell names to a file and then splitting the bam file by cell using the sinto package (https://github.com/timoast/sinto), and then creating bigwig tracks using deeptools (https://deeptools.readthedocs.io/en/develop/content/tools/bamCoverage.html)
Hi, Thanks for the great package. I agree that it would be very useful to be able to export the per cluster coverage tracks. Are there plans to support that in future releases? Alternatively, would there be a way to export a file containing the peaks present in a given cluster (with certain reasonable cutoffs)? Thanks
@enrllo I don't have plans to support that in the near future
Here is an example showing how to find peaks open in >20% of a given group of cells:
library(Seurat); library(Signac)
cells.use <- WhichCells(pbmc, ident = 'pDC')
counts <- GetAssayData(pbmc, assay = 'peaks', slot = 'counts')[, cells.use]
percent.open <- rowSums(BinarizeCounts(object = counts)) / length(cells.use)
peaks <- names(which(percent.open > 0.2))
Thanks!
Faced the same problem, so wrote this CWL file for that
https://github.com/Barski-lab/workflows-datirium/blob/master/workflows/sc-genome-coverage.cwl
The main scripts are
https://github.com/Barski-lab/workflows/blob/master/tools/dockerfiles/scripts/extract_seurat_metadata.R
and
https://github.com/Barski-lab/workflows/blob/master/tools/dockerfiles/scripts/sc_split_atac.py
Then use bedtools genomecov
to generate bigWigs.
Exporting bigwig files from a seurat object for multiples clusters and/or condition would be a really nice feature.
I rearranged the getGroupBW
from ArchR.
The splitGroupFragments
is extracting the fragments for each cell from each group and generate a new fragments file per group. This step is computationaly heavy and can take a while if the fragments file is huge. In order not to rerun the function many times to check if it is working, I made a test function that will take the first entries of the fragments file as a test set.
The exportGroupBW
, on the other hand is quite fast to run and generates normalized bigwig files for each group from the split fragments file.
I am making some system
calls here and there to manage, sort and compress the files.
The temporary directory is not deleted at the end of the process, if one wants to double check the split fragments files, or rerun from a specific group if the function crashed.
I am using the future package to split the fragments file, run the calculation in parallel and generate bigwig files.
I have not tried it on pbmc
, if you do, let me know.
library(Signac)
library(Seurat)
library(GenomicRanges)
library(rtracklayer)
library(dplyr)
library(data.table)
library(purrr)
library(future)
library(future.apply)
Maybe some packages are missing
splitGroupFragments(
object = object,
assay = "ATAC",
genome = "mm10",
groupBy = c("cluster", "timepoint"),
downGroupBy = list(cluster=c("c1", "c2"), timepoint=c("t1", "t2")), # If you are interested in only some clusters or timepoints. It speeds up the calculation. "all" will all combinaison of arguments
minCells = 5, #remove group with less than minCells
maxCells = 1000, #cut off to maxCells per group
nbFrags = 100000000, # A broad idea of how many fragments are in the fragment file. Better put more than less. You can "wc -l" on the fragments file
threads=8,
test=TRUE, #Try the script with the first "test.nbFrags" fragments of the fragments file
test.nbFrags=10000,
outdir = "~/bigwig",
tmpdir = "~/bigwig/tmp
)
exportGroupBW(
object = object,
assay = "ATAC",
genome = "mm10",
groupBy = c("cluster", "timepoint"),
downGroupBy = list(cluster=c("c1", "c2"), timepoint=c("t1", "t2")), # If you are interested in only some clusters or timepoints. It speeds up the calculation. "all" allows to keep all combinaison of arguments
normMethod = "TSS.enrichment", #'ncells', 'none' or any quantitative values from @meta.data
tileSize = 100,
cutoff = 4, #Maximum number of fragment in a tile
chromosome = "primary", #"primary" or "all"
threads = 8,
outdir = "~/bigwig",
tmpdir = "~/bigwig/tmp
)
splitGroupFragments <- function(
object = NULL,
assay = "ATAC",
genome = "mm10",
groupBy = "Sample",
downGroupBy = "all",
minCells = 5,
maxCells = 1000,
nbFrags = 10000000,
threads=NULL,
test=FALSE,
test.nbFrags=NULL,
outdir = NULL,
tmpdir = NULL
){
dir.create(file.path(outdir), showWarnings = FALSE)
dir.create(tmpdir, showWarnings = FALSE)
DefaultAssay(object) <- assay
genome(object) <- genome
#Get fragments file path
if (length(Fragments(object)) > 1){
print("The object contains a list of fragments files")
frag_file <- unique(unlist(lapply(Fragments(object), function(x){GetFragmentData(x)})))
if (frag_file > 1){
#Could also try to run on each specific fragments files without splitting
print("The list of fragments files are pointing to different fragment files")
print("Please, aggregate the fragments files before submitting")
}else{
bed_input_path <- frag_file
}
}else{
bed_input_path <- GetFragmentData(Fragments(object))
}
if(test){
dir.create(paste0(tmpdir, "/test"), showWarnings = FALSE)
nbFrags <- test.nbFrags
print(paste0("Testing on a small chunk of the fragments file, selecting : ",nbFrags, " fragments"))
system(sprintf("zcat %s | head -%d > %s/test/test_%s_atac_fragments.tsv", bed_input_path, nbFrags, tmpdir, as.character(nbFrags)))
system(sprintf("gzip %s/test/test_%s_atac_fragments.tsv", tmpdir, as.character(nbFrags)))
bed_input_path <- paste0(tmpdir, "/test/test_", nbFrags, "_atac_fragments.tsv.gz")
print(paste0("Testing file located at : ",bed_input_path))
}
annot <- object@meta.data
#Column to group the cells by
Groups <- annot[, groupBy, drop=FALSE]
if(unlist(downGroupBy)[1] != 'all'){
Groups <- Groups[which(Reduce(`&`, Map(`%in%`, Groups[intersect(names(Groups), names(downGroupBy))], downGroupBy))),]
}
#Split cell names by group
cellGroups <- lapply(split(rownames(Groups), factor(apply(Groups, 1, paste, collapse="_"))), unique)
#Sample each cell group to maxCells
if(!is.null(maxCells)){
cellGroups <- lapply(cellGroups, function(x){if(length(x) <= maxCells){x}else{sample(x, maxCells)}})
}
#Remove group with less than maxCells
if(!is.null(minCells)){
cellGroups <- Filter(function(x) length(x) >= minCells, cellGroups)
}
#Create a vector of selected cells with the group as name
cellGroups_rev_list <- unlist(cellGroups)
names(cellGroups_rev_list) <- rep(names(cellGroups), sapply(cellGroups, length))
##If there is 1B line in the fragment file
##We can split the fragments file by the number of available threads (minus 1 thread not to overload)
if (threads == 1){
avai_threads <- 1
}else{
avai_threads <- threads - 1
}
theorical_nbfile_per_thread <- nbFrags/avai_threads
suitable_nbfile_per_thread <- ceiling(theorical_nbfile_per_thread)
print(paste0("Each sub-fragments files will contain : ", suitable_nbfile_per_thread," fragments"))
system(sprintf("gunzip -c %s | split --lines %d --additional-suffix=.tsv - %s/sub_atac_fragments", bed_input_path, suitable_nbfile_per_thread, tmpdir))
#Set number of thread in future
plan("multicore", workers = avai_threads)
files <- list.files(tmpdir)[list.files(tmpdir) %like% "sub_atac_fragments"]
for(group in unique(names(cellGroups_rev_list))){
cells <- cellGroups_rev_list[names(cellGroups_rev_list)==group]
future_lapply(files,.groupfragments, cells, group, tmpdir, future.stdout=FALSE)
system(sprintf("cat %s/%s* | sort -k1,1 -k2,2n -k3,3n - > %s/%s.tsv", tmpdir, group, tmpdir, group))
system(sprintf("rm %s/%s_sub*", tmpdir, group))
system(sprintf("gzip %s/%s.tsv", tmpdir, group))
}
#Removing sub-fragments files
system(sprintf("rm %s/sub_atac_fragments*", tmpdir))
plan("multicore", workers = 1)
if(test){
system(sprintf("rm -r %s/test", tmpdir))
}
}
#Function to generate fragments files
.groupfragments <- function(x, cellname, cellgroup, tmpdir) {
bed_input <- gzfile(paste0(tmpdir,"/",x), "r")
bed_output <- file(paste0(tmpdir, "/", cellgroup, "_", strsplit(x,"\\.")[[1]][1], ".tsv"), "w")
while(length(line <- readLines(bed_input, n = 1)) > 0) {
cell_name_bed <- strsplit(line, "\t")[[1]][4]
if (cell_name_bed %in% cellname) {
writeLines(line, bed_output)
}
}
close(bed_input)
close(bed_output)
}
exportGroupBW <- function(
object = NULL,
assay = "ATAC",
genome = "mm10",
groupBy = "Sample",
downGroupBy = "all",
normMethod = "TSS.enrichment", #'ncells', 'none' or any quantitative values from @meta.data
tileSize = 100,
cutoff = 4,
chromosome = "primary", #primary or all
threads = NULL,
outdir = NULL,
tmpdir = NULL
){
DefaultAssay(object) <- assay
genome(object) <- genome
annot <- object@meta.data
#Column to group the cells by
Groups <- annot[, groupBy, drop=FALSE]
GroupsNames <- unique(apply(annot[, groupBy], 1, paste, collapse = "_"))
if(unlist(downGroupBy)[1] != 'all'){
downGroupsNames <- apply(do.call(expand.grid, downGroupBy),1, paste, collapse = "_")
GroupsNames <- GroupsNames[GroupsNames %in% downGroupsNames]
}
#Column to normalized by
if (normMethod == 'ncells'){
normBy <- normMethod
} else{
normBy <- annot[, normMethod, drop=FALSE]
}
#Get chromosome information
if(chromosome=="primary"){
prim_chr <- names(seqlengths(object)[!grepl("_alt|_fix|_random|chrUn", names(seqlengths(object)))])
seqlevels(object) <- prim_chr
}
availableChr <- names(seqlengths(object))
chromLengths <- seqlengths(object)
chromSizes <- GRanges(seqnames = availableChr, IRanges(start = rep(1, length(availableChr)), end = as.numeric(chromLengths)))
#Create tiles for each chromosome, from GenomicRanges
tiles <- unlist(slidingWindows(chromSizes, width = tileSize, step = tileSize))
if (threads == 1){
avai_threads <- 1
}else{
avai_threads <- threads - 1
}
#Set number of thread in future
plan("multicore", workers = avai_threads)
#Run the creation of bigwig for each cellgroups
covFiles <- future_lapply(GroupsNames, .createBWGroup, availableChr, tiles, normBy, tileSize, normMethod, cutoff, outdir, tmpdir)
plan("multicore", workers = 1)
covFiles
}
.createBWGroup <- function(groupNamei, availableChr, tiles, normBy, tileSize, normMethod, cutoff, outdir, tmpdir){
#Read the fragments file associated to the group
fragi <- rtracklayer::import(paste0(tmpdir, "/", groupNamei, ".tsv.gz"),format = "bed")
cellGroupi <- unique(fragi$name)
#Open the writting bigwig file
covFile <- file.path(outdir, paste0(groupNamei, "-TileSize-",tileSize,"-normMethod-",normMethod,".bw"))
covList <- lapply(seq_along(availableChr), function(k){
fragik <- fragi[seqnames(fragi) == availableChr[k],]
tilesk <- tiles[BiocGenerics::which(seqnames(tiles) %bcin% availableChr[k])]
if(length(fragik) == 0){
tilesk$reads <- 0
#If fragments
}else{
#N Tiles
nTiles <- chromLengths[availableChr[k]] / tileSize
#Add one tile if there is extra bases
if (nTiles%%1 != 0) {
nTiles <- trunc(nTiles) + 1
}
#Create Sparse Matrix
matchID <- S4Vectors::match(mcols(fragik)$name, cellGroupi)
#For each tiles of this chromosome, create start tile and end tile row, set the associated counts matching with the fragments
mat <- Matrix::sparseMatrix(
i = c(trunc(start(fragik) / tileSize), trunc(end(fragik) / tileSize)) + 1,
j = as.vector(c(matchID, matchID)),
x = rep(1, 2*length(fragik)),
dims = c(nTiles, length(cellGroupi)))
#Max count for a cells in a tile is set to ceiling (4)
if(!is.null(cutoff)){
mat@x[mat@x > cutoff] <- cutoff
}
#Sums the cells
mat <- Matrix::rowSums(mat)
tilesk$reads <- mat
#Normalization of counts by the sum of readsintss for each cells in group
if(normMethod == "ncells"){
tilesk$reads <- tilesk$reads / length(cellGroupi)
}else if(tolower(normMethod) %in% c("none")){
}else{
tilesk$reads <- tilesk$reads * 10^4 / sum(normBy[cellGroupi, 1])
}
}
tilesk <- coverage(tilesk, weight = tilesk$reads)[[availableChr[k]]]
tilesk
})
names(covList) <- availableChr
covList <- as(covList, "RleList")
rtracklayer::export.bw(object = covList, con = covFile)
covFile
}
Thanks @Baboon61 for your work on this. Just want to point out the SplitFragments()
function in Signac that may be a faster solution to splitGroupFragments
here and is written in C++.
I'd be happy to consider a PR implementing bigwig export
Hi,
Great package for analysing scATAC-seq. I was wondering, is there is a way to export the per cluster coverage as wiggle tracks or bigwig file ? While the CoveragePlot function does the job, this is if we need to view it in UCSC genome browser.
Thanks