Open jeremymsimon opened 1 year ago
@paupaiz possibly, could you briefly describe more? My intention here is to be able to compare accessibility between experimental conditions (e.g. WT and mutant) where we have multiple replicates of each. The test would produce cluster-by-cluster results of genomic regions, log-fold-changes, and adjusted p-values.
The ArchR documentation on this here uses an example of one cell type vs another (ie finding marker regions that are specific to one cell type), which IMO is a slightly different problem, but as long as it does this in a replicate-aware fashion it could also work?
Could be downstream of #1
Work through multi-sample multi-condition differential test using pseudobulked counts, a la a hybrid of
muscat
andcsaw
For multilevel pseudobulk analyses, tidybulk
has incorporated glmmSeq
and optimised for very large datasets.
see ?test_differential_abundance
in the new github release
HPCell
(https://github.com/stemangiola/HPCell), in development but functioning, scales multilevel analyses, to the cluster, with no coding overload. We will announce this functionality soon.
@stemangiola @jeremymsimon I developed condition-aware pseudobulking for ArchR motivated by this paper We will include this in the next release (1.0.3). Let me know if it would be useful to have an example here!
I have working example using Signac and Deseq2. Am I eligible to contribute/provide the code? I am sorry for the basic question but could not find anything in the guidelines for contribution.
@jeremymsimon, @stemangiola, @paupaiz What is your opinion - is there any benefit to go with SIgnac/DEseq2 example?
@AmelZulji would love to check out your post so I can answer!
Hi @paupaiz, @jeremymsimon, @stemangiola
Here is the reproducible example the code:
# create directory for downloading files
dir.create("tmp")
# downmload files in the tmp directory
download.file(url = "https://cf.10xgenomics.com/samples/cell-atac/1.0.1/atac_v1_pbmc_10k/atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5",
destfile = "tmp/atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5")
download.file(url = "https://cf.10xgenomics.com/samples/cell-atac/1.0.1/atac_v1_pbmc_10k/atac_v1_pbmc_10k_singlecell.csv",
destfile = "tmp/atac_v1_pbmc_10k_singlecell.csv")
download.file(url = "https://cf.10xgenomics.com/samples/cell-atac/1.0.1/atac_v1_pbmc_10k/atac_v1_pbmc_10k_fragments.tsv.gz",
destfile = "tmp/atac_v1_pbmc_10k_fragments.tsv.gz")
download.file(url = "https://cf.10xgenomics.com/samples/cell-atac/1.0.1/atac_v1_pbmc_10k/atac_v1_pbmc_10k_fragments.tsv.gz.tbi",
destfile = "tmp/atac_v1_pbmc_10k_fragments.tsv.gz.tbi")
library(Seurat) # for convinient functions
library(Signac) # for handling ATAC data
# construct assay with ATAC data
counts <- Read10X_h5(filename = "tmp/atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5")
metadata <- read.csv(
file = "tmp/atac_v1_pbmc_10k_singlecell.csv",
header = TRUE,
row.names = 1
)
chrom_assay <- CreateChromatinAssay(
counts = counts,
sep = c(":", "-"),
fragments = 'tmp/atac_v1_pbmc_10k_fragments.tsv.gz',
min.cells = 10,
min.features = 200
)
pbmc <- CreateSeuratObject(
counts = chrom_assay,
assay = "peaks",
meta.data = metadata
)
# a basic processing adapted from https://stuartlab.org/signac/articles/pbmc_vignette
# extract gene annotations from EnsDb
library(EnsDb.Hsapiens.v86)
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
# change to UCSC style since the data was mapped to hg19
seqlevels(annotations) <- paste0('chr', seqlevels(annotations))
genome(annotations) <- "hg19"
# add the gene information to the object
Annotation(pbmc) <- annotations
pbmc <- subset(x = pbmc, subset = nCount_peaks > 3000 & nCount_peaks < 30000)
pbmc <- RunTFIDF(pbmc)
pbmc <- FindTopFeatures(pbmc, min.cutoff = 'q0')
pbmc <- RunSVD(pbmc)
pbmc <- RunUMAP(object = pbmc, reduction = 'lsi', dims = 2:30)
pbmc <- FindNeighbors(object = pbmc, reduction = 'lsi', dims = 2:30)
pbmc <- FindClusters(object = pbmc, verbose = FALSE, algorithm = 3, resolution = 0.1)
DimPlot(object = pbmc, label = TRUE) + NoLegend()
pbmc[["cell_type"]] <- Idents(pbmc)
# subset to speed up the process
pbmc_sub <- subset(pbmc, cell_type %in% c(0,1))
# add column with donor, sex and condition to simulate groups for testing
obj_meta <- pbmc_sub@meta.data
obj_meta[["donor"]] <- sample(x = paste0("donor_", 1:4), size = nrow(obj_meta), replace = T)
obj_meta[["barcode"]] <- rownames(obj_meta)
df <-
tibble::tibble(
donor = unique(obj_meta$donor),
condition = rep(c("Control", "Disease"), c(2,2)),
sex = rep(c("F", "M"), c(2))
)
obj_meta <- dplyr::left_join(obj_meta, df)
rownames(obj_meta) <- obj_meta$barcode
pbmc_sub@meta.data <- obj_meta
# aggregate counts by summing up counts
pb_counts <- AggregateExpression(pbmc_sub, assays = "peaks", slot = "counts",group.by = "donor")$peaks
# adjust metadata
pb_meta <- pbmc_sub@meta.data[,c("condition", "sex", "donor")] |> unique()
rownames(pb_meta) <- pb_meta$donor
library(DESeq2)
# match ordering in metadata and count data
sample_meta <- pb_meta[match(colnames(pb_counts), rownames(pb_meta)),]
dds <- DESeqDataSetFromMatrix(pb_counts,
colData = sample_meta,
design = ~ sex + condition)
dds <- DESeq(dds)
res <- results(dds) |> as.data.frame()
head(res)
Thanks @AmelZulji this is a helpful place to start and may work for many cases. When I first posted this challenge I was envisioning including a local background correction as well like csaw
does with broader sliding windows. What do you all think about this? Would it be easy to compute and construct an equivalent test here for scATAC data?
Thanks @jeremymsimon, in my opinion, the correction in csaw
is needed because of the way how it counts (divide genome on equal bins and count within each bin). In ATAC from the other hand, data are aggregated from all cells in order to increase singal-to-noise ratio and then empirical peaks are called for the dataset.
The only concern might be "double dipping" as mentioned at the end of the first paragraph in this chapter of csaw
book https://bioconductor.org/books/3.13/csawBook/counting-reads-into-windows.html#background and as explained in this paper https://www.biorxiv.org/content/10.1101/2023.07.21.550107v1.full.pdf (though for single cell data and in genral related more to clusters markers rather than DEGs between condition within the same cluster).
Please correct me if im wrong or misunderstood your points
Any comments on this @jeremymsimon ? I would be interested to work it further on this if you have suggestion.
Regards, Amel
Hi @AmelZulji I'm not a statistician but I think these are separate issues. The issue regarding "double-dipping" is related to constructing your peak set in a condition-aware fashion, over which the differential test will then be conducted. In other words, if you call peaks in ConditionA, then call peaks in ConditionB, then merge into a union set and test over those windows, you risk losing error control.
csaw
's binning (like 10kb windows) would be separate from this and is a way of correcting for composition bias, which I suspect may be even more important in scATAC-seq data given how sparse the true enrichment is.
What I'm proposing here is to do the following:
csaw
does, incorporating the 10kb window counts as background correction@stemangiola apparently has methods (mentioned above) for doing some of this, but IMO it would be nice to work through a solution that includes the local background/composition bias correction that broad window counts provides
None of this has been published before for scATAC-seq, AFAIK, so this would provide a means for us to evaluate whether there is any benefit to testing in this fashion and context
Could be downstream of https://github.com/tidyomics/genomics-todos/issues/1
Work through multi-sample multi-condition differential test using pseudobulked counts, a la a hybrid of
muscat
andcsaw