Example of scATAC-seq pseudobulk analysis for differential accessibility testing

jeremymsimon commented 1 year ago

Could be downstream of https://github.com/tidyomics/genomics-todos/issues/1

Work through multi-sample multi-condition differential test using pseudobulked counts, a la a hybrid of muscat and csaw

paupaiz commented 1 year ago

Can I use ArchR for this example?

jeremymsimon commented 1 year ago

@paupaiz possibly, could you briefly describe more? My intention here is to be able to compare accessibility between experimental conditions (e.g. WT and mutant) where we have multiple replicates of each. The test would produce cluster-by-cluster results of genomic regions, log-fold-changes, and adjusted p-values.

The ArchR documentation on this here uses an example of one cell type vs another (ie finding marker regions that are specific to one cell type), which IMO is a slightly different problem, but as long as it does this in a replicate-aware fashion it could also work?

stemangiola commented 1 year ago

Could be downstream of #1

Work through multi-sample multi-condition differential test using pseudobulked counts, a la a hybrid of muscat and csaw

For multilevel pseudobulk analyses, tidybulk has incorporated glmmSeq and optimised for very large datasets.

https://github.com/stemangiola/tidybulk/blob/512a534bb3bc6e933a78ebbfa1bfaea6d2f6cb38/tests/testthat/test-bulk_methods_SummarizedExperiment.R#L481

see ?test_differential_abundance in the new github release

HPCell (https://github.com/stemangiola/HPCell), in development but functioning, scales multilevel analyses, to the cluster, with no coding overload. We will announce this functionality soon.

paupaiz commented 1 year ago

@stemangiola @jeremymsimon I developed condition-aware pseudobulking for ArchR motivated by this paper We will include this in the next release (1.0.3). Let me know if it would be useful to have an example here!

AmelZulji commented 12 months ago

I have working example using Signac and Deseq2. Am I eligible to contribute/provide the code? I am sorry for the basic question but could not find anything in the guidelines for contribution.

AmelZulji commented 11 months ago

@jeremymsimon, @stemangiola, @paupaiz What is your opinion - is there any benefit to go with SIgnac/DEseq2 example?

paupaiz commented 11 months ago

@AmelZulji would love to check out your post so I can answer!

AmelZulji commented 11 months ago

Hi @paupaiz, @jeremymsimon, @stemangiola

Here is the reproducible example the code:

# create directory for downloading files
dir.create("tmp")

# downmload files in the tmp directory
download.file(url = "https://cf.10xgenomics.com/samples/cell-atac/1.0.1/atac_v1_pbmc_10k/atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5",
              destfile = "tmp/atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5")

download.file(url = "https://cf.10xgenomics.com/samples/cell-atac/1.0.1/atac_v1_pbmc_10k/atac_v1_pbmc_10k_singlecell.csv",
              destfile = "tmp/atac_v1_pbmc_10k_singlecell.csv")

download.file(url = "https://cf.10xgenomics.com/samples/cell-atac/1.0.1/atac_v1_pbmc_10k/atac_v1_pbmc_10k_fragments.tsv.gz",
              destfile = "tmp/atac_v1_pbmc_10k_fragments.tsv.gz")

download.file(url = "https://cf.10xgenomics.com/samples/cell-atac/1.0.1/atac_v1_pbmc_10k/atac_v1_pbmc_10k_fragments.tsv.gz.tbi",
              destfile = "tmp/atac_v1_pbmc_10k_fragments.tsv.gz.tbi")

library(Seurat) # for convinient functions
library(Signac) # for handling ATAC data

# construct assay with ATAC data
counts <- Read10X_h5(filename = "tmp/atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5")
metadata <- read.csv(
  file = "tmp/atac_v1_pbmc_10k_singlecell.csv",
  header = TRUE,
  row.names = 1
)

chrom_assay <- CreateChromatinAssay(
  counts = counts,
  sep = c(":", "-"),
  fragments = 'tmp/atac_v1_pbmc_10k_fragments.tsv.gz',
  min.cells = 10,
  min.features = 200
)

pbmc <- CreateSeuratObject(
  counts = chrom_assay,
  assay = "peaks",
  meta.data = metadata
)

# a basic processing adapted from https://stuartlab.org/signac/articles/pbmc_vignette 

# extract gene annotations from EnsDb
library(EnsDb.Hsapiens.v86)
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)

# change to UCSC style since the data was mapped to hg19
seqlevels(annotations) <- paste0('chr', seqlevels(annotations))
genome(annotations) <- "hg19"

# add the gene information to the object
Annotation(pbmc) <- annotations
pbmc <- subset(x = pbmc, subset = nCount_peaks > 3000 &  nCount_peaks < 30000)
pbmc <- RunTFIDF(pbmc)
pbmc <- FindTopFeatures(pbmc, min.cutoff = 'q0')
pbmc <- RunSVD(pbmc)
pbmc <- RunUMAP(object = pbmc, reduction = 'lsi', dims = 2:30)
pbmc <- FindNeighbors(object = pbmc, reduction = 'lsi', dims = 2:30)
pbmc <- FindClusters(object = pbmc, verbose = FALSE, algorithm = 3, resolution = 0.1)
DimPlot(object = pbmc, label = TRUE) + NoLegend()
pbmc[["cell_type"]] <- Idents(pbmc)

# subset to speed up the process
pbmc_sub <- subset(pbmc, cell_type %in% c(0,1))

# add column with donor, sex and condition to simulate groups for testing
obj_meta <- pbmc_sub@meta.data
obj_meta[["donor"]] <- sample(x = paste0("donor_", 1:4), size = nrow(obj_meta), replace = T)
obj_meta[["barcode"]] <- rownames(obj_meta)

df <-
  tibble::tibble(
    donor = unique(obj_meta$donor),
    condition = rep(c("Control", "Disease"), c(2,2)),
    sex = rep(c("F", "M"), c(2))
  )

obj_meta <- dplyr::left_join(obj_meta, df)
rownames(obj_meta) <- obj_meta$barcode
pbmc_sub@meta.data <- obj_meta

# aggregate counts by summing up counts
pb_counts <- AggregateExpression(pbmc_sub, assays = "peaks", slot = "counts",group.by = "donor")$peaks

# adjust metadata
pb_meta <- pbmc_sub@meta.data[,c("condition", "sex", "donor")] |> unique()
rownames(pb_meta) <- pb_meta$donor

library(DESeq2)

# match ordering in metadata and count data
sample_meta <- pb_meta[match(colnames(pb_counts), rownames(pb_meta)),]

dds <- DESeqDataSetFromMatrix(pb_counts, 
                              colData = sample_meta, 
                              design = ~ sex + condition)
dds <- DESeq(dds)

res <- results(dds) |> as.data.frame()
head(res)

jeremymsimon commented 11 months ago

Thanks @AmelZulji this is a helpful place to start and may work for many cases. When I first posted this challenge I was envisioning including a local background correction as well like csaw does with broader sliding windows. What do you all think about this? Would it be easy to compute and construct an equivalent test here for scATAC data?

AmelZulji commented 11 months ago

Thanks @jeremymsimon, in my opinion, the correction in csawis needed because of the way how it counts (divide genome on equal bins and count within each bin). In ATAC from the other hand, data are aggregated from all cells in order to increase singal-to-noise ratio and then empirical peaks are called for the dataset.

The only concern might be "double dipping" as mentioned at the end of the first paragraph in this chapter of csaw book https://bioconductor.org/books/3.13/csawBook/counting-reads-into-windows.html#background and as explained in this paper https://www.biorxiv.org/content/10.1101/2023.07.21.550107v1.full.pdf (though for single cell data and in genral related more to clusters markers rather than DEGs between condition within the same cluster).

Please correct me if im wrong or misunderstood your points

AmelZulji commented 11 months ago

Any comments on this @jeremymsimon ? I would be interested to work it further on this if you have suggestion.

Regards, Amel

jeremymsimon commented 11 months ago

Hi @AmelZulji I'm not a statistician but I think these are separate issues. The issue regarding "double-dipping" is related to constructing your peak set in a condition-aware fashion, over which the differential test will then be conducted. In other words, if you call peaks in ConditionA, then call peaks in ConditionB, then merge into a union set and test over those windows, you risk losing error control.

csaw's binning (like 10kb windows) would be separate from this and is a way of correcting for composition bias, which I suspect may be even more important in scATAC-seq data given how sparse the true enrichment is.

What I'm proposing here is to do the following:

Pseudobulk signals such that you have a matrix of all peaks (and/or small windows) by all replicates/conditions, for each scATAC cluster. If using peaks, they would have been already identified in a condition-agnostic fashion, at least if you follow the typical published workflows
In parallel, construct 10kb (?) windows and compute those counts
Test for DA just like csaw does, incorporating the 10kb window counts as background correction

@stemangiola apparently has methods (mentioned above) for doing some of this, but IMO it would be nice to work through a solution that includes the local background/composition bias correction that broad window counts provides

None of this has been published before for scATAC-seq, AFAIK, so this would provide a means for us to evaluate whether there is any benefit to testing in this fashion and context

tidyomics / genomics-todos

Example of scATAC-seq pseudobulk analysis for differential accessibility testing #11