Motif analysis results - Githubissues

alvingao90 commented 3 years ago

Hi Tim, Thanks for your excellent program for scATAC-seq analysis. Here I have a question about the motif analysis with Signac. I focus on cattle and there are 4 samples total. Below is the full code I used.

#=======================================
##Step 1. Creating a common peak set
library(Signac)
library(Seurat)
library(GenomicRanges)
library(future)

#Read in peak sets
peaks.co <- read.table("/sample1/peaks.bed", col.names = c("chr", "start", "end"))
peaks.t1 <- read.table("/sample2/peaks.bed", col.names = c("chr", "start", "end"))
peaks.t2 <- read.table("/sample3/peaks.bed", col.names = c("chr", "start", "end"))
peaks.t3 <- read.table("/sample4/peaks.bed", col.names = c("chr", "start", "end"))

#Convert to genomic ranges
gr.co <- makeGRangesFromDataFrame(peaks.co)
gr.t1 <- makeGRangesFromDataFrame(peaks.t1)
gr.t2 <- makeGRangesFromDataFrame(peaks.t2)
gr.t3 <- makeGRangesFromDataFrame(peaks.t3)

#Create a unified set of peaks to quantify in each dataset
combined.peaks <- reduce(x = c(gr.co, gr.t1, gr.t2, gr.t3))

#Filter out bad peaks based on length
peakwidths <- width(combined.peaks)
combined.peaks <- combined.peaks[peakwidths  < 10000 & peakwidths > 20]

##Step 2. Create Fragment objects
md.co <- read.table("/sample1/singlecell.csv", stringsAsFactors = F, sep = ",", header = T, row.names = 1)[-1, ]
md.t1 <- read.table("/sample2/singlecell.csv", stringsAsFactors = F, sep = ",", header = T, row.names = 1)[-1, ]
md.t2 <- read.table("/sample3/singlecell.csv", stringsAsFactors = F, sep = ",", header = T, row.names = 1)[-1, ]
md.t3 <- read.table("/sample4/singlecell.csv", stringsAsFactors = F, sep = ",", header = T, row.names = 1)[-1, ]

#perform an initial filtering of low count cells
md.co <- md.co[md.co$passed_filters > 500, ]
md.t1 <- md.t1[md.t1$passed_filters > 500, ]
md.t2 <- md.t2[md.t2$passed_filters > 500, ]
md.t3 <- md.t3[md.t3$passed_filters > 500, ]

#create fragment objects
frags.co <- CreateFragmentObject(path = "/sample1/fragments.tsv.gz", cells = rownames(md.co))
frags.t1 <- CreateFragmentObject(path = "/sample2/fragments.tsv.gz", cells = rownames(md.t1))
frags.t2 <- CreateFragmentObject(path = "/sample3/fragments.tsv.gz", cells = rownames(md.t2))
frags.t3 <- CreateFragmentObject(path = "/sample4/fragments.tsv.gz", cells = rownames(md.t3))

##Step 3. Quantify peaks in each dataset
pbmcco.counts <- FeatureMatrix(fragments = frags.co, features = combined.peaks, cells = rownames(md.co))
pbmct1.counts <- FeatureMatrix(fragments = frags.t1, features = combined.peaks, cells = rownames(md.t1))
pbmct2.counts <- FeatureMatrix(fragments = frags.t2, features = combined.peaks, cells = rownames(md.t2))
pbmct3.counts <- FeatureMatrix(fragments = frags.t3, features = combined.peaks, cells = rownames(md.t3))

##Step 4. Create the objects
pbmcco_assay <- CreateChromatinAssay(pbmcco.counts, fragments = frags.co)
pbmcco <- CreateSeuratObject(pbmcco_assay, assay = "ATAC")
pbmct1_assay <- CreateChromatinAssay(pbmct1.counts, fragments = frags.t1)
pbmct1 <- CreateSeuratObject(pbmct1_assay, assay = "ATAC")
pbmct2_assay <- CreateChromatinAssay(pbmct2.counts, fragments = frags.t2)
pbmct2 <- CreateSeuratObject(pbmct2_assay, assay = "ATAC")
pbmct3_assay <- CreateChromatinAssay(pbmct3.counts, fragments = frags.t3)
pbmct3 <- CreateSeuratObject(pbmct3_assay, assay = "ATAC")

##Step 5. Merge objects
#add information to identify dataset of origin
pbmcco$dataset <- 'pbmcco'
pbmct1$dataset <- 'pbmct1'
pbmct2$dataset <- 'pbmct2'
pbmct3$dataset <- 'pbmct3'

#merge all datasets, adding a cell ID to make sure cell names are unique
combined <- merge(x = pbmcco, y = list(pbmct1, pbmct2, pbmct3), add.cell.ids = c("co", "t1", "t2", "t3"))
combined[["ATAC"]]
combined <- RunTFIDF(combined)
combined <- FindTopFeatures(combined, min.cutoff = 20)
combined <- RunSVD(combined)
combined <- RunUMAP(combined, dims = 2:50, reduction = 'lsi')
combined <- FindNeighbors(object = combined, reduction = 'lsi', dims = 2:10)
combined <- FindClusters(object = combined, verbose = FALSE, algorithm = 3)

##Step 6. Motif analysis
library(JASPAR2020)
library(TFBSTools)
library(BSgenome.Btaurus.UCSC.bosTau9)
library(patchwork)
set.seed(1234)

DefaultAssay(combined) <- 'ATAC'
class(combined[["ATAC"]])

pfm <- getMatrixSet(x = JASPAR2020, opts = list(species = 9606, all_versions = F))
combined <- AddMotifs(object = combined, genome = BSgenome.Btaurus.UCSC.bosTau9, pfm = pfm)
da_peaks <- FindMarkers(object = combined, ident.1 = 1, ident.2 = 2, only.pos = T, test.use = 'LR', latent.vars = 'nCount_ATAC')
top.da.peak <- rownames(da_peaks[da_peaks$p_val < 0.005, ])
enriched.motifs <- FindMotifs(object = combined, features = top.da.peak)
#=======================================

What confuses me is the ‘da_peaks’ results obtained from the function ‘FindMarkers’. After the above steps, I just got one region. This result seems to be unreliable. Could you please help to check my codes and see if there is anything wrong with them?

Thanks, Yahui

timoast commented 3 years ago

The code looks ok to me. Did you assess whether the thresholds used here for filtering cells are appropriate for your dataset? If they were too low, for example, you could have a lot of empty droplets in the dataset which could cause meaningless clusters and an inability to find differential peaks between them. I recommend looking at the distribution of total counts, as well as other QC metrics, and deciding on reasonable cutoffs to use for your dataset.

alvingao90 commented 3 years ago

Thanks for your reply, Tim. As your suggestion, I lower the filtering standards and set mt.xx$passed_filters from 500 to 10. The result numbers of regions are still very few. It seems to be my data problem. I'll check them out continually.

timoast commented 3 years ago

I was suggesting increasing the threshold

alvingao90 commented 3 years ago

Oops. Sorry for my mistake. Now I increase the 'passed_filters' to 1000 and use the function 'FindAllMarkers' to find overrepresented motifs between different clusters (command: da_peaks <- FindAllMarkers(object = combined, only.pos = T, test.use = 'LR', latent.vars = 'nCount_ATAC')) and got more regions than before. Thanks. Btw, one thing I need to point is about the 'pfm'. Since the species I focused on is livestock instead of human, the motif position frequency matrices from the JASPAR database should be vertebrates and the 'pfm' should be obtained like below pfm <- getMatrixSet(x = JASPAR2020, opts = list(tax_group = 'vertebrates', all_versions = F)). Is that right?

timoast commented 3 years ago

Yes, you should use whichever PFM database makes the most sense for your experiment

stuart-lab / signac

Motif analysis results #348