stuart-lab / signac

R toolkit for the analysis of single-cell chromatin data
https://stuartlab.org/signac/
Other
327 stars 88 forks source link

Merging of 2 sample datasets #45

Closed LooLipin closed 4 years ago

LooLipin commented 4 years ago

Hi Tim,

I have 2 samples of different treatment conditions merged. I would like to do differential peak analysis on this merged object. Using instructions from the FAQ (https://satijalab.org/signac/articles/faq.html), I merged fragments from my 2 samples and then utilized FeatureMatrix to generate a feature x cell matrix, with unified peaks. Weirdly the number of cells decreases after I generate the matrix. Here is what I have done so far:

library(Signac)
library(Seurat)
library(GenomeInfoDb)
library(EnsDb.Mmusculus.v79)
library(ggplot2)
set.seed(1234)

#read Sham files and create Sham object
counts_sham <- Read10X_h5("/Documents/Sham/outs/filtered_peak_bc_matrix.h5")
metadata_sham <- read.csv(
     file = "/Documents/Sham/outs/singlecell.csv",header = TRUE,
     row.names = 1
     )

spinalcord_sham <- CreateSeuratObject(
    counts = counts_sham,
    assay = 'peaks',
     project = 'sham',
     min.cells = 1,
     meta.data = metadata_sham
 )

"Warning message:
In CreateSeuratObject(counts = counts_sham, assay = "peaks", project = "sham",  :
  Some cells in meta.data not present in provided counts matrix."

#read SNI files and create SNI object
counts_SNI <- Read10X_h5("/Documents/SNI/outs/filtered_peak_bc_matrix.h5")
metadata_SNI <- read.csv(
     file = "/Documents/SNI/outs/singlecell.csv",header = TRUE,
     row.names = 1
     )

spinalcord_SNI <- CreateSeuratObject(
    counts = counts_SNI,
    assay = 'peaks',
     project = 'SNI',
     min.cells = 1,
     meta.data = metadata_SNI
 )

"Warning message:
In CreateSeuratObject(counts = counts_sham, assay = "peaks", project = "sham",  :
  Some cells in meta.data not present in provided counts matrix."

spinalcord_sham
"An object of class Seurat 
72986 features across 1930 samples within 1 assay 
Active assay: peaks (72986 features)"

spinalcord_SNI

"An object of class Seurat 
75997 features across 5665 samples within 1 assay 
Active assay: peaks (75997 features)"

#merge objects
merged.obj <- MergeWithRegions(
  object.1 = spinalcord_sham,
  object.2 = spinalcord_SNI,
  sep.1 = c(":", "-"),
  sep.2 = c(":", "-")
)

merged.obj
"An object of class Seurat 
66013 features across 7595 samples within 1 assay 
Active assay: peaks (66013 features)"

#set fragment path for merged fragment file (generated with bgzip and tabix)
fragment.path <- "fragments_merged.tsv.gz"
merged.obj <- SetFragments(object = merged.obj, file = fragment.path)

#extract gene coordinates from Ensembl, and ensure name formatting is consistent with Seurat object
gene.coords <- genes(EnsDb.Mmusculus.v79, filter = ~ gene_biotype == "protein_coding")
seqlevelsStyle(gene.coords) <- 'UCSC'
genebody.coords <- keepStandardChromosomes(gene.coords, pruning.mode = 'coarse')
genebodyandpromoter.coords <- Extend(x = gene.coords, upstream = 2000, downstream = 0)

#create a gene by cell matrix
gene.activities <- FeatureMatrix(
fragments = fragment.path,
features = genebodyandpromoter.coords,
cells = colnames(merged.obj),
chunk = 10
)

"Extracting reads overlapping genomic regions
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=13m 20s
Constructing matrix"

#qc
merged.obj <- NucleosomeSignal(object = merged.obj)
merged.obj$pct_reads_in_peaks <- merged.obj$peak_region_fragments / merged.obj$passed_filters * 100

"Error in names(data.return) <- rep.int(x = colnames(x = x), times = length(x = i)) : 
  'names' attribute [7595] must be the same length as the vector [7583]"

#I checked the dimensions for gene.activities and the number of cells is less than the merged object
dim(gene.activities)
[1] 21532  7583

It looks like I'm losing cells after the gene coordinate extraction step, decreasing from 7595 to 7583. Please advise.

Thanks, Lipin

timoast commented 4 years ago

Hi Lipin,

This could happen if some cells had zero counts for all genes. I fixed this a few weeks ago so that the cell is still retained (https://github.com/timoast/signac/commit/7053478d81d78d5efb079b2037f6d8df71b0d15b) , can you try this using the version on the current develop branch?

Alternatively, you could have cells with the same name in the two objects, which are then being counted as one cell when creating the matrix from the fragment file. You can check if there is a cell name clash by running:

sum(colnames(spinalcord_sham) %in% colnames(spinalcord_SNI))

If this doesn't return 0, then there are cells with the same name in the two objects. If so, you will need to rename the cells before merging the fragment files. For example:

# in R
# first add prefix to cell names in Seurat objects
spinalcord_sham <- RenameCells(object = spinalcord_sham, add.cell.id = "sham")
spinalcord_SNI <- RenameCells(object = spinalcord_SNI, add.cell.id = "SNI")
# on command line
# add same cell prefix to the fragment files on disk
zcat fragments_sham.tsv.bgz | awk 'BEGIN {FS=OFS="\t"} {print $1,$2,$3,"sham_"$4,$5}' - > fragments_prefixed_sham.tsv
zcat fragments_sni.tsv.bgz | awk 'BEGIN {FS=OFS="\t"} {print $1,$2,$3,"SNI_"$4,$5}' - > fragments_prefixed_sni.tsv

# merge fragment files
sort -m -k 1,1V -k2,2n fragments_prefixed_sham.tsv fragments_prefixed_sni.tsv > fragments_merged.tsv

# block gzip
bgzip -@ 4 fragments_merged.tsv

# index
tabix --preset=bed fragments_merged.tsv.gz
LooLipin commented 4 years ago

Hi Tim,

Thanks for the suggestion. That worked like a charm.

However, I've run into a different problem downstream. When I try to add the RNA gene activity matrix as RNA assay, I get the error message:

merged.obj[['RNA']]` <- CreateAssayObject(counts = gene.activities)
"Warning: Non-unique features (rownames) present in the input matrix, making unique
Error: Feature names of counts matrix cannot be empty"

When I look closer at the gene.activities matrix, I observe one rowname missing, under the gene Boll (see attached screenshot). This is probably the empty feature name causing the error. I was wondering if this empty feature name is a byproduct of making unique features? And if so, is there a fix?

Thanks, Lipin

Screen Shot 2019-12-10 at 1 21 08 pm
timoast commented 4 years ago

I don't think this is a byproduct of making the feature names unique, you likely have a feature with an Ensemble ID but no gene name (assuming you're following the examples here: https://satijalab.org/signac/articles/mouse_brain_vignette.html#create-a-gene-activity-matrix)

If you run:

gene.activities <- gene.activities[rownames(gene.activities)!="",]

That will remove the row from the matrix.

timoast commented 4 years ago

Hi Lipin, are you sure you're running this on the right assay? Can you post the full code you're using?

LooLipin commented 4 years ago

Hi Tim, I deleted the last question. There was a typo in one of my objects.

timoast commented 4 years ago

Ok no worries

LooLipin commented 4 years ago

Hi Tim, I have a follow up question regarding predicted id label transfer.

Only 20 of my 51 RNA clusters were successfully transferred but the labels on my Dimplots show all predicted.ids. There are less cells in my ATAC assay than in the RNA so I believe the 20 transferred labels is true. Is there any way I could change the label on the Dimplots to match the 20 transferred labels?

merged.obj_rna <-readRDS("/scratch/scRNAseq/Seurat/SCT_151119/spinalcord.rds")
merged.obj_rna <- FindVariableFeatures(
  object = merged.obj_rna,
  nfeatures = 5000
)

transfer.anchors <- FindTransferAnchors(
  reference = merged.obj_rna,
  query = merged.obj,
  reduction = 'cca',
  dims = 1:40
)

predicted.labels <- TransferData(
  anchorset = transfer.anchors,
  refdata = merged.obj_rna$subclass,
  weight.reduction = merged.obj[['lsi']],
  dims = 1:40
)

merged.obj <- AddMetaData(object = merged.obj, metadata = predicted.labels)

pdf("DimPlot_RNAvsATAC.pdf", width =12)
plot1 <- DimPlot(merged.obj_rna, group.by = 'subclass', label = TRUE, repel = TRUE) + NoLegend() + ggtitle('scRNA-seq')
plot2 <- DimPlot(merged.obj, group.by = 'predicted.id', label = TRUE, repel = TRUE) + NoLegend() + ggtitle('scATAC-seq')
CombinePlots(list(plot1,plot2), ncol = 2)
dev.off()

for(i in levels(merged.obj)) {
  cells_to_reid <- WhichCells(merged.obj, idents = i)
  newid <- names(sort(table(merged.obj$predicted.id[cells_to_reid]),decreasing=TRUE))[1]
  Idents(merged.obj, cells = cells_to_reid) <- newid
}

head(Idents(merged.obj))
" sham_AAACGAAAGGGAAATG-1 sham_AAACGAACAAGCAGGT-1 sham_AAACGAAGTCAATCCA-1 
                     E5                      E2                      I1 
sham_AAACGAATCCATGTTT-1 sham_AAACTCGAGTCAGGAC-1 sham_AAACTCGCATCCTCGT-1 
                     I2                      E4                      I1 
20 Levels: Low2 E15 I17 E14 I15 E7 E13 E10 I8 I11 E6 E4 E8 E3 I5 E2 E5 ... I1"

DimPlot_RNAvsATAC.pdf

timoast commented 4 years ago

I'm not quite sure what you mean. What do you mean by "20 of my 51 RNA clusters were successfully transferred"? Each cell will get a prediction score for each class in the transferred labels. It looks like the label transfer worked well for clusters that are well-resolved in your scATAC-seq. You can remove cells with low prediction scores or set their classification to unknown with the following:

merged.obj$class <- ifelse(merged.obj$prediction.score.max > 0.5, merged.obj$predicted.id, "unknown")
LooLipin commented 4 years ago

There is a discrepancy in the number of IDs on the dimplots vs what I see in the levels (newid) of my merged.obj. If all 51 clusters were transferred successfully, I would expect to see 51 levels in Idents(merged.obj) but there are only 20. I don't quite understand how Dimplot manages to assign all 51 cluster ids onto the scATACseq plot when there are only 20 predicted IDs. Does this mean that some of the cells that currently have low predicted scores are being forced into the 20 IDs?

timoast commented 4 years ago

@LooLipin Sorry I still don't follow what you mean. Each cell will get a prediction score for each class in the transferred labels. If there are 51 different classes in the transferred labels, then each cell will have a score for each of the 51 classes. It's possible that for some of the classes there are no cells for which that class gave the highest score, so there may be less than 51 classes when you look at the predicted.id for each cell.

I don't understand the discrepancy between DimPlot and the levels you're mentioning. However, this is now quite unrelated to the original issue you raised here, so if you're still having issues with this please open a new issue that clearly explains this problem.