Error in aggregating two 10X Multiome datasets (GEX+ATAC)

nonafarbehi2018 commented 2 years ago

Hello, I am trying to aggregate 10X multiome data (GEX+ATAC) using Seurat and Signac and got this error

Binding matrix rows
Error in intI(i, n = d[1], dn[[1]], give.dn = FALSE) : 
  invalid character indexing

This is my current R session info

R version 4.0.0 (2020-04-24)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /share/ScratchGeneral/nonfar/.conda/envs/r_4/lib/libopenblasp-r0.3.9.so

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] future_1.21.0        GenomicRanges_1.40.0 GenomeInfoDb_1.24.2 
[4] IRanges_2.22.2       S4Vectors_0.26.1     BiocGenerics_0.34.0 
[7] Signac_1.3.0         SeuratObject_4.0.2   Seurat_4.0.3        

loaded via a namespace (and not attached):
  [1] Rtsne_0.15             colorspace_2.0-2       deldir_0.2-10         
  [4] ellipsis_0.3.2         ggridges_0.5.3         lsa_0.73.2            
  [7] XVector_0.28.0         rstudioapi_0.13        spatstat.data_2.1-0   
 [10] farver_2.1.0           leiden_0.3.8           listenv_0.8.0         
 [13] SnowballC_0.7.0        ggrepel_0.9.1          fansi_0.5.0           
 [16] codetools_0.2-18       splines_4.0.0          docopt_0.7.1          
 [19] RcppRoll_0.3.0         polyclip_1.10-0        jsonlite_1.7.2        
 [22] Rsamtools_2.4.0        ica_1.0-2              cluster_2.1.2         
 [25] png_0.1-7              uwot_0.1.10            ggforce_0.3.3         
 [28] shiny_1.6.0            sctransform_0.3.2      spatstat.sparse_2.0-0 
 [31] compiler_4.0.0         httr_1.4.2             assertthat_0.2.1      
 [34] Matrix_1.3-4           fastmap_1.1.0          lazyeval_0.2.2        
 [37] tweenr_1.0.2           later_1.2.0            htmltools_0.5.1.1     
 [40] tools_4.0.0            igraph_1.2.6           gtable_0.3.0          
 [43] glue_1.4.2             GenomeInfoDbData_1.2.3 RANN_2.6.1            
 [46] reshape2_1.4.4         dplyr_1.0.7            fastmatch_1.1-0       
 [49] Rcpp_1.0.7             slam_0.1-48            scattermore_0.7       
 [52] vctrs_0.3.8            Biostrings_2.56.0      nlme_3.1-152          
 [55] ggseqlogo_0.1          lmtest_0.9-38          stringr_1.4.0         
 [58] globals_0.14.0         mime_0.11              miniUI_0.1.1.1        
 [61] lifecycle_1.0.0        irlba_2.3.3            goftest_1.2-2         
 [64] MASS_7.3-54            zlibbioc_1.34.0        zoo_1.8-9             
 [67] scales_1.1.1           spatstat.core_2.3-0    promises_1.2.0.1      
 [70] spatstat.utils_2.2-0   RColorBrewer_1.1-2     reticulate_1.20       
 [73] pbapply_1.4-3          gridExtra_2.3          ggplot2_3.3.5         
 [76] rpart_4.1-15           stringi_1.7.3          BiocParallel_1.22.0   
 [79] rlang_0.4.11           pkgconfig_2.0.3        matrixStats_0.59.0    
 [82] bitops_1.0-7           qlcMatrix_0.9.7        lattice_0.20-44       
 [85] ROCR_1.0-11            purrr_0.3.4            tensor_1.5            
 [88] patchwork_1.1.1        htmlwidgets_1.5.3      cowplot_1.1.1         
 [91] tidyselect_1.1.1       parallelly_1.27.0      RcppAnnoy_0.0.18      
 [94] plyr_1.8.6             magrittr_2.0.1         R6_2.5.0              
 [97] generics_0.1.0         DBI_1.1.1              pillar_1.6.1          
[100] mgcv_1.8-36            fitdistrplus_1.1-5     survival_3.2-11       
[103] abind_1.4-5            RCurl_1.98-1.3         tibble_3.1.2          
[106] future.apply_1.7.0     crayon_1.4.1           KernSmooth_2.23-20    
[109] utf8_1.2.1             spatstat.geom_2.2-2    plotly_4.9.4.1        
[112] grid_4.0.0             data.table_1.14.0      sparsesvd_0.2         
[115] digest_0.6.27          xtable_1.8-4           tidyr_1.1.3           
[118] httpuv_1.6.1           munsell_0.5.0          viridisLite_0.4.0

and this is the code I used to aggregate the 2 datasets:

library(Seurat)
library(Signac)
library(GenomicRanges)
library(future)
plan("multiprocess", workers = 8)
options(future.globals.maxSize = 50000 * 1024^2) # for 50 Gb RAM
setwd("/mydirectory")

#just read 2 samples for Now. I have done the  QC, filtering, peak calling, Non-linear #dimension reduction and clustering on the data using Seurat and save it as rds
VD0 <- readRDS("VD0_line151.rds")
PD0 <- readRDS("PD0_line151.rds")
# Load objects into a list

# Work with only two for testing
data.list <- list(PD0,VD0)
names(data.list) <- c("PD0", "VD0")
#Prior to finding anchors, we perform standard preprocessing (log-normalization), 
#and identify variable features individually for each. Note that Seurat implements an #improved method for variable feature selection based on a variance stabilizing #transformation ("vst")

processed_data_list <- lapply(names(data.list), function(x){
    obj <- data.list[[x]]
    obj <- NormalizeData(data.list[[x]], verbose = FALSE)
    obj <- FindVariableFeatures(data.list[[x]], selection.method = "vst", nfeatures = 2000, verbose = FALSE)
    return(obj)
})
names(processed_data_list) <- names(data.list)
selected_features <- SelectIntegrationFeatures(object.list = processed_data_list)
data_anchors <- FindIntegrationAnchors(object.list = processed_data_list, anchor.features = selected_features)
# this command creates an 'integrated' data assay
combined_data <- IntegrateData(anchorset = data_anchors)

Many thanks Nona

timoast commented 2 years ago

Can you show the code used to process the two objects (Vd0 and PD0)? Can you also edit the issue title to be more informative?

nonafarbehi2018 commented 2 years ago

Thank you Tim for your response. Here is the code that I used to generate PD0 (it is the same for VD0):

#######
#PD0
#######
library(dplyr)
library(Signac)
library(Seurat)
library(EnsDb.Hsapiens.v86)
library(BSgenome.Hsapiens.UCSC.hg38)
library(patchwork)
library(ggplot2)
library(cowplot)
set.seed(1234)

setwd (c("/10X_3pGEX_ATAC/210701/ATAC/PD0/outs"))
outputdir <- "/nonfar/analysis/multiome_2"
#load RNA and ATAC data
counts <- Read10X_h5("filtered_feature_bc_matrix.h5")
names(counts)
fragpath <- "atac_fragments.tsv.gz"

# get gene annotations for hg38
annotation <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
seqlevelsStyle(annotation) <- "UCSC"
genome(annotation) <- "hg38"

# create a Seurat object containing the RNA adata
pbmc <- CreateSeuratObject(
  counts = counts$`Gene Expression`,
  assay = "RNA"
)

# create ATAC assay and add it to the object
pbmc[["ATAC"]] <- CreateChromatinAssay(
  counts = counts$Peaks,
  sep = c(":", "-"),
  fragments = "atac_fragments.tsv.gz",
  annotation = annotation
)
################
#Quality control
################
DefaultAssay(pbmc) <- "ATAC"
pbmc <- NucleosomeSignal(pbmc)
pbmc <- TSSEnrichment(pbmc)

VlnPlot(
  object = pbmc,
  features = c("nCount_RNA", "nCount_ATAC", "TSS.enrichment", "nucleosome_signal"),
  ncol = 4,
  pt.size = 0
)

# filter out low quality cells
pbmc <- subset(
  x = pbmc,
  subset = nCount_ATAC < 100000 &
    nCount_RNA < 25000 &
    nCount_ATAC > 1000 &
    nCount_RNA > 1000 &
    nucleosome_signal < 2 &
    TSS.enrichment > 1
)
#Peak calling
# call peaks using MACS2
peaks <- CallPeaks(pbmc, macs2.path = "/share/ScratchGeneral/nonfar/.conda/envs/r_4/bin/macs2")

# remove peaks on nonstandard chromosomes and in genomic blacklist regions
peaks <- keepStandardChromosomes(peaks, pruning.mode = "coarse")
peaks <- subsetByOverlaps(x = peaks, ranges = blacklist_hg38_unified, invert = TRUE)

# quantify counts in each peak
macs2_counts <- FeatureMatrix(
  fragments = Fragments(pbmc),
  features = peaks,
  cells = colnames(pbmc)
)

# create a new assay using the MACS2 peak set and add it to the Seurat object
pbmc[["peaks"]] <- CreateChromatinAssay(
  counts = macs2_counts,
  fragments = "atac_fragments.tsv.gz",
  annotation = annotation
)

#Gene expression data processing
DefaultAssay(pbmc) <- "RNA"
pbmc <- SCTransform(pbmc)
pbmc <- RunPCA(pbmc)

#DNA accessibility data processing
DefaultAssay(pbmc) <- "peaks"
pbmc <- FindTopFeatures(pbmc, min.cutoff = 5)
pbmc <- RunTFIDF(pbmc)
pbmc <- RunSVD(pbmc)
DepthCor(pbmc)

#Non-linear dimension reduction and clustering
pbmc <- RunUMAP(object = pbmc, reduction = 'lsi', dims = 2:30)
pbmc <- FindNeighbors(object = pbmc, reduction = 'lsi', dims = 2:30)
pbmc <- FindClusters(object = pbmc, verbose = FALSE, algorithm = 3)
DimPlot(object = pbmc, label = TRUE) + NoLegend()
#Create a gene activity matrix
gene.activities <- GeneActivity(pbmc)
# add the gene activity matrix to the Seurat object as a new assay and normalize it
pbmc[['RNA']] <- CreateAssayObject(counts = gene.activities)
pbmc <- NormalizeData(
  object = pbmc,
  assay = 'RNA',
  normalization.method = 'LogNormalize',
  scale.factor = median(pbmc$nCount_RNA)
)

saveRDS(pbmc, paste0(outdir,"_PD0_line151.rds"))

timoast commented 2 years ago

You need to use the same peak set across the two different ATAC assays that you want to merge. I'd suggest either combining the peak sets as shown here or choosing one set of peaks and quantifying the peaks in both datasets.

Here you're also overwriting the RNA assay (containing the actual gene expression measurements from the multiome experiment) with the gene activity values from the ATAC assay.

It's unclear what assay is being used for anchor-finding, but since you removed the gene expression assay it's either the gene activity or ATAC assay. I would recommend keeping the actual gene expression measurements, and using that RNA assay for integration.

satijalab / seurat

Error in aggregating two 10X Multiome datasets (GEX+ATAC) #4857