Freezing/Crashing in RStudio when reading in Multiomics data

zgb963 commented 1 month ago

Hi Signac team,

I've been trying to run the Joint RNA and ATAC analysis: 10x multiomic tutorial with my multiomics data that I processed with the 10x Genomics Cell Ranger ARC pipeline, but it's taking a very long time (hours & days) to just read in the atac__fragments.tsv.gz file, and it seems to be stuck on creating the ATAC assay and adding it to the Seurat object. For reference, the genome FASTA I used (downloaded from UCSC Genome Browser) to map the reads is from Macaca fascicularis. The genome annotation GTF (which has human hg38 gene names and macfas5 coordinates) was made with the software Liftoff.

I was running this on an HPC system with 32 cpu cores and 12000MB of memory per core.

Right now I have 7 samples. I don't have any issues reading the count matrices for each sample.

liftoff_1_MI5_V1_matrix<-Read10X_h5("~/1_MI5_V1_macfas5liftoff_filtered_feature_bc_matrix.h5")
liftoff_2_MI5_V1_matrix<-Read10X_h5("~/2_MI5_V1_macfas5liftoff_filtered_feature_bc_matrix.h5")
liftoff_4_MI5_V1_matrix<-Read10X_h5("~/4_MI5_V1_macfas5liftoff_filtered_feature_bc_matrix.h5")
liftoff_5_MI5_V1_matrix<-Read10X_h5("~/5_MI5_V1_macfas5liftoff_filtered_feature_bc_matrix.h5")
liftoff_6_MI5_V1_matrix<-Read10X_h5("~/6_MI5_V1_macfas5liftoff_filtered_feature_bc_matrix.h5")
liftoff_7_MI5_V1_matrix<-Read10X_h5("~/7_MI5_V1_macfas5liftoff_filtered_feature_bc_matrix.h5")
liftoffe_8_MI5_V1_matrix<-Read10X_h5("~/8_MI5_V1_macfas5liftoff_filtered_feature_bc_matrix.h5")

However, when I'm reading in the atac fragments files (which are ~1-3GB each), it takes a long time to load in and sometimes it takes so long that it crashes RStudio. Is there a better function I can use to read in these files faster?

liftoff_1_MI5_V1_frags<-read.delim("~/1_MI5_V1_macfas5liftoff_atac_fragments.tsv.gz", header = FALSE, sep = "\t")
liftoff_2_MI5_V1_frags<-read.delim("~/2_MI5_V1_macfas5liftoff_atac_fragments.tsv.gz", header = FALSE, sep = "\t")
liftoff_4_MI5_V1_frags<-read.delim("~/4_MI5_V1_macfas5liftoff_atac_fragments.tsv.gz", header = FALSE, sep = "\t")
liftoff_5_MI5_V1_frags<-read.delim("~/5_MI5_V1_macfas5liftoff_atac_fragments.tsv.gz", header = FALSE, sep = "\t")
liftoff_6_MI5_V1_frags<-read.delim("~/6_MI5_V1_macfas5liftoff_atac_fragments.tsv.gz", header = FALSE, sep = "\t")
liftoff_7_MI5_V1_frags<-read.delim("~/7_MI5_V1_macfas5liftoff_atac_fragments.tsv.gz", header = FALSE, sep = "\t")
liftoff_8_MI5_V1_frags<-read.delim("~/8_MI5_V1_macfas5liftoff_atac_fragments.tsv.gz", header = FALSE, sep = "\t")

I've also tried reading in the atac fragment files in parallel using the parallel R package, and it still takes a long time to run. I saved the matrix files and the fragment files as RDS files to see if loading would be faster but that doesn't seem to be the case.

# Set the number of cores to use
n_cores <- detectCores() - 1

# List of file paths for count matrices and fragment files
matrix_files <- c(
  "~/macfas5liftoff_MI5_multiomics_rds_files/liftoff_1_MI5_V1_matrix_10_2_24.rds",
  "~/macfas5liftoff_MI5_multiomics_rds_files/liftoff_2_MI5_V1_matrix_10_2_24.rds",
  "~/macfas5liftoff_MI5_multiomics_rds_files/liftoff_4_MI5_V1_matrix_10_2_24.rds",
  "~/macfas5liftoff_MI5_multiomics_rds_files/liftoff_5_MI5_V1_matrix_10_2_24.rds",
  "~/macfas5liftoff_MI5_multiomics_rds_files/liftoff_6_MI5_V1_matrix_10_2_24.rds",
  "~/macfas5liftoff_MI5_multiomics_rds_files/liftoff_7_MI5_V1_matrix_10_2_24.rds",
  "~/macfas5liftoff_MI5_multiomics_rds_files/liftoff_8_MI5_V1_matrix_10_2_24.rds"
)

# Read count matrices in parallel
count_matrices <- mclapply(matrix_files, readRDS, mc.cores = n_cores)

# Assign each matrix to a separate object. R counting starts at 1 unlike python
liftoff_1_MI5_V1_matrix <- count_matrices[[1]]
liftoff_2_MI5_V1_matrix <- count_matrices[[2]]
liftoff_4_MI5_V1_matrix <- count_matrices[[3]]
liftoff_5_MI5_V1_matrix <- count_matrices[[4]]
liftoff_6_MI5_V1_matrix <- count_matrices[[5]]
liftoff_7_MI5_V1_matrix <- count_matrices[[6]]
liftoff_8_MI5_V1_matrix <- count_matrices[[7]]

I tried reading in just the first sample so that I can create the Seurat Object containing the RNA data and then create an ATAC assay and adding it to the object

liftoff_1_MI5_V1_matrix<-Read10X_h5("~/1_MI5_V1_macfas5liftoff_filtered_feature_bc_matrix.h5")
liftoff_1_MI5_V1_frags<-read.delim("~/1_MI5_V1_macfas5liftoff_atac_fragments.tsv.gz", header = FALSE, sep = "\t")

When I get to creating the objects, it's creates the Seurat object as normal, but then it gets stuck on creating the ATAC assay so it doesn't get created and isn't a part of the Seurat object.

# create a Seurat object containing the RNA adata...done
liftoff_1_MI5_V1_SO <- CreateSeuratObject(
  counts = liftoff_1_MI5_V1_matrix$`Gene Expression`,
  assay = "RNA"
)

# create ATAC assay and add it to the object (what it gets stuck on)
liftoff_1_MI5_V1_SO[["ATAC"]] <- CreateChromatinAssay(
  counts = liftoff_1_MI5_V1_matrix$Peaks,
  sep = c(":", "-"),
  fragments = liftoff_1_MI5_V1_frags
)

I'm not sure why this is happening or if there's something wrong with the atac fragments file?

This is what the atac fragments file looks like when it's loaded in. I noticed that inside the file doesn't have tab separated columns with the chromosome number, start and end position, cell barcode, and peak score. All the data is in one column. Could that be the reason why it's not loading in prooperly? Or is it just because the file is gzipped?

View(sample_1_frags)
1 # id=cellranger-arc-1_MI5_V1_240716_macfas5liftoff 
2 # description=cellranger-arc-1_MI5_V1_240716_macfas5liftoff    
3 #    
4 # pipeline_name=cellranger-arc    
5 # pipeline_version=cellranger-arc-2.0.2    
6 #
7 # reference_path=/home/genevieve.baddoo1-umw/macaque_multiomics/macfas5_liftoff
8 # reference_fasta_hash=98d42c2eeb5d3e954e32cdc11e0b35da174e62ae 
9 # reference_gtf_hash=1c05246c871fa6fa018f81f7b79e630ba87e4f8f 
10 # reference_version= 
11 # mkref_version=cellranger-arc-2.0.2 
12 # 
13 # primary_contig=chr1 
14 # primary_contig=chr10 
15 # primary_contig=chr10_KE145718_random 
16 # primary_contig=chr10_AQIA01009600_random 
17 # primary_contig=chr10_KE145721_random 
18 # primary_contig=chr11 
19 # primary_contig=chr11_KE145731_random 
20 # primary_contig=chr11_KE145732_random 
21 # primary_contig=chr11_KE145734_random 
22 # primary_contig=chr11_AQIA01013624_random 
23 # primary_contig=chr11_AQIA01013619_random 
24 # primary_contig=chr12 
25 # primary_contig=chr12_KE145746_random 
26 # primary_contig=chr12_KE145748_random 
27 # primary_contig=chr13 
28 # primary_contig=chr13_KE145751_random 
29 # primary_contig=chr14 
30 # primary_contig=chr14_KE145763_random 
31 # primary_contig=chr14_KE145765_random 
32 # primary_contig=chr14_KE145768_random 
33 # primary_contig=chr14_KE145774_random 
34 # primary_contig=chr15

timoast commented 1 month ago

Please take another look at the vignettes on the Signac website, it is not necessary or recommended to read the fragment file into R. You just need to give the path to the file on disk when creating your chromatin assay

zgb963 commented 1 month ago

@timoast Yes I realized soon after I posted and forgot to mention that I fixed it, thanks

stuart-lab / signac

Freezing/Crashing in RStudio when reading in Multiomics data #1798