10X multiome subsetting cancer cell line with blacklist_ratio ~ 1 for both hg38_blacklist and hg38_blacklist_unified

methornton commented 1 month ago

Discussed in https://github.com/stuart-lab/signac/discussions/1740

^{Originally posted by **methornton** July 9, 2024} Hello! I am processing some cancer cell line 10X plus Mulitome data and I have kind of an outlier as far as the hg38_blacklist_ratio. First, I had to add the metadata from cellranger file 'per_barcode_metrics.csv' file. I used some info from this Q&A site for generating the blacklist_ratio. ``` # add blacklist ratio and fraction of reads in peaks ap1$pct_reads_in_peaks <- ap1$atac_peak_region_fragments / ap1$atac_fragments * 100 ap1$blacklist_region_fragments <- FractionCountsInRegion(object = ap1, assay = 'ATAC', regions = blacklist_hg38) ap1$blacklist_ratio <- ap1$blacklist_region_fragments / ap1$atac_peak_region_fragments ``` I was able to generate the other QC metrics and here are the images. ![09Jul24_AP1_Vlnplot](https://github.com/stuart-lab/signac/assets/5246174/2b3af811-490d-4604-a60f-7eb28b7a3202) ![09Jul24_AP1_Vlnplot_hg38_unified](https://github.com/stuart-lab/signac/assets/5246174/085f4505-b7c8-4f2f-aba4-6c69754bdf79) Any information or advice is greatly appreciated. The nucleosome values are also fairly low. What does that really mean? To me all of the other values look decent. Did I mess up the blacklist ratio calculation? TIA

timoast commented 1 month ago

The FractionCountsInRegion function will return the fraction of counts in the blacklisted region, so you shouldn't divide this again by the total fragments in the cell. However, it still does not make sense that the values are centered around one. Can you show the complete code and output of sessionInfo()?

methornton commented 1 month ago

Hello! Absolutely. Thank you!

library(Signac)
library(Seurat)
library(AnnotationHub)
library(BSgenome.Hsapiens.UCSC.hg38)
library(dplyr)
library(ggplot2)
library(patchwork)
library(reticulate)
use_python("/home/met/.conda/envs/r-reticulate/bin/python3")

# This will get the proper version of the Ensembl for your data
hub <- AnnotationHub()
query(hub, c("ensdb","homo sapiens","112"))
ensdb <- hub[["AH116860"]]

#set.seed(1234)

today <- Sys.Date()
dt <- format(today, format="%d%b%y")

# import the 10X hdf5 file with both data
ap1.10x <- Read10X_h5("/dataVol/data/Petrosyan/08Jul24/AP1/outs/filtered_feature_bc_matrix.h5")

# extract the RNA and ATAC data
ap1_rna_counts <- ap1.10x$`Gene Expression`
ap1_atac_counts <- ap1.10x$Peaks

ap1_metadata <- read.csv(file = "/dataVol/data/Petrosyan/08Jul24/AP1/outs/per_barcode_metrics.csv", sep = ",", header = TRUE, row.names = 1)

# Create seurat objects
ap1 <- CreateSeuratObject(counts = ap1_rna_counts, meta.data = ap1_metadata)

# Find percentage of mitochondrial genes
ap1[["percent.mt"]] <- PercentageFeatureSet(ap1, pattern = "^MT-")

ap1_grange.counts <- StringToGRanges(rownames(ap1_atac_counts), sep = c(":", "-"))
ap1_grange.use <- seqnames(ap1_grange.counts) %in% standardChromosomes(ap1_grange.counts)
ap1_atac_counts <- ap1_atac_counts[as.vector(ap1_grange.use), ]
ap1_annotations <- GetGRangesFromEnsDb(ensdb = ensdb)
seqlevelsStyle(ap1_annotations) <- 'UCSC'
genome(ap1_annotations) <- "hg38"

ap1_frag.file <- "/dataVol/data/Petrosyan/08Jul24/AP1/outs/atac_fragments.tsv.gz"
chrom_assay <- CreateChromatinAssay(
   counts = ap1_atac_counts,
   sep = c(":", "-"),
   genome = 'hg38',
   fragments = ap1_frag.file,
   min.cells = 10,
   annotation = ap1_annotations
 )
ap1[["ATAC"]] <- chrom_assay

DefaultAssay(ap1) <- "ATAC"

ap1 <- NucleosomeSignal(ap1)
ap1 <- TSSEnrichment(ap1, fast = FALSE)

# add blacklist ratio and fraction of reads in peaks
ap1$pct_reads_in_peaks <- ap1$atac_peak_region_fragments / ap1$atac_fragments * 100
ap1$blacklist_region_fragments <- FractionCountsInRegion(object = ap1, assay = 'ATAC', regions = blacklist_hg38)
ap1$blacklist_ratio <- ap1$blacklist_region_fragments / ap1$atac_peak_region_fragments

png(filename=paste(dt,"_AP1_TSSEnrichment_v_nCountATAC.png", sep=""), width = 6*300, height = 5*300, res = 300)
DensityScatter(ap1, x = 'nCount_ATAC', y = 'TSS.enrichment', log_x = TRUE, quantiles = TRUE)
dev.off()

ap1$high.tss <- ifelse(ap1$TSS.enrichment > 3, 'High', 'Low')

png(filename=paste(dt,"_AP1_TSS_enrichment.png", sep=""), width = 12*300, height = 5*300, res = 300)
TSSPlot(ap1, group.by = 'high.tss') + NoLegend()
dev.off()

ap1$nucleosome_group <- ifelse(ap1$nucleosome_signal > 4, 'NS > 4', 'NS < 4')

png(filename=paste(dt,"_AP1_Nucleosome_enrichment.png", sep=""), width = 12*300, height = 5*300, res = 300)
FragmentHistogram(object = ap1, group.by = 'nucleosome_group')
dev.off()

## vln plot for the filtering
png(filename=paste(dt,"_AP1_Vlnplot.png", sep=""), width = 13*300, height = 10*300, res = 300)
VlnPlot(ap1, features = c("nCount_RNA", "nCount_ATAC", "TSS.enrichment", "nucleosome_signal", "blacklist_ratio", "pct_reads_in_peaks", "percent.mt"), ncol = 4, log = TRUE, pt.size = 0.1) + NoLegend()
dev.off()

> sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS/LAPACK: /usr/local/lib/libopenblas_zenp-r0.3.18.so;  LAPACK version 3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Los_Angeles
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] reticulate_1.38.0                 patchwork_1.2.0                  
 [3] ggplot2_3.5.1                     dplyr_1.1.4                      
 [5] BSgenome.Hsapiens.UCSC.hg38_1.4.5 BSgenome_1.72.0                  
 [7] rtracklayer_1.64.0                BiocIO_1.14.0                    
 [9] Biostrings_2.72.1                 XVector_0.44.0                   
[11] GenomicRanges_1.56.1              GenomeInfoDb_1.40.1              
[13] IRanges_2.38.1                    S4Vectors_0.42.1                 
[15] AnnotationHub_3.12.0              BiocFileCache_2.12.0             
[17] dbplyr_2.5.0                      BiocGenerics_0.50.0              
[19] Seurat_5.1.0                      SeuratObject_5.0.2               
[21] sp_2.1-4                          Signac_1.13.0                    

loaded via a namespace (and not attached):
  [1] RcppAnnoy_0.0.22            splines_4.4.1              
  [3] later_1.3.2                 bitops_1.0-7               
  [5] filelock_1.0.3              tibble_3.2.1               
  [7] polyclip_1.10-6             XML_3.99-0.17              
  [9] fastDummies_1.7.3           lifecycle_1.0.4            
 [11] globals_0.16.3              lattice_0.22-6             
 [13] MASS_7.3-61                 magrittr_2.0.3             
 [15] plotly_4.10.4               yaml_2.3.8                 
 [17] httpuv_1.6.15               sctransform_0.4.1          
 [19] spam_2.10-0                 spatstat.sparse_3.1-0      
 [21] cowplot_1.1.3               pbapply_1.7-2              
 [23] DBI_1.2.3                   RColorBrewer_1.1-3         
 [25] abind_1.4-5                 zlibbioc_1.50.0            
 [27] Rtsne_0.17                  purrr_1.0.2                
 [29] RCurl_1.98-1.14             rappdirs_0.3.3             
 [31] GenomeInfoDbData_1.2.12     ggrepel_0.9.5              
 [33] irlba_2.3.5.1               listenv_0.9.1              
 [35] spatstat.utils_3.0-5        goftest_1.2-3              
 [37] RSpectra_0.16-1             spatstat.random_3.2-3      
 [39] fitdistrplus_1.1-11         parallelly_1.37.1          
 [41] leiden_0.4.3.1              codetools_0.2-20           
 [43] DelayedArray_0.30.1         RcppRoll_0.3.0             
 [45] tidyselect_1.2.1            UCSC.utils_1.0.0           
 [47] matrixStats_1.3.0           spatstat.explore_3.2-7     
 [49] GenomicAlignments_1.40.0    jsonlite_1.8.8             
 [51] progressr_0.14.0            ggridges_0.5.6             
 [53] survival_3.7-0              tools_4.4.1                
 [55] ica_1.0-3                   Rcpp_1.0.12                
 [57] glue_1.7.0                  SparseArray_1.4.8          
 [59] gridExtra_2.3               MatrixGenerics_1.16.0      
 [61] withr_3.0.0                 BiocManager_1.30.23        
 [63] fastmap_1.2.0               fansi_1.0.6                
 [65] digest_0.6.36               R6_2.5.1                   
 [67] mime_0.12                   colorspace_2.1-0           
 [69] scattermore_1.2             tensor_1.5                 
 [71] spatstat.data_3.1-2         RSQLite_2.3.7              
 [73] utf8_1.2.4                  tidyr_1.3.1                
 [75] generics_0.1.3              data.table_1.15.4          
 [77] httr_1.4.7                  htmlwidgets_1.6.4          
 [79] S4Arrays_1.4.1              uwot_0.2.2                 
 [81] pkgconfig_2.0.3             gtable_0.3.5               
 [83] blob_1.2.4                  lmtest_0.9-40              
 [85] htmltools_0.5.8.1           dotCall64_1.1-1            
 [87] scales_1.3.0                Biobase_2.64.0             
 [89] png_0.1-8                   reshape2_1.4.4             
 [91] rjson_0.2.21                nlme_3.1-165               
 [93] curl_5.2.1                  zoo_1.8-12                 
 [95] cachem_1.1.0                stringr_1.5.1              
 [97] BiocVersion_3.19.1          KernSmooth_2.23-24         
 [99] parallel_4.4.1              miniUI_0.1.1.1             
[101] AnnotationDbi_1.66.0        restfulr_0.0.15            
[103] pillar_1.9.0                grid_4.4.1                 
[105] vctrs_0.6.5                 RANN_2.6.1                 
[107] promises_1.3.0              xtable_1.8-4               
[109] cluster_2.1.6               cli_3.6.3                  
[111] compiler_4.4.1              Rsamtools_2.20.0           
[113] rlang_1.1.4                 crayon_1.5.3               
[115] future.apply_1.11.2         plyr_1.8.9                 
[117] stringi_1.8.4               viridisLite_0.4.2          
[119] deldir_2.0-4                BiocParallel_1.38.0        
[121] munsell_0.5.1               lazyeval_0.2.2             
[123] spatstat.geom_3.2-9         Matrix_1.7-0               
[125] RcppHNSW_0.6.0              bit64_4.0.5                
[127] future_1.33.2               KEGGREST_1.44.1            
[129] shiny_1.8.1.1               SummarizedExperiment_1.34.0
[131] ROCR_1.0-11                 igraph_2.0.3               
[133] memoise_2.0.1               fastmatch_1.1-4            
[135] bit_4.0.5

methornton commented 1 month ago

So I did notice that it is really really close to 1 and if I set the filtering blacklist_ration < 1.015. I still get most of the clusters that I would get without filtering.

methornton commented 1 month ago

I also used the gencode genome fasta with cellranger-arc genomeGenerate. I had read somewhere that the differences beyond having a "chr" were that UCSC style begins on '1' and Ensembl has no "chr" and begins on '0'. Could that be causing the issue? I have another set with the gencode gtf and ensembl fa. I can check.

stuart-lab / signac

10X multiome subsetting cancer cell line with blacklist_ratio ~ 1 for both hg38_blacklist and hg38_blacklist_unified #1741

Discussed in https://github.com/stuart-lab/signac/discussions/1740