stuart-lab / signac

R toolkit for the analysis of single-cell chromatin data
https://stuartlab.org/signac/
Other
325 stars 88 forks source link

RegionStats for Dm6 Chromosome Error #860

Closed alwhiteh closed 2 years ago

alwhiteh commented 2 years ago

Hi,

I'm working with a multiome single cell dataset from drosophila that was generated using the 10X platform. When following the multiome vignette, I realized that an R package for the dm6 ensemble genome did not exist (at least in a format that played nicely with the CreateChromatinAssay function). Therefore, I generated the annotation data using sitadela, and with some minor seqlevels formatting changes, we able to complete most of the vignette, with the exception of the RegionStats function. I get the following error:

Error in loadFUN(x, seqname, ranges) : trying to load regions beyond the boundaries of non-circular sequence "chr3L"

I'm not aware of a public dm6 dataset, but my code is attached.


`# Load packages
library(Seurat) #for RNA
library(hdf5r)
library(Signac) #for ATAC
library(dplyr)
library(Matrix)
library(scran)
library(ggplot2)
library(sctransform)
library(patchwork)
library(JASPAR2020)
library(TFBSTools)
library(renv)
library(BSgenome.Dmelanogaster.UCSC.dm6)
fly_genome <- BSgenome.Dmelanogaster.UCSC.dm6
library(chromVAR)
library(Rcpp)
library(sitadela) #we will use this to get the annotation for Dm6

# We will follow the Satija Lab vignette on multiomics from 10x: https://satijalab.org/signac/articles/pbmc_multiomic.html
# Load RNA and ATAC data
counts <- Read10X_h5("/Volumes/TUNEZ/Clara/filtered_feature_bc_matrix.h5")
fragpath <- "/Volumes/TUNEZ/Clara/atac_fragments.tsv.gz"

# get gene annotations for dm6
# We will use sitadela to build our own genome annotation object that includes biotype
buildDir <- file.path(tempdir(),"test_anndb")
dir.create(buildDir)

# The location of the custom database
myDb <- file.path(buildDir,"testann.sqlite")

# Make the annotation
fly_annotation <- loadAnnotation(genome="dm6",refdb="ensembl",type="gene",db=myDb)
fly_annotation # Here we can see what the GRanges object looks like
genome(fly_annotation) <- "dm6"
seqlevelsStyle(fly_annotation) <- "NCBI" # we need this to get rid of the "chr" before the chromosome number
fly_annotation$gene_biotype <- fly_annotation$biotype # This is the variable name that Signac requires

#Create object for RNA Data
fly <- CreateSeuratObject(
  counts = counts$`Gene Expression`,
  assay = "RNA")

# Now we can add the ATAC data to the seurat object
fly[["ATAC"]] <- CreateChromatinAssay(
  counts = counts$Peaks,
  sep = c(":", "-"),
  fragments = fragpath,
  annotation = fly_annotation
)

DefaultAssay(fly) <- "ATAC"

#calculate nucleosome signal and annotate
fly <- NucleosomeSignal(fly)
Annotation(fly) <- fly_annotation

# Calculate TSS Enrichment
fly <- TSSEnrichment(fly)

# filter out low quality cells
fly2 <- subset(
  x = fly,
  subset = nCount_ATAC < 10000 &
    nCount_RNA < 10000 &
    nCount_ATAC > 10 &
    nCount_RNA > 10 &
    nucleosome_signal < 0.9 &
    TSS.enrichment > 1
)
fly2

# call peaks using MACS2
peaks <- CallPeaks(fly2, macs2.path = "/Users/alexanderwhitehead/anaconda3/bin/macs2")

# remove peaks on nonstandard chromosomes and in genomic blacklist regions
peaks <- keepStandardChromosomes(peaks, pruning.mode = "coarse")

# We need to change the seq levels to match the blacklist formatting
# i.e. the blacklist uses "Chr1" rather than "1"
seqinfo(peaks)
seqlevelsStyle(peaks) <- "UCSC"
seqlevels(peaks)
peaks <- subsetByOverlaps(x = peaks, ranges = blacklist_dm6, invert = TRUE)
seqlevelsStyle(peaks) <- "NCBI"

# quantify counts in each peak
macs2_counts <- FeatureMatrix(
  fragments = Fragments(fly2),
  features = peaks,
  cells = colnames(fly2)
)

# create a new assay using the MACS2 peak set and add it to the Seurat object
fly2[["peaks"]] <- CreateChromatinAssay(
  counts = macs2_counts,
  fragments = fragpath,
  annotation = fly_annotation
)

# Let's start looking at the RNA data first
DefaultAssay(fly2) <- "RNA"
fly2 <- SCTransform(fly2)
fly2 <- RunPCA(fly2)

# Then find features for the ATAC data
DefaultAssay(fly2) <- "peaks"
fly2 <- FindTopFeatures(fly2, min.cutoff = 5)
fly2 <- RunTFIDF(fly2)
fly2 <- RunSVD(fly2)

DefaultAssay(fly2) <- "SCT"

fly2 <- FindMultiModalNeighbors(
  object = fly2,
  reduction.list = list("pca", "lsi"), 
  dims.list = list(1:50, 2:40),
  modality.weight.name = c("RNA.weight","ATAC.weight"),
  verbose = TRUE)

# build a joint UMAP visualization via RNA 
fly2 <- RunUMAP(
  object = fly2,
  nn.name = "weighted.nn",
  assay = "RNA",
  verbose = TRUE)

# We need a clustering step HERE for transformed RNA:
DefaultAssay(fly2) <- "SCT"
fly2 <- FindNeighbors(fly2, dims = 1:50)
#find clusters
fly2 <- FindClusters(fly2, resolution = 0.5)
fly2 <- FindClusters(fly2, resolution = 0.3)
fly2 <- RunUMAP(fly2, dims = 1:30)
DimPlot(fly2, label = TRUE)

fly2 <- RenameIdents(fly2, `0` = "0", `1` = "1", `2` = "2",`3`= "3", 
                         `4` = "Peri", `5` = "5", `6` = "6", `7` = "CM",
                         `8` = "8", `9` = "9", `10`="10")

fly_genome <- BSgenome.Dmelanogaster.UCSC.dm6
DefaultAssay(fly2) <- "peaks"
seqlevelsStyle(fly_genome) <- "NCBI"
seqinfo(fly_genome)
fly2 <- RegionStats(fly2[['peaks']], genome = fly_genome)`

I am running Mac OS 10.15.7 and my session info is:

R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] stringr_1.4.0                         BSgenome.Hsapiens.UCSC.hg38_1.4.4     EnsDb.Hsapiens.v86_2.99.0            
 [4] ensembldb_2.18.1                      AnnotationFilter_1.18.0               GenomicFeatures_1.46.1               
 [7] AnnotationDbi_1.56.1                  BSgenome.Dmelanogaster.UCSC.dm6_1.4.1 BSgenome_1.62.0                      
[10] rtracklayer_1.54.0                    Biostrings_2.62.0                     XVector_0.34.0                       
[13] sitadela_1.2.0                        Rcpp_1.0.7                            chromVAR_1.16.0                      
[16] renv_0.12.0-13                        TFBSTools_1.32.0                      JASPAR2020_0.99.10                   
[19] patchwork_1.1.1                       sctransform_0.3.2                     ggplot2_3.3.5                        
[22] scran_1.22.0                          scuttle_1.4.0                         SingleCellExperiment_1.16.0          
[25] SummarizedExperiment_1.24.0           Biobase_2.54.0                        GenomicRanges_1.46.0                 
[28] GenomeInfoDb_1.30.0                   IRanges_2.28.0                        S4Vectors_0.32.1                     
[31] BiocGenerics_0.40.0                   MatrixGenerics_1.6.0                  matrixStats_0.61.0                   
[34] Matrix_1.3-4                          dplyr_1.0.7                           Signac_1.4.0                         
[37] hdf5r_1.3.4                           SeuratObject_4.0.2                    Seurat_4.0.5                         

loaded via a namespace (and not attached):
  [1] rappdirs_0.3.3              SnowballC_0.7.0             scattermore_0.7             R.methodsS3_1.8.1          
  [5] tidyr_1.1.4                 knitr_1.36                  bit64_4.0.5                 irlba_2.3.3                
  [9] DelayedArray_0.20.0         R.utils_2.11.0              data.table_1.14.2           rpart_4.1-15               
 [13] KEGGREST_1.34.0             RCurl_1.98-1.5              generics_0.1.1              ScaledMatrix_1.2.0         
 [17] cowplot_1.1.1               RSQLite_2.2.8               RANN_2.6.1                  future_1.23.0              
 [21] bit_4.0.4                   tzdb_0.2.0                  spatstat.data_2.1-0         xml2_1.3.2                 
 [25] httpuv_1.6.3                assertthat_0.2.1            DirichletMultinomial_1.36.0 xfun_0.28                  
 [29] hms_1.1.1                   promises_1.2.0.1            fansi_0.5.0                 restfulr_0.0.13            
 [33] progress_1.2.2              caTools_1.18.2              dbplyr_2.1.1                igraph_1.2.7               
 [37] DBI_1.1.1                   htmlwidgets_1.5.4           sparsesvd_0.2               spatstat.geom_2.3-0        
 [41] purrr_0.3.4                 ellipsis_0.3.2              RSpectra_0.16-0             backports_1.3.0            
 [45] annotate_1.72.0             biomaRt_2.50.0              deldir_1.0-6                sparseMatrixStats_1.6.0    
 [49] vctrs_0.3.8                 ROCR_1.0-11                 abind_1.4-5                 cachem_1.0.6               
 [53] withr_2.4.2                 ggforce_0.3.3               checkmate_2.0.0             GenomicAlignments_1.30.0   
 [57] prettyunits_1.1.1           goftest_1.2-3               cluster_2.1.2               lazyeval_0.2.2             
 [61] seqLogo_1.60.0              crayon_1.4.2                edgeR_3.36.0                pkgconfig_2.0.3            
 [65] slam_0.1-48                 labeling_0.4.2              tweenr_1.0.2                ProtGenerics_1.26.0        
 [69] nlme_3.1-153                nnet_7.3-16                 gitcreds_0.1.1              rlang_0.4.12               
 [73] globals_0.14.0              lifecycle_1.0.1             miniUI_0.1.1.1              filelock_1.0.2             
 [77] BiocFileCache_2.2.0         rsvd_1.0.5                  dichromat_2.0-0             polyclip_1.10-0            
 [81] lmtest_0.9-39               ggseqlogo_0.1               zoo_1.8-9                   base64enc_0.1-3            
 [85] ggridges_0.5.3              png_0.1-7                   viridisLite_0.4.0           rjson_0.2.20               
 [89] bitops_1.0-7                R.oo_1.24.0                 KernSmooth_2.23-20          blob_1.2.2                 
 [93] DelayedMatrixStats_1.16.0   parallelly_1.28.1           jpeg_0.1-9                  readr_2.0.2                
 [97] CNEr_1.30.0                 beachmat_2.10.0             scales_1.1.1                memoise_2.0.0              
[101] magrittr_2.0.1              plyr_1.8.6                  ica_1.0-2                   zlibbioc_1.40.0            
[105] compiler_4.1.2              dqrng_0.3.0                 BiocIO_1.4.0                RColorBrewer_1.1-2         
[109] fitdistrplus_1.1-6          cli_3.1.0                   Rsamtools_2.10.0            listenv_0.8.0              
[113] pbapply_1.5-0               htmlTable_2.3.0             Formula_1.2-4               MASS_7.3-54                
[117] mgcv_1.8-38                 tidyselect_1.1.1            stringi_1.7.5               yaml_2.2.1                 
[121] BiocSingular_1.10.0         locfit_1.5-9.4              latticeExtra_0.6-29         ggrepel_0.9.1              
[125] grid_4.1.2                  VariantAnnotation_1.40.0    fastmatch_1.1-3             tools_4.1.2                
[129] future.apply_1.8.1          parallel_4.1.2              rstudioapi_0.13             TFMPvalue_0.0.8            
[133] foreign_0.8-81              bluster_1.4.0               lsa_0.73.2                  metapod_1.2.0              
[137] gridExtra_2.3               farver_2.1.0                Rtsne_0.15                  digest_0.6.28              
[141] BiocManager_1.30.16         shiny_1.7.1                 pracma_2.3.3                qlcMatrix_0.9.7            
[145] later_1.3.0                 RcppAnnoy_0.0.19            httr_1.4.2                  biovizBase_1.42.0          
[149] colorspace_2.0-2            XML_3.99-0.8                tensor_1.5                  reticulate_1.22            
[153] splines_4.1.2               uwot_0.1.10                 RcppRoll_0.3.0              statmod_1.4.36             
[157] spatstat.utils_2.2-0        plotly_4.10.0               xtable_1.8-4                jsonlite_1.7.2             
[161] poweRlaw_0.70.6             R6_2.5.1                    Hmisc_4.6-0                 pillar_1.6.4               
[165] htmltools_0.5.2             mime_0.12                   glue_1.5.0                  fastmap_1.1.0              
[169] DT_0.19                     BiocParallel_1.28.0         BiocNeighbors_1.12.0        codetools_0.2-18           
[173] utf8_1.2.2                  lattice_0.20-45             spatstat.sparse_2.0-0       tibble_3.1.5               
[177] curl_4.3.2                  leiden_0.3.9                gtools_3.9.2                GO.db_3.14.0               
[181] survival_3.2-13             limma_3.50.0                docopt_0.7.1                munsell_0.5.0              
[185] GenomeInfoDbData_1.2.7      reshape2_1.4.4              gtable_0.3.0                spatstat.core_2.3-1   
timoast commented 2 years ago

Can you show the output of:

gr <- granges(ly2[['peaks']])
gr <- gr[seqnames(gr) == "chr3L"]
max(end(gr))
alwhiteh commented 2 years ago

Hi Tim,

I got an error with that code:

max(end(gr))
[1] -Inf
Warning message:
In max(end(gr)) : no non-missing arguments to max; returning -Inf

However, when I used "3L" instead of "chr3L" I got:

> max(end(gr))
[1] 28110291

Both my peaks and genome file should have similarly formatted chromosome names:

> head(gr)
GRanges object with 6 ranges and 0 metadata columns:
      seqnames      ranges strand
         <Rle>   <IRanges>  <Rle>
  [1]       3L 32221-35564      *
  [2]       3L 39830-40723      *
  [3]       3L 41776-42611      *
  [4]       3L 57607-58043      *
  [5]       3L 73603-73803      *
  [6]       3L 85160-85561      *

and

> head(seqlevels(fly_genome))
[1] "2L" "2R" "3L" "3R" "4"  "X" 

I did, however, notice in the fly_genome object that seqinfo -> seqnames contains the "Chr2L" formatting, while user_seqnames contains the "2L" formatting. Is it possible that the command is using the former part of the genome object?

timoast commented 2 years ago

@alwhiteh are your coordinates 0-based? If so can you try the suggestions here: https://github.com/timoast/signac/issues/864

timoast commented 2 years ago

If not could you send me the granges object for your assay? You can save with the following code:

gr <- granges(ly2[['peaks']])
saveRDS(gr, "granges_signac.rds")

and email to tstuart@nygenome.org

timoast commented 2 years ago

Hi @alwhiteh

I should have realized earlier, but you indeed have peaks that extend beyond the end of chromosome 3L:

> end(fly_genome$`3L`)
[1] 28110227        1

> max(end(gr[seqnames(gr) == "3L"]))
[1] 28110291

I'd suggest double-checking the version of the genome used to map the data