stuart-lab / signac

R toolkit for the analysis of single-cell chromatin data
https://stuartlab.org/signac/
Other
327 stars 88 forks source link

Error when calling GeneActivity for atac data #1551

Closed YaoLi3 closed 10 months ago

YaoLi3 commented 10 months ago

Dear Signac team,

Hi, I'm trying to perform label transferring between ATAC-seq data and annotated scRNA-seq data. However, I keep getting this error when trying to call GeneActivity on this.

> GeneActivity(data.atac, features = VariableFeatures(data.rna))
> Error in `[[<-`(`*tmp*`, name, value = logical(0)) : 
      0 elements in value to replace 22094 elements

I don't understand the meaning of this error message. Does this mean the feature names (gene names of RNA data) can not match the gene names in atac data? Gene names only exist in the annotation of atac data object by the way.

I have tried to ensure the cell and gene naming convention is consistent between the RNA data and ATAC data, but I'm still getting the same error.

Here is my gff annotation:

GRanges object with 22094 ranges and 8 metadata columns:
               seqnames        ranges strand |       source     type     score
                  <Rle>     <IRanges>  <Rle> |     <factor> <factor> <numeric>
      [1]     Hic-chr-1   82483-83603      + | transdecoder     gene        NA
      [2]     Hic-chr-1   83839-86863      - | transdecoder     gene        NA
      [3]     Hic-chr-1   88932-91323      - | transdecoder     gene        NA
      [4]     Hic-chr-1 275836-296219      - | transdecoder     gene        NA
      [5]     Hic-chr-1 299286-300170      - | transdecoder     gene        NA
      ...           ...           ...    ... .          ...      ...       ...
  [22090] scaffold-1942      918-3779      + | transdecoder     gene        NA
  [22091] scaffold-1945   17153-23919      - | transdecoder     gene        NA
  [22092] scaffold-1946   12456-13017      - | transdecoder     gene        NA
  [22093] scaffold-1947   45557-54662      + | transdecoder     gene        NA
  [22094] scaffold-1948   10766-22313      - | transdecoder     gene        NA
              phase           gene_id                ID                   Name
          <integer>       <character>       <character>            <character>
      [1]      <NA>           MSTRG.1           MSTRG.1 ORF type:5prime_part..
      [2]      <NA>           MSTRG.2           MSTRG.2 ORF type:5prime_part..
      [3]      <NA>           MSTRG.3           MSTRG.3 ORF type:complete le..
      [4]      <NA>    nbisL1-mrna-61    nbisL1-mrna-61 ORF type:complete le..
      [5]      <NA>           MSTRG.9           MSTRG.9 ORF type:complete le..
      ...       ...               ...               ...                    ...
  [22090]      <NA>       MSTRG.31802       MSTRG.31802 ORF type:complete le..
  [22091]      <NA> nbisL1-mrna-19027 nbisL1-mrna-19027 ORF type:complete le..
  [22092]      <NA>       MSTRG.31804       MSTRG.31804 ORF type:internal le..
  [22093]      <NA> nbisL1-mrna-19028 nbisL1-mrna-19028 ORF type:5prime_part..
  [22094]      <NA> nbisL1-mrna-19031 nbisL1-mrna-19031 ORF type:complete le..
                   Parent
          <CharacterList>
      [1]                
      [2]                
      [3]                
      [4]                
      [5]                
      ...             ...
  [22090]                
  [22091]                
  [22092]                
  [22093]                
  [22094]                
  -------
  seqinfo: 858 sequences from an unspecified genome; no seqlengths

And this is my session info:

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /dellfsqd3/MGI_QINGDAO/USER/lishuangshuang/software/miniconda3/envs/scATAC/lib/libopenblasp-r0.3.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] stringr_1.4.1        sp_1.5-0             SeuratObject_4.1.0  
 [4] Seurat_4.1.1         GenomicRanges_1.46.1 Signac_1.7.0        
 [7] GenomeInfoDb_1.30.1  IRanges_2.28.0       S4Vectors_0.32.4    
[10] BiocGenerics_0.40.0 

loaded via a namespace (and not attached):
  [1] Rtsne_0.16                  colorspace_2.0-3           
  [3] deldir_1.0-6                rjson_0.2.21               
  [5] ellipsis_0.3.2              ggridges_0.5.3             
  [7] XVector_0.34.0              spatstat.data_2.2-0        
  [9] leiden_0.4.2                listenv_0.8.0              
 [11] ggrepel_0.9.1               fansi_1.0.3                
 [13] codetools_0.2-18            splines_4.1.2              
 [15] polyclip_1.10-0             RcppRoll_0.3.0             
 [17] jsonlite_1.8.3              Rsamtools_2.10.0           
 [19] ica_1.0-2                   cluster_2.1.3              
 [21] png_0.1-7                   rgeos_0.5-8                
 [23] uwot_0.1.11                 spatstat.sparse_2.1-1      
 [25] sctransform_0.3.3           shiny_1.7.2                
 [27] compiler_4.1.2              httr_1.4.4                 
 [29] Matrix_1.4-1                fastmap_1.1.0              
 [31] lazyeval_0.2.2              cli_3.4.1                  
 [33] later_1.3.0                 htmltools_0.5.3            
 [35] tools_4.1.2                 igraph_1.3.2               
 [37] gtable_0.3.1                glue_1.6.2                 
 [39] GenomeInfoDbData_1.2.7      reshape2_1.4.4             
 [41] RANN_2.6.1                  dplyr_1.0.10               
 [43] fastmatch_1.1-3             Rcpp_1.0.9                 
 [45] scattermore_0.8             Biobase_2.54.0             
 [47] vctrs_0.4.2                 Biostrings_2.62.0          
 [49] nlme_3.1-158                rtracklayer_1.54.0         
 [51] progressr_0.10.1            lmtest_0.9-40              
 [53] spatstat.random_2.2-0       globals_0.15.1             
 [55] mime_0.12                   miniUI_0.1.1.1             
 [57] lifecycle_1.0.3             irlba_2.3.5                
 [59] restfulr_0.0.15             goftest_1.2-3              
 [61] XML_3.99-0.9                future_1.26.1              
 [63] zlibbioc_1.40.0             MASS_7.3-57                
 [65] zoo_1.8-10                  scales_1.2.1               
 [67] spatstat.core_2.4-4         spatstat.utils_2.3-1       
 [69] promises_1.2.0.1            MatrixGenerics_1.6.0       
 [71] parallel_4.1.2              SummarizedExperiment_1.24.0
 [73] RColorBrewer_1.1-3          yaml_2.3.6                 
 [75] gridExtra_2.3               reticulate_1.25            
 [77] pbapply_1.5-0               ggplot2_3.3.6              
 [79] rpart_4.1.16                stringi_1.7.6              
 [81] BiocIO_1.4.0                BiocParallel_1.28.3        
 [83] rlang_1.0.6                 pkgconfig_2.0.3            
 [85] matrixStats_0.62.0          bitops_1.0-7               
 [87] lattice_0.20-45             tensor_1.5                 
 [89] ROCR_1.0-11                 purrr_0.3.5                
 [91] GenomicAlignments_1.30.0    patchwork_1.1.1            
 [93] htmlwidgets_1.5.4           cowplot_1.1.1              
 [95] tidyselect_1.2.0            parallelly_1.32.0          
 [97] RcppAnnoy_0.0.19            plyr_1.8.7                 
 [99] magrittr_2.0.3              R6_2.5.1                   
[101] generics_0.1.3              DelayedArray_0.20.0        
[103] withr_2.5.0                 mgcv_1.8-40                
[105] pillar_1.8.1                fitdistrplus_1.1-8         
[107] abind_1.4-5                 survival_3.3-1             
[109] RCurl_1.98-1.7              tibble_3.1.8               
[111] future.apply_1.9.0          crayon_1.5.2               
[113] KernSmooth_2.23-20          utf8_1.2.2                 
[115] spatstat.geom_2.4-0         plotly_4.10.0              
[117] grid_4.1.2                  data.table_1.14.4          
[119] digest_0.6.30               xtable_1.8-4               
[121] tidyr_1.2.1                 httpuv_1.6.6               
[123] munsell_0.5.0               viridisLite_0.4.1     

Thanks in advance!

Best, Yao

timoast commented 10 months ago

Hi, can you show the full code you're running, and the output of head(VariableFeatures(data.rna))?

Could you also try updating to the latest Signac release?

YaoLi3 commented 10 months ago

the output of head(VariableFeatures(data.rna)) is:

> head(VariableFeatures(data.rna))
[1] "nbisL1-mrna-12678" "MSTRG.16655"       "nbisL1-mrna-2064" 
[4] "nbisL1-mrna-16480" "MSTRG.15682"       "nbisL1-mrna-4839" 

the full code I'm running:

library(Seurat)
library(Signac)
library(Matrix)

# load scRNA data
fn <- 'liver.rds'
data.rna <- readRDS(fn)

# load scATAC data
load_atac <- function(directory_path){
    barcode_fn <- file.path(directory_path, "barcodes.tsv")
    peaks_fn <- file.path(directory_path, "peak.bed")
    matrix_fn<- file.path(directory_path, "matrix.mtx")

    # Load the data files
    barcodes <- read.table(barcode_fn, sep = "\t", header = FALSE)
    peaks <- read.table(peaks_fn, sep = "\t", check.names=FALSE)
    frag_matrix <- readMM(matrix_fn)

    # Set the row names for the matrix from the peaks object
    rownames(frag_matrix) <- peaks$V1

    # Set the column names for the matrix
    colnames(frag_matrix) <- barcodes$V1

    return (frag_matrix)
}
dir2 <- 'liver-2_1'
peaks <- load_atac(dir2)

# 2. chromain assay
metadata <- read.table('Metadata.tsv',sep='\t',header =TRUE)

frag_fn <- 'liver-2_1.fragments.tsv.gz'

chrom_assay <- CreateChromatinAssay(
  counts = peaks,
  sep = c(":", "-","_"),
  fragments = frag_fn,
  min.cells = 10,
  min.features = 200
)

data.atac <- CreateSeuratObject(
  counts = chrom_assay,
  assay = "peaks",
  meta.data = metadata
)

data.atac$tech <- 'atac'

# preprocessing
data.atac <- FindVariableFeatures(data.atac)#, assay = "ACTIVITY", selection.method = "vst", nfeatures = 2000)
data.atac <- NormalizeData(data.atac)
data.atac <- ScaleData(data.atac)

# ATAC analysis add gene annotation information
gff_fn = 'final.gff3'
gff = rtracklayer::import.gff(gff_fn)
gene_gff = gff[gff$type=='gene']
Annotation(data.atac) <- gene_gff

# Dimension Reduction of the peak matrix via Latent Semantic Indexing
# To study the internal structure of the scATAC-seq data, making anchor detection possible
VariableFeatures(data.atac) <- names(which(Matrix::rowSums(data.atac) > 100))
# We exclude the first dimension as this is typically correlated with sequencing depth
data.atac <- RunTFIDF(data.atac)
data.atac <- FindTopFeatures(data.atac, min.cutoff = "q0")
data.atac <- RunSVD(data.atac)
data.atac <- RunUMAP(data.atac, reduction = "lsi", dims = 2:30, reduction.name = "umap.atac", reduction.key = "atacUMAP_")

# Identifying anchors between scRNA-seq and scATAC-seq datasets
# transfer label
# quantify gene activity
gene.activities <- GeneActivity(data.atac, features = VariableFeatures(data.rna))

As for updating the Signac package, I'm afraid this is going to be difficult. I had a hard time installing the package on the Linux server my institute provided, so I'm currently using a R miniconda environment my colleague kindly gave to me. It's not very likely to get her to update it for me. Do I need to do this?

timoast commented 10 months ago

I think the issue here is you don't have biotypes information in the gene annotations that you're using. If you add a column like this:

gene_gff$biotype <- "protein_coding"

can you see if that fixes the problem?

YaoLi3 commented 10 months ago
> gene_gff$biotype <- "protein_coding"
> Annotation(data.atac) = gene_gff
> gene.activities <- GeneActivity(data.atac, features = VariableFeatures(data.rna))
Error in `[[<-`(`*tmp*`, name, value = logical(0)) : 
  0 elements in value to replace 22094 elements

tried this, still getting the same error

timoast commented 10 months ago

sorry it should be gene_biotype not biotype. Try:

gene_gff$gene_biotype <- "protein_coding"
YaoLi3 commented 10 months ago

sorry but still... Really don't understand where this error is from

Error in `[[<-`(`*tmp*`, name, value = logical(0)) : 
  0 elements in value to replace 22094 elements
timoast commented 10 months ago

You also need to add gene_name information to your annotations since you're setting the features parameter in GeneActivity().

gene_gff$gene_name <- gene_gff$gene_id

I think this should resolve the issue, but if not please take a look at the example gene annotations included in the atac_small dataset in the Signac package (Annotation(atac_small)), and ensure that the annotations you're using matches the format of those annotations. If they do match and you think this is a bug in Signac, please reopen this issue.

YaoLi3 commented 10 months ago

yes, you are right, adding gene_name column into my annotations solved the issue. I falsely assumed GeneActivity() only uses gene_id information.

Thanks again your help!