satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat
Other
2.27k stars 910 forks source link

subset() changing values in the nFeature_RNA and nCount_RNA to zero #7299

Closed nikita-telkar closed 3 months ago

nikita-telkar commented 1 year ago

Hello,

I've been re-analysing my data and trying to subset my object to keep nFeature >= 500. But after subsetting, the meta.data shows changes in the values in the nFeature_RNA and nCount_RNA. My previous iteration of the analysis was in the earlier version of Seurat, so I'm updating the script using the newer 4.3.0 version.

  1. This is how the meta.data of the original object (named hs) looks like - note that I've sorted the column by nFeature_RNA, and that the lowest count within the column is 100. Total cells = 54,450

Screenshot 2023-05-08 at 6 20 47 PM

  1. When I run either of the following:

filtered_qc <- subset(hs, nFeature_RNA >= 500) filtered_qc <- subset(hs, subset = nFeature_RNA >= 500)

This is what I get - note that sorting the column based on nFeature_RNA, the lowest count is zero. Total cells = 47,269

Screenshot 2023-05-08 at 6 22 33 PM

  1. Note the values for the same two cells:

Original dataframe: hs Screenshot 2023-05-08 at 6 43 56 PM

Filtered dataframe: filtered_qc
Screenshot 2023-05-08 at 6 43 20 PM

  1. If I run either of the two commands again, I get the correct subsetting (n = ~7k cells). Seems somewhat similar to https://github.com/satijalab/seurat/issues/2292

Am I missing something within the manual that states that the values change?

Thank you very much in advance!

yuhanH commented 1 year ago

hi @nikita-telkar Could you show us your sessionInfo()? Also, Do you meet this situation for pbmc3k? For example,

> library(Seurat)
> library(SeuratData)
> obj <- LoadData(ds = 'pbmc3k')
> head(obj$nFeature_RNA)
AAACATACAACCAC AAACATTGAGCTAC AAACATTGATCAGC AAACCGTGCTTCCG AAACCGTGTATGCG AAACGCACTGGTAC 
           779           1352           1129            960            521            781 
> 
> obj.filter <- subset(obj, subset = nFeature_RNA > 500)
> head(obj.filter$nFeature_RNA)
AAACATACAACCAC AAACATTGAGCTAC AAACATTGATCAGC AAACCGTGCTTCCG AAACCGTGTATGCG AAACGCACTGGTAC 
           779           1352           1129            960            521            781 
nikita-telkar commented 1 year ago

@yuhanH

  1. SessionInfo()
R version 4.2.2 (2022-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.3.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] pbmc3k.SeuratData_3.1.4 SeuratData_0.2.2        lubridate_1.9.2        
 [4] forcats_1.0.0           stringr_1.5.0           dplyr_1.1.2            
 [7] purrr_1.0.1             readr_2.1.4             tidyr_1.3.0            
[10] tibble_3.2.1            tidyverse_2.0.0         ggplot2_3.4.2          
[13] ggthemes_4.2.4          kableExtra_1.3.4        plomics_0.2.0          
[16] BiocManager_1.30.20     circlize_0.4.15         RColorBrewer_1.1-3     
[19] scales_1.2.1            gridExtra_2.3           data.table_1.14.8      
[22] rmarkdown_2.21          pheatmap_1.0.12         arsenal_3.6.3          
[25] openxlsx_4.2.5.2        readxl_1.4.2            here_1.0.1             
[28] SeuratObject_4.1.3      Seurat_4.3.0            knitr_1.42             
[31] formatR_1.14           

loaded via a namespace (and not attached):
  [1] systemfonts_1.0.4      plyr_1.8.8             igraph_1.4.2          
  [4] lazyeval_0.2.2         sp_1.6-0               splines_4.2.2         
  [7] listenv_0.9.0          scattermore_0.8        usethis_2.1.6         
 [10] digest_0.6.31          htmltools_0.5.5        fansi_1.0.4           
 [13] memoise_2.0.1          magrittr_2.0.3         tensor_1.5            
 [16] cluster_2.1.4          ROCR_1.0-11            remotes_2.4.2         
 [19] tzdb_0.3.0             globals_0.16.2         matrixStats_0.63.0    
 [22] svglite_2.1.1          timechange_0.2.0       spatstat.sparse_3.0-1 
 [25] prettyunits_1.1.1      colorspace_2.1-0       rappdirs_0.3.3        
 [28] rvest_1.0.3            ggrepel_0.9.3          xfun_0.39             
 [31] crayon_1.5.2           callr_3.7.3            jsonlite_1.8.4        
 [34] progressr_0.13.0       spatstat.data_3.0-1    survival_3.5-5        
 [37] zoo_1.8-12             glue_1.6.2             polyclip_1.10-4       
 [40] gtable_0.3.3           webshot_0.5.4          leiden_0.4.3          
 [43] pkgbuild_1.4.0         future.apply_1.10.0    shape_1.4.6           
 [46] abind_1.4-5            DBI_1.1.3              spatstat.random_3.1-4 
 [49] miniUI_0.1.1.1         Rcpp_1.0.10            viridisLite_0.4.1     
 [52] xtable_1.8-4           reticulate_1.28        profvis_0.3.8         
 [55] htmlwidgets_1.6.2      httr_1.4.5             ellipsis_0.3.2        
 [58] ica_1.0-3              urlchecker_1.0.1       pkgconfig_2.0.3       
 [61] uwot_0.1.14            deldir_1.0-6           utf8_1.2.3            
 [64] tidyselect_1.2.0       rlang_1.1.1            reshape2_1.4.4        
 [67] later_1.3.0            cachem_1.0.8           munsell_0.5.0         
 [70] cellranger_1.1.0       tools_4.2.2            cli_3.6.1             
 [73] generics_0.1.3         devtools_2.4.5         ggridges_0.5.4        
 [76] evaluate_0.20          fastmap_1.1.1          yaml_2.3.7            
 [79] goftest_1.2-3          processx_3.8.1         fs_1.6.2              
 [82] fitdistrplus_1.1-11    zip_2.3.0              RANN_2.6.1            
 [85] egg_0.4.5              pbapply_1.7-0          future_1.32.0         
 [88] nlme_3.1-162           mime_0.12              xml2_1.3.4            
 [91] compiler_4.2.2         rstudioapi_0.14        curl_5.0.0            
 [94] plotly_4.10.1          png_0.1-8              spatstat.utils_3.0-2  
 [97] stringi_1.7.12         ps_1.7.5               desc_1.4.2            
[100] lattice_0.21-8         Matrix_1.5-4           vctrs_0.6.2           
[103] pillar_1.9.0           lifecycle_1.0.3        spatstat.geom_3.1-0   
[106] lmtest_0.9-40          GlobalOptions_0.1.2    RcppAnnoy_0.0.20      
[109] cowplot_1.1.1          irlba_2.3.5.1          httpuv_1.6.9          
[112] patchwork_1.1.2        R6_2.5.1               promises_1.2.0.1      
[115] KernSmooth_2.23-20     parallelly_1.35.0      sessioninfo_1.2.2     
[118] codetools_0.2-19       pkgload_1.3.2          MASS_7.3-59           
[121] rprojroot_2.0.3        withr_2.5.0            sctransform_0.3.5     
[124] parallel_4.2.2         hms_1.1.3              grid_4.2.2            
[127] Rtsne_0.16             spatstat.explore_3.1-0 shiny_1.7.4           
  1. It seems to be working for the pbmc dataset
install.packages("devtools")
devtools::install_github("satijalab/seurat-data")
library(SeuratData)

SeuratData::InstallData("pbmc3k")

obj <- SeuratData::LoadData(ds = "pbmc3k")
head(obj$nFeature_RNA)
AAACATACAACCAC AAACATTGAGCTAC AAACATTGATCAGC AAACCGTGCTTCCG AAACCGTGTATGCG AAACGCACTGGTAC 
           779           1352           1129            960            521            781 

> obj@meta.data %>% 
+     dplyr::arrange(nFeature_RNA) %>% 
+     head()
               orig.ident nCount_RNA nFeature_RNA seurat_annotations
GGCATATGGGGAGT     pbmc3k        575          212           Platelet
CTAATAGAGCTATG     pbmc3k        600          239        Naive CD4 T
ATCTCAACCTCGAA     pbmc3k        609          246        Naive CD4 T
GACGCTCTCTCTCG     pbmc3k        682          266           Platelet
ACTTTGTGCGATAC     pbmc3k        600          267               <NA>
TAGTCTTGGCTGTA     pbmc3k        652          270        Naive CD4 T

# filtered
obj.filter <- subset(obj, subset = nFeature_RNA > 500)
head(obj.filter$nFeature_RNA)

AAACATACAACCAC AAACATTGAGCTAC AAACATTGATCAGC AAACCGTGCTTCCG AAACCGTGTATGCG AAACGCACTGGTAC 
           779           1352           1129            960            521            781 

> obj.filter@meta.data %>% 
+     dplyr::arrange(nFeature_RNA) %>% 
+     head()
               orig.ident nCount_RNA nFeature_RNA seurat_annotations
GTGATGACCTGAGT     pbmc3k       1095          502                  B
AGTAATACCGAACT     pbmc3k       1236          503                  B
GGGATGGACGACAT     pbmc3k       1307          503                  B
TGGTAGTGCACTGA     pbmc3k       1110          503         CD14+ Mono
ACCACCTGTGTGCA     pbmc3k       1176          505        Naive CD4 T
CGTCAAGAAAGGTA     pbmc3k       1111          505         CD14+ Mono

Not sure what's happening

  1. I ran the subsetting on my raw merged object again (which contains mouse and human cells), and now running the code again isn't working either

all_counts <- merged_seurat@assays$RNA@counts

head(all_counts)

hs_counts <- CollapseSpeciesExpressionMatrix(
  all_counts,
  prefix = "GRCh38-",
  controls = "mm10---",
  ncontrols = 0
)

hs_genes <- hs_counts@Dimnames[[1]]
hs_genes <- as.data.frame(hs_genes)
colnames(hs_genes) <- "human_genes"

hs <- CreateSeuratObject(hs_counts, meta.data = merged_seurat@meta.data)

Screenshot 2023-05-23 at 10 03 26 AM


hs.filter <- subset(hs, subset = nFeature_RNA > 500)

hs.filter@meta.data %>% 
  dplyr::arrange(nFeature_RNA) %>% 
  head()

                      orig.ident nCount_RNA nFeature_RNA
H3_AAAGAACGTCCCACGA-1    H3-1207          0            0
H3_ACCAACAAGTTCCATG-1    H3-1207          0            0
H3_ACGATCACAGGCTTGC-1    H3-1207          0            0
H3_ACGTCCTCACTCTGCT-1    H3-1207          0            0
H3_ACTGCAACAACCGATT-1    H3-1207          0            0
H3_AGAAGCGTCGGAGTAG-1    H3-1207          0            0

Screenshot 2023-05-23 at 10 02 34 AM

Thank you!

nikita-telkar commented 1 year ago

Hi @yuhanH , just wanted to check if there might have been any updates since I last posted in May? :)

Thanks, Nikita

yuhanH commented 1 year ago

hi @nikita-telkar Sorry for the delay. Could you send me a small reproducible example? That will be much easier for me to figure out this issue. email: yhao@nygenome.org

stefanonard85 commented 1 year ago

Hello Seurat team,

I am facing the same problem using Seurat5, but it looks like it stems from the function "split", not "subset". I have only realized it now because all the code worked fine, but then I discovered the RNA dimnames slots are empty, whereas the matrix is populated, strange indeed. On the contrary, the SCT matrix has both dimnames and matrix values, as expected. Therefore, either there is a bug, or there is a sort of function in Seurat5 that relates the two, but still missing is the dimnames. Or, probably, there is a mistake on my end.

I have tested the function subset, but it does not seem to originate from it. On the contrary, after running the function split, the "dimnames" slot results empty. One consideration is that I am starting from a seurat object v3. Even if I update it, running UpdateSeuratObject() shows a warning; here is the part where slots are wrongly converted from v3 to v5, I guess.

One solution would be to export both matrix and metadata and generate a new object. I guess.

Screenshot 2023-10-04 at 1 21 19 PM

My code: DefaultAssay(SIM1) <- "RNA" SIM1[["RNA"]] <- split(SIM1[["RNA"]], f = SIM1$sample) covs <- c("mt.percent") SIM1 <- SCTransform(SIM1, vst.flavor = "v2", verbose=T, ncells =length(rownames(SIM1@meta.data)) , vars.to.regress = covs) SIM1 <- RunPCA(SIM1, npcs = 50, verbose = T) SIM1 <- IntegrateLayers(object = SIM1, method = HarmonyIntegration, orig.reduction = "pca", new.reduction = "harmony", verbose = T) SIM1 <- FindNeighbors(SIM1, dims = 1:50, reduction = "harmony") SIM1 <- FindClusters(SIM1, resolution = 0.6) SIM1 <- RunUMAP(SIM1, dims = 1:50, reduction = "harmony", reduction.name = "umap") SIM1[["RNA"]] <- JoinLayers(SIM1[["RNA"]], f = SIM1$sample) UMAPPlot(SIM1, label=T, pt.size=0)

sessionInfo() R version 4.3.1 (2023-06-16) Platform: x86_64-apple-darwin20 (64-bit) Running under: macOS Ventura 13.5

Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0

locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Rome tzcode source: internal

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] glmGamPoi_1.12.2 patchwork_1.1.3 ggplot2_3.4.3 Seurat_4.9.9.9067
[5] SeuratObject_4.9.9.9091 sp_2.1-0

loaded via a namespace (and not attached): [1] RColorBrewer_1.1-3 rstudioapi_0.15.0 jsonlite_1.8.7
[4] magrittr_2.0.3 spatstat.utils_3.0-3 zlibbioc_1.46.0
[7] vctrs_0.6.3 ROCR_1.0-11 spatstat.explore_3.2-3
[10] RCurl_1.98-1.12 S4Arrays_1.0.6 htmltools_0.5.6
[13] sctransform_0.4.0 parallelly_1.36.0 KernSmooth_2.23-22
[16] htmlwidgets_1.6.2 ica_1.0-3 plyr_1.8.9
[19] plotly_4.10.2 zoo_1.8-12 igraph_1.5.1
[22] mime_0.12 lifecycle_1.0.3 pkgconfig_2.0.3
[25] Matrix_1.6-1.1 R6_2.5.1 fastmap_1.1.1
[28] GenomeInfoDbData_1.2.10 MatrixGenerics_1.12.3 fitdistrplus_1.1-11
[31] future_1.33.0 shiny_1.7.5 digest_0.6.33
[34] colorspace_2.1-0 S4Vectors_0.38.2 tensor_1.5
[37] RSpectra_0.16-1 irlba_2.3.5.1 GenomicRanges_1.52.0
[40] progressr_0.14.0 fansi_1.0.4 spatstat.sparse_3.0-2
[43] httr_1.4.7 polyclip_1.10-6 abind_1.4-5
[46] compiler_4.3.1 remotes_2.4.2.1 withr_2.5.1
[49] fastDummies_1.7.3 MASS_7.3-60 DelayedArray_0.26.7
[52] tools_4.3.1 lmtest_0.9-40 httpuv_1.6.11
[55] future.apply_1.11.0 goftest_1.2-3 glue_1.6.2
[58] nlme_3.1-163 promises_1.2.1 grid_4.3.1
[61] Rtsne_0.16 cluster_2.1.4 reshape2_1.4.4
[64] generics_0.1.3 gtable_0.3.4 spatstat.data_3.0-1
[67] tidyr_1.3.0 data.table_1.14.8 utf8_1.2.3
[70] XVector_0.40.0 BiocGenerics_0.46.0 BPCells_0.1.0
[73] spatstat.geom_3.2-5 RcppAnnoy_0.0.21 ggrepel_0.9.3
[76] RANN_2.6.1 pillar_1.9.0 stringr_1.5.0
[79] spam_2.9-1 RcppHNSW_0.5.0 later_1.3.1
[82] splines_4.3.1 dplyr_1.1.3 lattice_0.21-9
[85] survival_3.5-7 deldir_1.0-9 tidyselect_1.2.0
[88] miniUI_0.1.1.1 pbapply_1.7-2 gridExtra_2.3
[91] IRanges_2.34.1 SummarizedExperiment_1.30.2 scattermore_1.2
[94] stats4_4.3.1 Biobase_2.60.0 matrixStats_1.0.0
[97] stringi_1.7.12 lazyeval_0.2.2 codetools_0.2-19
[100] tibble_3.2.1 cli_3.6.1 uwot_0.1.16
[103] xtable_1.8-4 reticulate_1.32.0 munsell_0.5.0
[106] Rcpp_1.0.11 GenomeInfoDb_1.36.3 globals_0.16.2
[109] spatstat.random_3.1-6 png_0.1-8 parallel_4.3.1
[112] ellipsis_0.3.2 dotCall64_1.0-2 bitops_1.0-7
[115] listenv_0.9.0 viridisLite_0.4.2 scales_1.2.1
[118] ggridges_0.5.4 crayon_1.5.2 leiden_0.4.3
[121] purrr_1.0.2 rlang_1.1.1 cowplot_1.1.1

mhkowalski commented 3 months ago

Hi,

Thank you all for your thoughts.

We haven't heard back about the initial issue and I wasn't able to reproduce this either, so I am closing now. Please open a new issue if you are continuing to have problems with the subset function.

@stefanonard85 this seems like a different problem, but it looks like the expected behavior of the new v5 assay. I'd recommend opening a new issue explaining the problem you're running into.