satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat
Other
2.25k stars 904 forks source link

base::pmatch() used in FetchData() may cause some problems when vars have duplicates #2882

Closed liu-xingliang closed 1 year ago

liu-xingliang commented 4 years ago

My R and Seurat session info:

Seurat_3.1.4.9020
R: 3.6.3

The wired behaviour of base::pmatch():

# 1. x contains duplicates
# 2. x and table are same
# 3. length(x) > 100
> pmatch(x = c(rep("TP53", 99), "CD4"), table = c(rep("TP53", 99), "CD4"))
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
 [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
 [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
 [91]  91  92  93  94  95  96  97  98  99 100
> pmatch(x = c(rep("TP53", 100), "CD4"), table = c(rep("TP53", 100), "CD4"))
  [1]   1  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
 [19]  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
 [37]  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
 [55]  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
 [73]  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
 [91]  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA 101

Usage of base::pmatch() in FetchData():

https://github.com/satijalab/seurat/blob/b51801bc4b1a66aed5456473c9fe0be884994c93/R/objects.R#L1110-L1113

This may cause an error when vars argument of FetchData() contains more than 100 elements (usually when user want to retrieve expression value from default assay) and it contains duplicates.

> dim(FetchData(data, vars=c(rep("TP53", 99), "CD4")))
[1] 2700  100
> dim(FetchData(data, vars=c(rep("TP53", 100), "CD4")))
Error in names(x) <- value : 
  'names' attribute [101] must be the same length as the vector [2]

Found at least one other function (DotPlot()) may be affected by this issue #2879 .

bless~

mojaveazure commented 4 years ago

Please provide an reproducible example call and dataset, subset of a dataset, or example using one of the datasets in SeuratData

liu-xingliang commented 4 years ago
library(Seurat)
library(SeuratData)

data("pbmc3k")
> pbmc3k
An object of class Seurat 
13714 features across 2700 samples within 1 assay 
Active assay: RNA (13714 features, 0 variable features)

pbmc3k <- NormalizeData(pbmc3k, normalization.method = "LogNormalize", scale.factor = 10000)
pbmc3k <- ScaleData(pbmc3k, features = rownames(pbmc3k))

pbmc3k <- FindVariableFeatures(pbmc3k, selection.method = "vst", nfeatures = 2000)
first_98_vgs <- VariableFeatures(pbmc3k)[1:98]
first_99_vgs <- VariableFeatures(pbmc3k)[1:99]

fake_gl_noerr <- c(first_98_vgs, first_98_vgs[1], first_98_vgs[1]) # with duplicates, but size of gene list not exceeding 100
fake_gl_err <- c(first_99_vgs, first_99_vgs[1], first_99_vgs[1]) # with duplicates, size of gene list exceeding 100

pbmc3k <- RunPCA(pbmc3k, features = VariableFeatures(object = pbmc3k))
pbmc3k <- FindNeighbors(pbmc3k, dims = 1:10)
pbmc3k <- FindClusters(pbmc3k, resolution = 0.5)

# test on FetchData()

head(FetchData(pbmc3k, vars=fake_gl_noerr))
> head(FetchData(pbmc3k, vars=fake_gl_noerr))
                   PPBP   S100A9 IGLL5      LYZ     GNLY      FTL PF4     FTH1
AAACATACAACCAC 0.000000 0.000000     0 1.635873 0.000000 3.924095   0 2.226555
AAACATTGAGCTAC 0.000000 0.000000     0 1.962726 0.000000 3.063189   0 3.574384
AAACATTGATCAGC 0.000000 0.000000     0 1.995416 1.429744 3.274161   0 3.582238
AAACCGTGCTTCCG 1.566387 3.838845     0 4.521175 0.000000 5.692271   0 5.919842
AAACCGTGTATGCG 0.000000 0.000000     0 0.000000 3.453545 5.218589   0 3.453545
AAACGCACTGGTAC 0.000000 0.000000     0 1.726902 0.000000 3.948457   0 2.699321
               FCER1A GNG11   S100A8  HLA-DRA     CD74 CLU     GZMB     NKG7
AAACATACAACCAC      0     0 0.000000 1.635873 1.635873   0 0.000000 0.000000
AAACATTGAGCTAC      0     0 0.000000 4.309784 4.654131   0 0.000000 1.111715
AAACATTGATCAGC      0     0 0.000000 0.000000 1.995416   0 0.000000 0.000000
AAACCGTGCTTCCG      0     0 2.515108 4.521175 3.989917   0 1.566387 1.566387
AAACCGTGTATGCG      0     0 0.000000 2.416278 2.416278   0 4.815828 4.729553
AAACGCACTGGTAC      0     0 0.000000 0.000000 0.000000   0 0.000000 0.000000
               C1QA     CST3     CCL4 HLA-DPB1 SDPR C1QB AL928768.3 TYMS TUBB1
AAACATACAACCAC    0 0.000000 0.000000 0.000000    0    0          0    0     0
AAACATTGAGCTAC    0 1.111715 0.000000 3.574384    0    0          0    0     0
AAACATTGATCAGC    0 1.429744 0.000000 0.000000    0    0          0    0     0
AAACCGTGCTTCCG    0 4.435152 0.000000 3.444082    0    0          0    0     0
AAACCGTGTATGCG    0 0.000000 3.733287 0.000000    0    0          0    0     0
AAACGCACTGGTAC    0 0.000000 0.000000 1.726902    0    0          0    0     0
               RRM2 GUSB STMN1     MZB1 C10orf32 GP9 IGJ LYPD2     HAGH TK1
AAACATACAACCAC    0    0     0 0.000000 0.000000   0   0     0 0.000000   0
AAACATTGAGCTAC    0    0     0 1.111715 1.625141   0   0     0 0.000000   0
AAACATTGATCAGC    0    0     0 0.000000 0.000000   0   0     0 0.000000   0
AAACCGTGCTTCCG    0    0     0 0.000000 0.000000   0   0     0 0.000000   0
AAACCGTGTATGCG    0    0     0 0.000000 0.000000   0   0     0 2.416278   0
AAACGCACTGGTAC    0    0     0 0.000000 0.000000   0   0     0 0.000000   0
                    FUS DSCR3     LYAR SLA SMIM7 KIF5B HBP1    PPP6C     TMX2
AAACATACAACCAC 1.635873     0 2.226555   0     0     0    0 0.000000 0.000000
AAACATTGAGCTAC 0.000000     0 1.111715   0     0     0    0 0.000000 0.000000
AAACATTGATCAGC 1.429744     0 0.000000   0     0     0    0 1.429744 1.429744
AAACCGTGCTTCCG 1.566387     0 0.000000   0     0     0    0 0.000000 0.000000
AAACCGTGTATGCG 0.000000     0 0.000000   0     0     0    0 0.000000 0.000000
AAACGCACTGGTAC 0.000000     0 0.000000   0     0     0    0 0.000000 0.000000
               HBA1 INTS12 HLA-DPA1 APOBEC3B KIAA0101 CA2     CCL3 LILRA4
AAACATACAACCAC    0      0 0.000000        0        0   0 0.000000      0
AAACATTGAGCTAC    0      0 3.314709        0        0   0 0.000000      0
AAACATTGATCAGC    0      0 0.000000        0        0   0 0.000000      0
AAACCGTGCTTCCG    0      0 2.782369        0        0   0 0.000000      0
AAACCGTGTATGCG    0      0 3.063772        0        0   0 3.453545      0
AAACGCACTGGTAC    0      0 2.326928        0        0   0 0.000000      0
                  PRDX1 GIMAP5 HLA-DRB1    CD79A SERPINF1 MYL9    PRKCB
AAACATACAACCAC 0.000000      0 0.000000 0.000000        0    0 0.000000
AAACATTGAGCTAC 0.000000      0 3.063189 1.962726        0    0 1.625141
AAACATTGATCAGC 0.000000      0 0.000000 0.000000        0    0 0.000000
AAACCGTGCTTCCG 1.566387      0 3.753788 0.000000        0    0 0.000000
AAACCGTGTATGCG 0.000000      0 0.000000 0.000000        0    0 2.416278
AAACGCACTGGTAC 2.326928      0 1.726902 0.000000        0    0 0.000000
               C16orf13     LST1 MLLT11 BIRC5 TNFRSF17     CCL5 NDUFA12 THOC7
AAACATACAACCAC        0 0.000000      0     0        0 4.322952       0     0
AAACATTGAGCTAC        0 0.000000      0     0        0 1.111715       0     0
AAACATTGATCAGC        0 0.000000      0     0        0 0.000000       0     0
AAACCGTGCTTCCG        0 2.993057      0     0        0 0.000000       0     0
AAACCGTGTATGCG        0 0.000000      0     0        0 3.453545       0     0
AAACGCACTGGTAC        0 0.000000      0     0        0 0.000000       0     0
               RABL6    SIVA1 CLDN5 STK38     SDHB TREML1 UBLCP1 IFI27 SPARC
AAACATACAACCAC     0 0.000000     0     0 0.000000      0      0     0     0
AAACATTGAGCTAC     0 0.000000     0     0 1.111715      0      0     0     0
AAACATTGATCAGC     0 0.000000     0     0 0.000000      0      0     0     0
AAACCGTGCTTCCG     0 1.566387     0     0 0.000000      0      0     0     0
AAACCGTGTATGCG     0 0.000000     0     0 0.000000      0      0     0     0
AAACGCACTGGTAC     0 0.000000     0     0 0.000000      0      0     0     0
               HIST1H2AC PTGDS  C14orf1     RALY IL8     PRF1 CLEC2B   FCGR3A
AAACATACAACCAC         0     0 0.000000 0.000000   0 0.000000      0 0.000000
AAACATTGAGCTAC         0     0 0.000000 1.111715   0 0.000000      0 0.000000
AAACATTGATCAGC         0     0 1.429744 0.000000   0 0.000000      0 0.000000
AAACCGTGCTTCCG         0     0 0.000000 0.000000   0 0.000000      0 1.566387
AAACCGTGTATGCG         0     0 0.000000 0.000000   0 3.453545      0 0.000000
AAACGCACTGGTAC         0     0 0.000000 0.000000   0 0.000000      0 0.000000
                TNFSF10    MED30     ODC1 STAMBP GMPR GZMH CWC15 DNASE1L3
AAACATACAACCAC 0.000000 0.000000 0.000000      0    0    0     0        0
AAACATTGAGCTAC 0.000000 1.111715 1.625141      0    0    0     0        0
AAACATTGATCAGC 1.429744 0.000000 0.000000      0    0    0     0        0
AAACCGTGCTTCCG 0.000000 0.000000 0.000000      0    0    0     0        0
AAACCGTGTATGCG 0.000000 0.000000 0.000000      0    0    0     0        0
AAACGCACTGGTAC 0.000000 0.000000 0.000000      0    0    0     0        0
                TMEM208   FGFBP2   TYROBP CXCL3     IDH2     ACTB     PPBP
AAACATACAACCAC 0.000000 0.000000 0.000000     0 0.000000 4.266573 0.000000
AAACATTGAGCTAC 1.111715 0.000000 0.000000     0 0.000000 4.062302 0.000000
AAACATTGATCAGC 1.429744 1.429744 0.000000     0 1.429744 4.567768 0.000000
AAACCGTGCTTCCG 0.000000 0.000000 3.838845     0 0.000000 3.989917 1.566387
AAACCGTGTATGCG 0.000000 3.063772 2.416278     0 0.000000 5.371966 0.000000
AAACGCACTGGTAC 1.726902 0.000000 0.000000     0 0.000000 4.376946 0.000000
                   PPBP
AAACATACAACCAC 0.000000
AAACATTGAGCTAC 0.000000
AAACATTGATCAGC 0.000000
AAACCGTGCTTCCG 1.566387
AAACCGTGTATGCG 0.000000
AAACGCACTGGTAC 0.000000

head(FetchData(pbmc3k, vars=fake_gl_err))
> head(FetchData(pbmc3k, vars=fake_gl_err))
Error in names(x) <- value : 
  'names' attribute [101] must be the same length as the vector [99]

# test on DotPlot() which call FetchData()

DotPlot(pbmc3k, features = fake_gl_err)
> DotPlot(pbmc3k, features = fake_gl_err)
Error in names(x) <- value : 
  'names' attribute [101] must be the same length as the vector [99]

# successful run after deduplication
DotPlot(pbmc3k, features = unique(fake_gl_err)) + RotatedAxis()

tmp

> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /home/anaconda3/envs/R363/lib/libmkl_rt.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] pbmc3k.SeuratData_3.0.0 SeuratData_0.2.1        Seurat_3.1.4.9020      

loaded via a namespace (and not attached):
  [1] nlme_3.1-145        tsne_0.1-3          fs_1.4.1           
  [4] bitops_1.0-6        usethis_1.5.1       devtools_2.2.2     
  [7] RcppAnnoy_0.0.16    RColorBrewer_1.1-2  httr_1.4.1         
 [10] rprojroot_1.3-2     backports_1.1.6     sctransform_0.2.1  
 [13] tools_3.6.3         R6_2.4.1            irlba_2.3.3        
 [16] KernSmooth_2.23-16  uwot_0.1.8          lazyeval_0.2.2     
 [19] colorspace_1.4-1    withr_2.1.2         npsurv_0.4-0       
 [22] tidyselect_1.0.0    gridExtra_2.3       prettyunits_1.1.1  
 [25] processx_3.4.2      curl_4.3            compiler_3.6.3     
 [28] cli_2.0.2           desc_1.2.0          plotly_4.9.2.1     
 [31] labeling_0.3        caTools_1.18.0      scales_1.1.0       
 [34] lmtest_0.9-37       ggridges_0.5.2      callr_3.4.3        
 [37] pbapply_1.4-2       rappdirs_0.3.1      stringr_1.4.0      
 [40] digest_0.6.25       pkgconfig_2.0.3     htmltools_0.4.0    
 [43] sessioninfo_1.1.1   htmlwidgets_1.5.1   rlang_0.4.5        
 [46] farver_2.0.3        zoo_1.8-7           jsonlite_1.6.1     
 [49] ica_1.0-2           gtools_3.8.2        dplyr_0.8.5        
 [52] magrittr_1.5        patchwork_1.0.0     Matrix_1.2-18      
 [55] Rcpp_1.0.4          munsell_0.5.0       fansi_0.4.1        
 [58] ape_5.3             reticulate_1.15     lifecycle_0.2.0    
 [61] stringi_1.4.6       MASS_7.3-51.5       pkgbuild_1.0.6     
 [64] gplots_3.0.3        Rtsne_0.15          plyr_1.8.6         
 [67] grid_3.6.3          parallel_3.6.3      gdata_2.18.0       
 [70] listenv_0.8.0       ggrepel_0.8.2       crayon_1.3.4       
 [73] lattice_0.20-41     cowplot_1.0.0       splines_3.6.3      
 [76] ps_1.3.2            pillar_1.4.3        igraph_1.2.5       
 [79] pkgload_1.0.2       future.apply_1.4.0  reshape2_1.4.3     
 [82] codetools_0.2-16    leiden_0.3.3        glue_1.4.0         
 [85] lsei_1.2-0          remotes_2.1.1       data.table_1.12.8  
 [88] png_0.1-7           vctrs_0.2.4         testthat_2.3.2     
 [91] gtable_0.3.0        RANN_2.6.1          purrr_0.3.3        
 [94] tidyr_1.0.2         future_1.16.0       assertthat_0.2.1   
 [97] ggplot2_3.3.0       rsvd_1.0.3          survival_3.1-11    
[100] viridisLite_0.3.0   tibble_3.0.0        memoise_1.1.0      
[103] cluster_2.1.0       globals_0.12.5      fitdistrplus_1.0-14
[106] ellipsis_0.3.0      ROCR_1.0-7      
rsatija commented 1 year ago

Closing as I believe we no longer use pmap in FetchData in Seurat v5