mathewchamberlain / SignacX

Signac
GNU General Public License v3.0

Using small number of markers (n= 47) returns error #14

Closed yi6kim closed 2 years ago

yi6kim commented 2 years ago

Hello,

I have a few custom datasets, each consisting of 47 markers x 1,000 to 10,000 cells, on which I want to run supervised cell annotation.

Initially, running Signac on my dataset gave me some errors, so I wanted to figure out what the issue was. I wondered whether it stems from the small number of rows (47 markers), perhaps causing some numerical or linear-algebra problem.

So I tried to size down the given "pbmc" dataset from the vignette (https://cran.r-project.org/web/packages/SignacX/vignettes/signac-Seurat_CITE-seq.html) to contain the first 47 markers only. This process is shown in the code below. (FYI: the original "pbmc" dataset from the vignette contains 33538 markers x 7865 cells.)

After running SignacX on this "smaller" pbmc dataset, I noticed that the same types of errors are produced. Specifically, the errors appear after the "SCTransform" and "Signac" function calls, and I have included the exact error messages as comments in the code below. For SCTransform, I eventually skipped this step and instead used the "NormalizeData, FindVariableFeatures and ScaleData" sequence, which did not produce any error.
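One plausible reading of the SCTransform error, not confirmed in this thread: if `log_umi` is computed as log10 of each cell's total count (an assumption about sctransform's internals), then any cell with zero counts across the 47 retained genes yields `-Inf`. The base-R sketch below uses a made-up toy matrix, not the pbmc data, to show the effect and one way to drop such cells.

```r
# Toy counts matrix (genes x cells); values are invented for illustration.
counts <- matrix(
  c(0, 2, 0,
    0, 1, 0,
    0, 0, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("gene", 1:3), paste0("cell", 1:3))
)

# Total UMIs per cell; cell1 has no counts in any retained gene.
umi_per_cell <- colSums(counts)

# log10(0) is -Inf, which is the kind of value the error message complains about.
log_umi <- log10(umi_per_cell)

# Identify empty cells and keep only cells with at least one count.
empty_cells <- names(umi_per_cell)[umi_per_cell == 0]
counts_filtered <- counts[, umi_per_cell > 0, drop = FALSE]
```

For a sparse matrix such as the one returned by `Read10X_h5`, `Matrix::colSums` would be the analogous call. On a Seurat object, something like `subset(pbmc, subset = nCount_RNA > 0)` should have a similar effect, though whether that alone is enough for SCTransform to run on 47 genes is untested here.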

On a related note, I tried adjusting parameters such as "npcs", "nfeatures.print", and "dims" in the "RunPCA", "RunUMAP" and "FindNeighbors" functions, wondering if these could be the issue, but to no avail. However, I'm not too familiar with the parameters of these functions, so my settings may still be wrong.

What could the issue be here, and how can I resolve these errors?

Thank you!

library(Seurat)
library(SignacX)
# Minimally reproducible example
E = Read10X_h5(filename = "fls/pbmc_10k_protein_v3_filtered_feature_bc_matrix.h5")
E.small <- E$`Gene Expression`[c(1:47),]

pbmc <- CreateSeuratObject(counts = E.small, project = "pbmc")

#pbmc <- SCTransform(pbmc) #, variable.features.n = 30

# Error message # 1:
#Calculating cell attributes from input UMI matrix: log_umi
#Error in make_cell_attr(umi, cell_attr, latent_var, batch_var, latent_var_nonreg,  : 
#                          cell attribute "log_umi" contains NA, NaN, or infinite value

pbmc <- NormalizeData(pbmc)
pbmc <- FindVariableFeatures(pbmc)
pbmc <- ScaleData(pbmc)

pbmc <- RunPCA(pbmc, npcs=10, nfeatures.print = 10)
pbmc <- RunUMAP(pbmc, dims = 1:10)
pbmc <- FindNeighbors(pbmc, dims = 1:10)

labels <- Signac(pbmc, verbose = TRUE)

# Error message # 2:

#..........  Entry in Signac 
#..........  Running Signac on Seurat object :
#  nrow = 47
#  ncol = 7865
# |                                  |   0%, ETA NA

# Error in order(rownames(Z)) : argument 1 is not a vector

sessionInfo()

R version 4.1.3 (2022-03-10) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Monterey 12.2.1

Matrix products: default LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] pbmc3k.SeuratData_3.1.4 SeuratData_0.2.2 SignacX_2.2.5 patchwork_1.1.1 [5] ggplot2_3.3.6 SeuratDisk_0.0.0.9020 sp_1.4-7 SeuratObject_4.1.0 [9] Seurat_4.1.1

loaded via a namespace (and not attached): [1] Rtsne_0.16 colorspace_2.0-3 deldir_1.0-6 ellipsis_0.3.2 [5] ggridges_0.5.3 rprojroot_2.0.3 fs_1.5.2 rstudioapi_0.13 [9] spatstat.data_2.2-0 farver_2.1.0 leiden_0.4.2 listenv_0.8.0 [13] remotes_2.4.2 ggrepel_0.9.1 bit64_4.0.5 RSpectra_0.16-1 [17] fansi_1.0.3 codetools_0.2-18 splines_4.1.3 cachem_1.0.6 [21] pkgload_1.2.4 polyclip_1.10-0 jsonlite_1.8.0 ica_1.0-2 [25] cluster_2.1.3 png_0.1-7 rgeos_0.5-9 uwot_0.1.11 [29] shiny_1.7.1 sctransform_0.3.3 spatstat.sparse_2.1-1 compiler_4.1.3 [33] httr_1.4.3 Matrix_1.4-1 fastmap_1.1.0 lazyeval_0.2.2 [37] cli_3.3.0 later_1.3.0 prettyunits_1.1.1 htmltools_0.5.2 [41] tools_4.1.3 igraph_1.3.1 gtable_0.3.0 glue_1.6.2 [45] RANN_2.6.1 reshape2_1.4.4 dplyr_1.0.9 rappdirs_0.3.3 [49] Rcpp_1.0.8.3 scattermore_0.8 vctrs_0.4.1 nlme_3.1-157 [53] progressr_0.10.0 lmtest_0.9-40 spatstat.random_2.2-0 stringr_1.4.0 [57] brio_1.1.3 ps_1.7.0 globals_0.15.0 testthat_3.1.4 [61] mime_0.12 miniUI_0.1.1.1 lifecycle_1.0.1 irlba_2.3.5 [65] devtools_2.4.3 goftest_1.2-3 future_1.26.1 MASS_7.3-57 [69] zoo_1.8-10 scales_1.2.0 spatstat.core_2.4-4 promises_1.2.0.1 [73] spatstat.utils_2.3-1 parallel_4.1.3 RColorBrewer_1.1-3 curl_4.3.2 [77] memoise_2.0.1 reticulate_1.25 pbapply_1.5-0 gridExtra_2.3 [81] rpart_4.1.16 stringi_1.7.6 desc_1.4.1 pkgbuild_1.3.1 [85] rlang_1.0.2 pkgconfig_2.0.3 matrixStats_0.62.0 lattice_0.20-45 [89] ROCR_1.0-11 purrr_0.3.4 tensor_1.5 htmlwidgets_1.5.4 [93] labeling_0.4.2 processx_3.5.3 cowplot_1.1.1 bit_4.0.4 [97] tidyselect_1.1.2 parallelly_1.31.1 RcppAnnoy_0.0.19 plyr_1.8.7 [101] magrittr_2.0.3 R6_2.5.1 generics_0.1.2 pillar_1.7.0 [105] withr_2.5.0 mgcv_1.8-40 fitdistrplus_1.1-8 survival_3.3-1 [109] abind_1.4-5 tibble_3.1.7 future.apply_1.9.0 crayon_1.5.1 [113] hdf5r_1.3.5 KernSmooth_2.23-20 utf8_1.2.2 spatstat.geom_2.4-0 [117] plotly_4.10.0 usethis_2.1.6 grid_4.1.3 data.table_1.14.2 [121] callr_3.7.0 digest_0.6.29 pbmcapply_1.5.1 xtable_1.8-4 [125] tidyr_1.2.0 httpuv_1.6.5 
munsell_0.5.0 viridisLite_0.4.0 [129] sessioninfo_1.2.2

Warning messages:
1: ggrepel: 5 unlabeled data points (too many overlaps). Consider increasing max.overlaps
2: ggrepel: 5 unlabeled data points (too many overlaps). Consider increasing max.overlaps
3: ggrepel: 5 unlabeled data points (too many overlaps). Consider increasing max.overlaps
4: ggrepel: 5 unlabeled data points (too many overlaps). Consider increasing max.overlaps

mathewchamberlain commented 2 years ago

Hi @e-junekim,

Thanks for posting this issue. SignacX trains neural networks with cell type markers that are based on genome-wide RNA-sequencing. The belief is that 1, 2 or 3 genes are probably too few to identify nuanced cell types, but using hundreds of genes together with classifiers should work.

So the problem here is that your data have too few genes for Signac to classify the cell types: there are too few features for the models to work with. This method was intended to be used with genome-wide panels, not with very small subsets of genes.
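For reference, here is a small base-R sketch of what the reported message `Error in order(rownames(Z)) : argument 1 is not a vector` means on its own terms. It shows only that `order()` fails this way when `rownames()` returns `NULL`. What `Z` is inside Signac, and why it would lack row names, is not stated in this thread, so the matrix `m` below is a hypothetical stand-in.

```r
# Hypothetical stand-in for Z: a matrix with no row names.
m <- matrix(1:4, nrow = 2)

# rownames() returns NULL when a matrix has no row names.
has_no_rownames <- is.null(rownames(m))

# order(NULL) raises an error; capture its message instead of stopping.
msg <- tryCatch(
  {
    order(rownames(m))
    NA_character_
  },
  error = function(e) conditionMessage(e)
)

print(has_no_rownames)
print(msg)
```

A check such as `is.null(rownames(x))` on an intermediate matrix can help narrow down where feature names are lost, for example when no input features remain after matching against a reference.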

Hope this helps!