hemberg-lab / SC3

A tool for the unsupervised clustering of cells from single cell RNA-Seq experiments
http://bioconductor.org/packages/SC3
GNU General Public License v3.0
118 stars 55 forks source link

sc3_plot_de_genes and sc3_plot_markers problems #84

Closed Santos22903 closed 5 years ago

Santos22903 commented 5 years ago

Hi,

I was going through the SC3 manual and, once I got to looking at some biology, I got problems. Both sc3_plot_de_genes and sc3_plot_markers failed with the following messages: Error in dataset[names(de_genes), , drop = FALSE] : subscript out of bounds for sc3_plot_de_genes and Error in dataset[markers$feature_symbol, , drop = FALSE] : subscript out of bounds for sc3_plot_markers.

Feature names (the most common source of SC3-related problems, it seems) are there: head(rowData(sce)$feature_symbol) [1] "Sox17" "Mrpl15" "Lypla1" "Tcea1" "Atp6v1h" "Rb1cc1"

`> sessionInfo() R version 3.4.4 (2018-03-15) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: OS X El Capitan 10.11.6

Matrix products: default BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods
[9] base

other attached packages: [1] biomaRt_2.34.2 SC3_1.7.7 scater_1.6.3
[4] ggplot2_3.0.0 SingleCellExperiment_1.0.0 SummarizedExperiment_1.8.1 [7] DelayedArray_0.4.1 matrixStats_0.54.0 Biobase_2.38.0
[10] GenomicRanges_1.30.3 GenomeInfoDb_1.14.0 IRanges_2.12.0
[13] S4Vectors_0.16.0 BiocGenerics_0.24.0

loaded via a namespace (and not attached): [1] bitops_1.0-6 bit64_0.9-7 RColorBrewer_1.1-2
[4] doParallel_1.0.14 progress_1.2.0 httr_1.3.1
[7] doRNG_1.7.1 tools_3.4.4 R6_2.3.0
[10] KernSmooth_2.23-15 vipor_0.4.5 DBI_1.0.0
[13] lazyeval_0.2.1 colorspace_1.3-2 withr_2.1.2
[16] tidyselect_0.2.5 gridExtra_2.3 prettyunits_1.0.2
[19] curl_3.2 bit_1.1-14 compiler_3.4.4
[22] pkgmaker_0.27 caTools_1.17.1.1 scales_1.0.0
[25] mvtnorm_1.0-8 DEoptimR_1.0-8 robustbase_0.92-8
[28] stringr_1.3.1 digest_0.6.18 XVector_0.18.0
[31] rrcov_1.4-4 pkgconfig_2.0.2 htmltools_0.3.6
[34] bibtex_0.4.2 WriteXLS_4.0.0 limma_3.34.9
[37] rlang_0.2.2 RSQLite_2.1.1 shiny_1.1.0
[40] bindr_0.1.1 gtools_3.8.1 dplyr_0.7.6
[43] RCurl_1.95-4.11 magrittr_1.5 GenomeInfoDbData_1.0.0 [46] Matrix_1.2-14 Rcpp_0.12.19 ggbeeswarm_0.6.0
[49] munsell_0.5.0 viridis_0.5.1 stringi_1.2.4
[52] edgeR_3.20.9 zlibbioc_1.24.0 rhdf5_2.22.0
[55] gplots_3.0.1 plyr_1.8.4 grid_3.4.4
[58] blob_1.1.1 gdata_2.18.0 promises_1.0.1
[61] shinydashboard_0.7.0 crayon_1.3.4 lattice_0.20-35
[64] hms_0.4.2 locfit_1.5-9.1 pillar_1.3.0
[67] rjson_0.2.20 rngtools_1.3.1 codetools_0.2-15
[70] reshape2_1.4.3 XML_3.98-1.16 glue_1.3.0
[73] data.table_1.11.8 httpuv_1.4.5 foreach_1.4.4
[76] gtable_0.2.0 purrr_0.2.5 assertthat_0.2.0
[79] mime_0.6 xtable_1.8-3 e1071_1.7-0
[82] later_0.7.5 pcaPP_1.9-73 class_7.3-14
[85] viridisLite_0.3.0 pheatmap_1.0.10 tibble_1.4.2
[88] iterators_1.0.10 registry_0.5 AnnotationDbi_1.40.0
[91] beeswarm_0.2.3 memoise_1.1.0 tximport_1.6.0
[94] bindrcpp_0.2.2 cluster_2.0.7-1 ROCR_1.0-7 `

Any suggestions how I can proceed? Thank you.

Santos22903 commented 5 years ago

Hi all,

After some hacking/digging I found the problem and want to share it here; might be useful for some. Inside the sc3_plot_de_genes (I did not look into the other function, but error messages are the same), they use feature_names to select a certain number of genes (50 is hardwired I guess) and then use names from feature_names for indexing to extract certain entries from the matrix they extract from your SingleCellObject. Naturally, feature_names would better coincide with rownames, otherwise the program complains. This was the case for me. I had ensembl IDs as row names all along (as you are aware, ensembl ID - gene symbol is not a 1-1 correspondence and you might get in trouble with duplicated row names), and I converted them to gene symbols and assigned them as feature_names hoping to get plots with gene names and not ensembl IDs. That was the problem.

I went around this by assigning feature names to row names (inside the routine, they extract a matrix from your SSE, I guess, that is why this worked). Well, it worked, and I got informative plots.

As a side note, this functionality is a bit surprising and authors might want to look into the case of differing row names and feature names.

Thank you all, case closed

jfass commented 5 years ago

THANK YOU! This comment saved me.

Since many genes have no gene symbols ("HGNC Symbol" in BioMart), I generally use Ensembl ENSG ids through most of the analysis, then tack on gene symbols at the end once folks familiar with the biology will be looking at the results. The SC3 vignette says:

SC3 also requires the feature_symbol column of the rowData slot of the input SingleCellExperiment object to contain preferable feature names (genes/transcript) which will be used in the futher visualisations.

So, I assumed it was fine to have ENSG ids as rownames of the counts, but gene symbols (with many duplicates == empty strings) in rowData's "feature_symbols". But this was leading to sc3_interactive showing numbers of DE or marker genes (in the text at top right), but instead of the heatmaps, just red text saying "subscript out of bounds".

I then changed rownames( object ) and rowData( object )$feature_symbol to ~paste( ENSGs, HGNC.Symbols, sep=" ... " ) ... so they're both unique, and identical to each other ... this fixed the DE gene / Marker gene heatmap issues.

Thanks again, @Santos22903.