kharchenkolab / gpsFISH

Optimization of gene panels for targeted spatial transcriptomics
Other
6 stars 1 forks source link

running gpsFISH without spatial count and spatial cluster #18

Open justin512lee opened 1 year ago

justin512lee commented 1 year ago

Hi,

Is it possible to run gpsFISH without spatial count and spatial cluster datasets (or can it run with spatial count and cluster datasets provided in the tutorial)? I am working with the datasets from this paper and have sc_counts and sc_clusters. Any help would be appreciated!

YidaZhang0628 commented 1 year ago

Can you elaborate a bit more on what you plan to do? If you want to design a gene panel for MERFISH/osmFISH/DARTFISH, you can run gpsFISH with only single cell RNA-seq because we have already trained Bayesian models for those three platforms. If you are designing gene panels for other platforms and you don't have spatial data, you can run gpsFISH without simulation. This way, gene panels will be evaluated using scRNA-seq data only.

justin512lee commented 1 year ago

Hi @YidaZhang0628. Thank you for your quick response. I want to design a gene panel for MERFISH. So I need to run the Platform effect estimation tutorial to get simulation_params and then complete the Gene Panel Selection tutorial using my own scRNA-seq data. Is this correct?

YidaZhang0628 commented 1 year ago

If you have scRNA-seq data and MERFISH data from your target tissue, then that will be the suggestion. But if you don't, we have a pre-trained Bayesian model for MERFISH. You can find it here. With the pre-trained model, you can skip the platform effect estimation step and directly run gene panel selection.

justin512lee commented 1 year ago

Thank you for your help. I was able to run everything except the gpsFISH_optimize function. I'm getting this message: Error in plogis(x) : Non-numeric argument to mathematical function

Thanks again for your help.

pakiessling commented 1 year ago

@YidaZhang0628 do you recommend using the pre-trained models even for tissues not in the training dataset (i.e. non mouse hippocampus for MERFISH)? Do the encoded platform effects generalize well?

YidaZhang0628 commented 1 year ago

@justin512lee Can you provide a bit more information? Are you getting this error when running the tutorial or on your own data? The first thing I would check is to make sure all the input files have the same data type and format as in the tutorial. If the error is still there, can you provide your code and a small dataset so that I can reproduce the error?

YidaZhang0628 commented 1 year ago

@pakiessling This is something we are planning to check but right now, we don't know how generalizable the platform effects are for the same genes in the same platform but across different tissues. Therefore, if you can find paired scRNA-seq data and spatial data in your target tissue, it is still recommended to train the Bayesian model first.

Boehmin commented 7 months ago

I have snRNA-seq data and no spatial data so my use case falls into a similar category as above I believe. Until now everything ran reasonably smoothly following the tutorial. I am stuck at the gpsFISH_optimize step.

The below is my error message regarding NA in the probability vector.

GA = gpsFISH_optimize(earlyterm = 10,
+                       converge.cutoff = 0.01,
+                       n = dim(my_sc_count)[1],
+                       k = panel_size,
+                       ngen = 10,
+                       popsize = pop_size,
+                       verbose = 1,
+                       cluster = 40, # number of cores
+                       initpop = initpop,
+                       method = "NaiveBayes",
+                       metric = "Accuracy",
+                       nCV = 5,
+                       rate = 1,
+                       cluster_size_max = 50,
+                       cluster_size_min = 30,
+                       two_step_sampling_type = c("Subsampling_by_cluster", "Simulation"),
+                       simulation_model = "ZINB",
+                       sample_new_levels = "old_levels",
+                       use_average_cluster_profiles = FALSE,
+                       save.intermediate = TRUE,
+                       full_count_table = as.data.frame(t(my_sc_count)),
+                       cell_cluster_conversion = my_sc_cluster,       
+                       relative_prop = relative_prop,
+                       simulation_parameter = sim_parameters,
+                       gene2include.id = gene2include.id,
+                       gene.weight = gene.weight,
+                       weight_penalty = weight_penalty
+ )
Initial population best OF value =  0.5831973 . Population diversity =  75.98061 
Error in sample.int(length(x), size, replace, prob) : 
  NA in probability vector

I use the same values for pop_size, panel_size etc. as in the tutorial. The only difference is that I use our data for sc_count and sc_cluster. I tried to make sure the data is formatted as expected throughout.

Have you encountered this error previously? Does this have to do with the sim_parameters? Or my data? Thanks for your help in advance!

YidaZhang0628 commented 7 months ago

I have snRNA-seq data and no spatial data so my use case falls into a similar category as above I believe. Until now everything ran reasonably smoothly following the tutorial. I am stuck at the gpsFISH_optimize step.

The below is my error message regarding NA in the probability vector.

GA = gpsFISH_optimize(earlyterm = 10,
+                       converge.cutoff = 0.01,
+                       n = dim(my_sc_count)[1],
+                       k = panel_size,
+                       ngen = 10,
+                       popsize = pop_size,
+                       verbose = 1,
+                       cluster = 40, # number of cores
+                       initpop = initpop,
+                       method = "NaiveBayes",
+                       metric = "Accuracy",
+                       nCV = 5,
+                       rate = 1,
+                       cluster_size_max = 50,
+                       cluster_size_min = 30,
+                       two_step_sampling_type = c("Subsampling_by_cluster", "Simulation"),
+                       simulation_model = "ZINB",
+                       sample_new_levels = "old_levels",
+                       use_average_cluster_profiles = FALSE,
+                       save.intermediate = TRUE,
+                       full_count_table = as.data.frame(t(my_sc_count)),
+                       cell_cluster_conversion = my_sc_cluster,       
+                       relative_prop = relative_prop,
+                       simulation_parameter = sim_parameters,
+                       gene2include.id = gene2include.id,
+                       gene.weight = gene.weight,
+                       weight_penalty = weight_penalty
+ )
Initial population best OF value =  0.5831973 . Population diversity =  75.98061 
Error in sample.int(length(x), size, replace, prob) : 
  NA in probability vector

I use the same values for pop_size, panel_size etc. as in the tutorial. The only difference is that I use our data for sc_count and sc_cluster. I tried to make sure the data is formatted as expected throughout.

Have you encountered this error previously? Does this have to do with the sim_parameters? Or my data? Thanks for your help in advance!

Do you mind providing me the dimensions of your scRNA-seq data and also the number of cell types?

Boehmin commented 7 months ago

Thanks for such a quick reply! These are my dimensions/nr of cells/nuclei, I have 18 class_labels:

> dim(my_sc_cluster)
[1] 81389     2
> dim(my_sc_count)
[1]  1356 81389

EDIT: I found the error!

  1. I already had to deal with NAs when normalising over the sum(weight.list), I did the following: weight.list = weight.list / sum(weight.list, na.rm = TRUE)
  2. I was still left with one NA that I did not find until just now (I assumed the above trick took care of it, my bad). I had followed the tutorial by censoring the probe count, using the probe_count file that came with the gpsFISH package. The NA appeared only after the normalisation. I now replaced the value with a value that occurred most regularly for other targets and was able to run the rest of the code without any issues at all! Thank you.
simomounir commented 7 months ago

I think this can go under the same issue above:

I have my own single-cell refence data (breast cancer dataset, ~70k cell) and I have my spatial data based on sm-FISH (from breast cancer tissues as well) with a gene expression matrix per segmented cell. How can I use gpsFISH to perform label transfer from reference single-cell data to spatial data? I see some confusion matrices, but they were calculated from predictions of cell types of single-cell data right? Should I also skip the Platform-estimation phase?

If there is any devised method to follow, let me know. Thanks in advance!

YidaZhang0628 commented 7 months ago

Thanks for such a quick reply! These are my dimensions/nr of cells/nuclei, I have 18 class_labels:

> dim(my_sc_cluster)
[1] 81389     2
> dim(my_sc_count)
[1]  1356 81389

EDIT: I found the error!

  1. I already had to deal with NAs when normalising over the sum(weight.list), I did the following: weight.list = weight.list / sum(weight.list, na.rm = TRUE)
  2. I was still left with one NA that I did not find until just now (I assumed the above trick took care of it, my bad). I had followed the tutorial by censoring the probe count, using the probe_count file that came with the gpsFISH package. The NA appeared only after the normalisation. I now replaced the value with a value that occurred most regularly for other targets and was able to run the rest of the code without any issues at all! Thank you.

Thank you for letting me know! I was on vacation for the past few days and couldn't get back to you. I am happy to know that the problem has been solved. Please don't hesitate to reach out if you have any further questions.

One thing I noticed is that you mentioned you are using the probe_count file in the gpsFISH package. That information is specific to DARTFISH. Therefore, if you are designing your gene panel for a different technology (like MERFISH), that probe_count file is likely to be non-informative because the probe design mechanisms used by different technologies are different. I just want to point this out in case this information is relevant.

simomounir commented 7 months ago

Hi there,

Let me know if there is any update on this https://github.com/kharchenkolab/gpsFISH/issues/18#issuecomment-1927066809

Thanks a lot!

YidaZhang0628 commented 7 months ago

I think this can go under the same issue above:

I have my own single-cell refence data (breast cancer dataset, ~70k cell) and I have my spatial data based on sm-FISH (from breast cancer tissues as well) with a gene expression matrix per segmented cell. How can I use gpsFISH to perform label transfer from reference single-cell data to spatial data? I see some confusion matrices, but they were calculated from predictions of cell types of single-cell data right? Should I also skip the Platform-estimation phase?

If there is any devised method to follow, let me know. Thanks in advance!

@simomounir To make sure I understand your goal correctly, you have scRNA-seq data that is already annotated and you have smFISH data that is not annotated. You want to annotate the smFISH data using your scRNA-seq data, is that correct? If yes, then there are many methods for this task, for example, label transfer from Seurat. What gpsFISH does is to select gene panels if you want to design a new spatial transcriptomics experiment. Both the platform effect estimation and gene panel selection steps rely on data with known labels. For example, the platform effect estimation step requires both input scRNA-seq and spatial data to be annotated. The gene panel selection step requires the input scRNA-seq data to be annotated. Let me know if this addresses your question.

simomounir commented 7 months ago

I think this can go under the same issue above: I have my own single-cell refence data (breast cancer dataset, ~70k cell) and I have my spatial data based on sm-FISH (from breast cancer tissues as well) with a gene expression matrix per segmented cell. How can I use gpsFISH to perform label transfer from reference single-cell data to spatial data? I see some confusion matrices, but they were calculated from predictions of cell types of single-cell data right? Should I also skip the Platform-estimation phase? If there is any devised method to follow, let me know. Thanks in advance!

@simomounir To make sure I understand your goal correctly, you have scRNA-seq data that is already annotated and you have smFISH data that is not annotated. You want to annotate the smFISH data using your scRNA-seq data, is that correct? If yes, then there are many methods for this task, for example, label transfer from Seurat. What gpsFISH does is to select gene panels if you want to design a new spatial transcriptomics experiment. Both the platform effect estimation and gene panel selection steps rely on data with known labels. For example, the platform effect estimation step requires both input scRNA-seq and spatial data to be annotated. The gene panel selection step requires the input scRNA-seq data to be annotated. Let me know if this addresses your question.

Yes indeed, that addresses my question perfectly. I was trying to see if I can harvest gpsFISH functionalities to predict unknown cell labels. I have a labeled smFISH dataset though, I was just trying to test the prediction module.

Thanks a lot though for your prompt answer! I will be using gpsFISH for designing some experiments.

evaknichols commented 6 months ago

Hello! Thank you for producing this tool, which I think has great potential for reviving an earlier thread in my research. I'm so excited to get it working! I am posting here because I think my question also falls under this open issue, and perhaps its discussion can also help others.

I'm working with a subset of snRNA-seq data from the Mouse Organogenesis Cell Atlas (MOCA) here. It has 265124 cells and 24552 genes. For now, I have no spatial transcriptomics dataset that I want to use; I simply want to see what selected gene panel looks like for the 11 main trajectory clusters (and later the 39 major cell types) so I can design FISH experiments.

I was able to reproduce code from the gpsFISH paper seamlessly with the posted .RMD file (which was fantastic!), so I started to follow along with my subset of snRNA-seq data. I skipped (for now) nonessential steps like constructing the weighted penalty matrix, including specific genes that might have been filtered out, and including information about available probes. After filtering genes, 829 remain (which feels really low, but I move on for now...).

Final dimensions of my sc_count are 829 x 265124. Calculating DEGs for each cell type worked fine; I started having problems with the gpsFISH_optimize step. I used default params as in the example .RMD file, except I set gene2include.id, gene.weight, and weight_penalty to NULL.

After running for about a full ten seconds, this error message popped up: Error in plogis(x) : Non-numeric argument to mathematical function

When googling this error, it seems to happen when a mathematical function is applied to non-numerical data. I figured the problem might be with my own data since the code worked fine in the demo .RMD file. I noticed that my sc_count object was a sparse rather than dense matrix (and earlier throughout I had gotten warnings speaking to this), so perhaps that was causing a problem? Especially since in the Platform Effects demo code it looked as though that was a dense matrix. However, converting it to a dense object produced the same error. I next verified that no NAs were found in my sc_count or sc_clusterobjects, and that rownames of the full_count_table matched that of sc_cluster.

Is this a familiar error, and might you have recommendations on how I can correct it? Thank you! I'm hopeful that, despite the size and complexity of the MOCA data, I can use this tool to design cool FISH experiments.

Below I include my output of sessionInfo().

R version 4.2.1 (2022-06-23) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Monterey 12.1

Matrix products: default LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages: [1] monocle3_1.2.9 SingleCellExperiment_1.18.0 SummarizedExperiment_1.26.1 GenomicRanges_1.48.0
[5] GenomeInfoDb_1.32.3 IRanges_2.30.1 S4Vectors_0.34.0 MatrixGenerics_1.8.1
[9] matrixStats_1.2.0 Biobase_2.56.0 BiocGenerics_0.42.0 cowplot_1.1.1
[13] dplyr_1.1.4 ggdendro_0.1.23 SeuratObject_4.1.3 Seurat_4.3.0
[17] reshape2_1.4.4 pheatmap_1.0.12 boot_1.3-28 splitTools_1.0.1
[21] naivebayes_0.9.7 pROC_1.18.5 caret_6.0-94 lattice_0.20-45
[25] ggplot2_3.4.4 ranger_0.16.0 pagoda2_1.0.11 igraph_1.5.1
[29] Matrix_1.5-3 data.table_1.14.10 gpsFISH_0.1.0

loaded via a namespace (and not attached): [1] utf8_1.2.4 spatstat.explore_3.0-6 reticulate_1.28 R.utils_2.12.3 lme4_1.1-30
[6] tidyselect_1.2.0 RSQLite_2.2.16 AnnotationDbi_1.58.0 htmlwidgets_1.6.1 grid_4.2.1
[11] Rtsne_0.17 munsell_0.5.0 codetools_0.2-18 ica_1.0-3 future_1.33.1
[16] miniUI_0.1.1.1 withr_3.0.0 spatstat.random_3.1-3 colorspace_2.1-0 progressr_0.14.0
[21] knitr_1.42 rstudioapi_0.14 ROCR_1.0-11 tensor_1.5 listenv_0.9.1
[26] urltools_1.7.3 GenomeInfoDbData_1.2.8 polyclip_1.10-4 topGO_2.48.0 bit64_4.0.5
[31] parallelly_1.36.0 vctrs_0.6.5 generics_0.1.3 ipred_0.9-14 xfun_0.37
[36] timechange_0.2.0 R6_2.5.1 DelayedArray_0.22.0 bitops_1.0-7 spatstat.utils_3.0-1
[41] cachem_1.0.7 promises_1.2.0.1 scales_1.3.0 nnet_7.3-17 gtable_0.3.4
[46] globals_0.16.2 goftest_1.2-3 drat_0.2.4 timeDate_4032.109 rlang_1.1.3
[51] splines_4.2.1 lazyeval_0.2.2 ModelMetrics_1.2.2.2 spatstat.geom_3.0-6 brew_1.0-10
[56] yaml_2.3.7 abind_1.4-5 httpuv_1.6.9 tools_4.2.1 lava_1.7.3
[61] sccore_1.0.4 ellipsis_0.3.2 RColorBrewer_1.1-3 ggridges_0.5.6 Rcpp_1.0.11
[66] plyr_1.8.9 zlibbioc_1.42.0 RCurl_1.98-1.8 purrr_1.0.2 dendsort_0.3.4
[71] rpart_4.1.16 deldir_1.0-6 pbapply_1.7-0 zoo_1.8-11 ggrepel_0.9.4
[76] cluster_2.1.4 magrittr_2.0.3 scattermore_0.8 SparseM_1.81 lmtest_0.9-40
[81] triebeard_0.4.1 RANN_2.6.1 fitdistrplus_1.1-8 pkgload_1.3.0 patchwork_1.1.2
[86] mime_0.12 evaluate_0.20 xtable_1.8-4 RMTstat_0.3.1 N2R_1.0.1
[91] gridExtra_2.3 compiler_4.2.1 tibble_3.2.1 KernSmooth_2.23-20 crayon_1.5.2
[96] minqa_1.2.4 R.oo_1.26.0 htmltools_0.5.4 mgcv_1.8-40 later_1.3.0
[101] tidyr_1.3.0 lubridate_1.9.3 DBI_1.1.3 MASS_7.3-58.1 cli_3.6.2
[106] R.methodsS3_1.8.2 gower_1.0.1 pkgconfig_2.0.3 sp_1.6-0 terra_1.6-7
[111] plotly_4.10.1.9000 spatstat.sparse_3.0-0 recipes_1.0.9 foreach_1.5.2 hardhat_1.3.1
[116] XVector_0.36.0 prodlim_2023.08.28 stringr_1.5.1 digest_0.6.33 sctransform_0.3.5
[121] RcppAnnoy_0.0.21 graph_1.74.0 spatstat.data_3.0-0 Biostrings_2.64.1 rmarkdown_2.20
[126] leiden_0.4.3 Rook_1.2 uwot_0.1.16 shiny_1.7.4 nloptr_2.0.3
[131] rjson_0.2.21 lifecycle_1.0.4 nlme_3.1-159 jsonlite_1.8.8 viridisLite_0.4.2
[136] fansi_1.0.6 pillar_1.9.0 GO.db_3.15.0 KEGGREST_1.36.3 fastmap_1.1.1
[141] httr_1.4.5 survival_3.4-0 glue_1.6.2 png_0.1-8 iterators_1.0.14
[146] bit_4.0.4 class_7.3-20 stringi_1.8.3 blob_1.2.3 memoise_2.0.1
[151] irlba_2.3.5.1 future.apply_1.11.1

YidaZhang0628 commented 6 months ago

Hello! Thank you for producing this tool, which I think has great potential for reviving an earlier thread in my research. I'm so excited to get it working! I am posting here because I think my question also falls under this open issue, and perhaps its discussion can also help others.

I'm working with a subset of snRNA-seq data from the Mouse Organogenesis Cell Atlas (MOCA) here. It has 265124 cells and 24552 genes. For now, I have no spatial transcriptomics dataset that I want to use; I simply want to see what selected gene panel looks like for the 11 main trajectory clusters (and later the 39 major cell types) so I can design FISH experiments.

I was able to reproduce code from the gpsFISH paper seamlessly with the posted .RMD file (which was fantastic!), so I started to follow along with my subset of snRNA-seq data. I skipped (for now) nonessential steps like constructing the weighted penalty matrix, including specific genes that might have been filtered out, and including information about available probes. After filtering genes, 829 remain (which feels really low, but I move on for now...).

Final dimensions of my sc_count are 829 x 265124. Calculating DEGs for each cell type worked fine; I started having problems with the gpsFISH_optimize step. I used default params as in the example .RMD file, except I set gene2include.id, gene.weight, and weight_penalty to NULL.

After running for about a full ten seconds, this error message popped up: Error in plogis(x) : Non-numeric argument to mathematical function

When googling this error, it seems to happen when a mathematical function is applied to non-numerical data. I figured the problem might be with my own data since the code worked fine in the demo .RMD file. I noticed that my sc_count object was a sparse rather than dense matrix (and earlier throughout I had gotten warnings speaking to this), so perhaps that was causing a problem? Especially since in the Platform Effects demo code it looked as though that was a dense matrix. However, converting it to a dense object produced the same error. I next verified that no NAs were found in my sc_count or sc_clusterobjects, and that rownames of the full_count_table matched that of sc_cluster.

Is this a familiar error, and might you have recommendations on how I can correct it? Thank you! I'm hopeful that, despite the size and complexity of the MOCA data, I can use this tool to design cool FISH experiments.

Below I include my output of sessionInfo().

R version 4.2.1 (2022-06-23) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Monterey 12.1 Matrix products: default LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base other attached packages: [1] monocle3_1.2.9 SingleCellExperiment_1.18.0 SummarizedExperiment_1.26.1 GenomicRanges_1.48.0 [5] GenomeInfoDb_1.32.3 IRanges_2.30.1 S4Vectors_0.34.0 MatrixGenerics_1.8.1 [9] matrixStats_1.2.0 Biobase_2.56.0 BiocGenerics_0.42.0 cowplot_1.1.1 [13] dplyr_1.1.4 ggdendro_0.1.23 SeuratObject_4.1.3 Seurat_4.3.0 [17] reshape2_1.4.4 pheatmap_1.0.12 boot_1.3-28 splitTools_1.0.1 [21] naivebayes_0.9.7 pROC_1.18.5 caret_6.0-94 lattice_0.20-45 [25] ggplot2_3.4.4 ranger_0.16.0 pagoda2_1.0.11 igraph_1.5.1 [29] Matrix_1.5-3 data.table_1.14.10 gpsFISH_0.1.0 loaded via a namespace (and not attached): [1] utf8_1.2.4 spatstat.explore_3.0-6 reticulate_1.28 R.utils_2.12.3 lme4_1.1-30 [6] tidyselect_1.2.0 RSQLite_2.2.16 AnnotationDbi_1.58.0 htmlwidgets_1.6.1 grid_4.2.1 [11] Rtsne_0.17 munsell_0.5.0 codetools_0.2-18 ica_1.0-3 future_1.33.1 [16] miniUI_0.1.1.1 withr_3.0.0 spatstat.random_3.1-3 colorspace_2.1-0 progressr_0.14.0 [21] knitr_1.42 rstudioapi_0.14 ROCR_1.0-11 tensor_1.5 listenv_0.9.1 [26] urltools_1.7.3 GenomeInfoDbData_1.2.8 polyclip_1.10-4 topGO_2.48.0 bit64_4.0.5 [31] parallelly_1.36.0 vctrs_0.6.5 generics_0.1.3 ipred_0.9-14 xfun_0.37 [36] timechange_0.2.0 R6_2.5.1 DelayedArray_0.22.0 bitops_1.0-7 spatstat.utils_3.0-1 [41] cachem_1.0.7 promises_1.2.0.1 scales_1.3.0 nnet_7.3-17 gtable_0.3.4 [46] globals_0.16.2 goftest_1.2-3 drat_0.2.4 timeDate_4032.109 rlang_1.1.3 [51] splines_4.2.1 lazyeval_0.2.2 ModelMetrics_1.2.2.2 spatstat.geom_3.0-6 brew_1.0-10 [56] yaml_2.3.7 abind_1.4-5 httpuv_1.6.9 tools_4.2.1 lava_1.7.3 [61] sccore_1.0.4 ellipsis_0.3.2 RColorBrewer_1.1-3 ggridges_0.5.6 Rcpp_1.0.11 [66] plyr_1.8.9 zlibbioc_1.42.0 RCurl_1.98-1.8 purrr_1.0.2 dendsort_0.3.4 [71] rpart_4.1.16 deldir_1.0-6 pbapply_1.7-0 zoo_1.8-11 ggrepel_0.9.4 [76] cluster_2.1.4 magrittr_2.0.3 scattermore_0.8 SparseM_1.81 lmtest_0.9-40 [81] triebeard_0.4.1 RANN_2.6.1 fitdistrplus_1.1-8 pkgload_1.3.0 patchwork_1.1.2 [86] mime_0.12 evaluate_0.20 xtable_1.8-4 RMTstat_0.3.1 N2R_1.0.1 [91] gridExtra_2.3 compiler_4.2.1 tibble_3.2.1 KernSmooth_2.23-20 crayon_1.5.2 [96] minqa_1.2.4 R.oo_1.26.0 htmltools_0.5.4 mgcv_1.8-40 later_1.3.0 [101] tidyr_1.3.0 lubridate_1.9.3 DBI_1.1.3 MASS_7.3-58.1 cli_3.6.2 [106] R.methodsS3_1.8.2 gower_1.0.1 pkgconfig_2.0.3 sp_1.6-0 terra_1.6-7 [111] plotly_4.10.1.9000 spatstat.sparse_3.0-0 recipes_1.0.9 foreach_1.5.2 hardhat_1.3.1 [116] XVector_0.36.0 prodlim_2023.08.28 stringr_1.5.1 digest_0.6.33 sctransform_0.3.5 [121] RcppAnnoy_0.0.21 graph_1.74.0 spatstat.data_3.0-0 Biostrings_2.64.1 rmarkdown_2.20 [126] leiden_0.4.3 Rook_1.2 uwot_0.1.16 shiny_1.7.4 nloptr_2.0.3 [131] rjson_0.2.21 lifecycle_1.0.4 nlme_3.1-159 jsonlite_1.8.8 viridisLite_0.4.2 [136] fansi_1.0.6 pillar_1.9.0 GO.db_3.15.0 KEGGREST_1.36.3 fastmap_1.1.1 [141] httr_1.4.5 survival_3.4-0 glue_1.6.2 png_0.1-8 iterators_1.0.14 [146] bit_4.0.4 class_7.3-20 stringi_1.8.3 blob_1.2.3 memoise_2.0.1 [151] irlba_2.3.5.1 future.apply_1.11.1

Thank you for your detailed question! plogis is not a function explicitly called in the package. It must be called by some other functions in the package. To learn more about what is going on, can you provide your code and a small dataset so that I can reproduce the error?

evaknichols commented 6 months ago

Hi @YidaZhang0628 , thank you so much for your fast response and willingness to look into this error. This morning I worked to try and get the code and downsampled data object within Github's allowable limits, but failed. Please see this link to a publicly available google drive folder that has all contents.

The full data cds object (derived from Monocle3) has 265124 cells and 24552 genes; the downsampled version has just 261 cells and 24552 genes.

Importantly, though the downsampled object reproduced the same error described above, working with the downsampled data produced a new error at line 163 (initialize_population step):

Error in base::order(DEG.table$AUC, decreasing = TRUE) : 
  argument 1 is not a vector

Please let me know if I can clarify anything or provide code/data in a better alternative way.

YidaZhang0628 commented 6 months ago

@evaknichols Thank you for the data and code! They are very helpful. I can reproduce the error using your code. The problem is indeed the format of sc_count. Although you converted it from a sparse matrix to a dense matrix, its data type isdgCMatrix instead of matrix array. When I add sc_count = as.matrix(sc_count) after sc_count = exprs(cds), the problem is gone. Although sc_count is not used directly as input for gpsFISH_optimize, it is used when calculating relative_prop. The format of sc_count will affect the format of relative_prop, which caused the error. Let me know if this helps.

evaknichols commented 6 months ago

Hi @YidaZhang0628 , brilliant, that worked for me! Thank you so much. It's really exciting to see the panels.

This is not urgent, since the step is optional. I'm curious if you also reproduced the new error during the initialize_population step with my data/code?: Error in base::order(DEG.table$AUC, decreasing = TRUE) : argument 1 is not a vector

In general is there an advantage to initializing gene panels first versus randomly later on during the gpsFISH_optimize step?

YidaZhang0628 commented 6 months ago

Hi @evaknichols The new error is because some cell types don't have any marker genes, i.e., there is nothing in DEG.table. In your case, 5 out of 8 cell types have no differentially expressed genes based on the downsampled data.

In general, including some cell type marker genes can speed up the optimization (too many can lead to local optimum). But In your case, after downsampling, there are no significant transcriptional differences between most cell types. Therefore, it probably won't make much difference compared to a random initialization. But most importantly, the goal of gpsFISH is to find genes that can identify pre-defined cell types. If the input doesn't have much cell type difference, then it does not make sense to perform gene selection on such a dataset (e.g., the downsampled version in your case) and very likely won't work well.

evaknichols commented 6 months ago

Hi @YidaZhang0628 , thank you so much for this explanation--it makes sense to use our cluster to process the fuller dataset.

Hopefully one last question before I go. I'm trying to understand the structure of the GA object as the output of gpsFISH_optimize step. The step in the demo below produces a list (marker_panel) of 100 genes. This makes sense since I asked for 10 marker genes and I have 10 major classes.

marker_panel = rownames(sc_count)[GA$bestsol]
marker_panel

This might be a basic/naive question but, how can I access the 10 genes on a per cluster label basis from the GAobject?

Thanks again for your responsiveness. Of the tools I've dabbled in over the years, gpsFISH has been among the most accessible for someone like me (primarily wet lab experimentalist :') ).

YidaZhang0628 commented 6 months ago

Hi @evaknichols , It is great to know that you enjoyed using gpsFISH! And don't hesitate to let me know if there is anything I can do to improve the experience.

Back to your question, the reason why marker_panel includes 100 genes in the demo is that the panel_size parameter is set to 100. The way gpsFISH works is to find a given number of genes (determined by panel_size) that can best recognize all cell types in the input data. As to how many genes are needed for each individual cell type, it depends on whether the cell type is easy to identify or not. For obvious cell types, it may only require 1-2 genes. But for sutle cell types, it may require many genes. The method is not trying to select the same amount of genes for each cell type.

For your case, if you only need 10 marker genes, you should set panel_size to 10 and the output will be just 10 genes. As to whether 10 genes are enough to recognize your 10 major classes, it depends on the input data and the separatability of the 10 major classes.

evaknichols commented 6 months ago

Hi @YidaZhang0628 , thanks so much for your clarification! I might have a fundamental misunderstanding about gpsFISH. For my experiment, I'm trying to design cell-type specific RNA FISH panels for a multiplexed experiment in tissue slices such that each class has a different color lighting up. So, having some understanding of which genes in marker_panel correspond to which class would be the best (whether 2, 10, or 20 are sufficient is fine). Is this still possible to do with gpsFISH?

(also, please let me know if this discussion makes more sense to move to DM/email--I wouldn't want to misuse Github's issues pages! Much appreciated).

YidaZhang0628 commented 6 months ago

Hi @evaknichols It is possible to get the information you want especially when your gene panel size is small. After having the gene panel, you can generate the average expression per cell type for selected genes plot. From the heatmap, you should be able to tell which gene is highly expressed in which class. Of note, some genes may be highly expressed in multiple classes.

evaknichols commented 6 months ago

Hi @YidaZhang0628 , thank you for the additional information. I think I have a wrong idea about how gpsFISH is working. I was hopeful that it might penalize genes that are highly expressed in multiple classes (maybe through the weighted penalty matrix?) to enrich for genes that are maximally specific for a class. I'll read the paper more deeply and experiment more to find out how gpsFISH's designed gene panel could be different from one I might manually design from something like Seurat's FindAllMarkers function. Thanks so much for your kind attention!

YidaZhang0628 commented 6 months ago

Hi @evaknichols, if there are very specific marker genes for the cell types you are interested in, then they should be identified by gpsFISH. However, in reality, some cell types may not have unique marker genes. Therefore, things like FindAllMarkers won't work. For these cell types, they can only be identified in a combinatorial way (combinatorial expression of multiple genes). gpsFISH is able to account for this situation, which is a major advantage. But this situation can lead to genes that are not specific to only one cell type. With that being said, if we enforce the genes selected to be specific to one cell type, we will not be able to achieve good performance for cell types that don't have unique marker genes.

evaknichols commented 6 months ago

Hi @YidaZhang0628 --thank you for this clarification! This makes sense and it sounds like gpsFISH is doing the best it can to strike a good balance. Since I'm more interested in "Level 1" cell type mapping, in a mouse embryogenesis context, I probably won't have a problem with not having enough marker genes. I'll use my best judgement to evaluate when a selected gene could be "leaky" (usually, I rely on DotPlots and looking at gene expression in UMAP space). Thanks again for your support.