kharchenkolab / gpsFISH

Optimization of gene panels for targeted spatial transcriptomics

Evaluating existing panels #14

Closed pakiessling closed 1 year ago

pakiessling commented 1 year ago

Hi, thanks for the tool - it looks really nice.

Can I use gpsFISH to evaluate a panel of genes that was not selected by it?

YidaZhang0628 commented 1 year ago

Yes, you can use it to evaluate the performance of a gene panel, and there are multiple ways to do it. The simplest is to call the fitness function directly; it is the function we use to evaluate the performance of a given gene panel. Another way is to run gene panel selection by following the tutorial but, instead of randomly initializing the population, initialize it so that every panel in it is the gene panel you want to evaluate (e.g., a population of two identical panels). Then run gpsFISH_optimize for one iteration. One minus the reported fitness value is the accuracy of your gene panel. Hope this helps, and let me know if you have further questions.
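A minimal base R sketch of the second approach described above. It only shows how a population of identical panels could be laid out and how a fitness value converts to accuracy. The one-panel-per-row layout, the index values, and the helper fitness_to_accuracy are assumptions for illustration; the tutorial defines the format gpsFISH_optimize actually expects.

```r
# Hypothetical panel, expressed as gene indices (positions in the count table).
panel_index <- c(12L, 45L, 7L, 103L)

# Assumed layout: one panel per row. Two identical rows, as suggested above.
pop_size <- 2L
population <- matrix(rep(panel_index, times = pop_size),
                     nrow = pop_size, byrow = TRUE)

# Sanity check: every row is the panel under evaluation.
stopifnot(all(apply(population, 1, identical, panel_index)))

# Per the reply above, accuracy is one minus the reported fitness value.
# fitness_to_accuracy is an illustrative helper, not a gpsFISH function.
fitness_to_accuracy <- function(fitness_value) 1 - fitness_value

# Hypothetical fitness value, for illustration only.
print(fitness_to_accuracy(0.25))
```

Because every member of the population is the same panel, a single iteration has nothing to compare against, so the reported fitness reflects that one panel.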

pakiessling commented 1 year ago

Perfect, looking forward to trying it out

pakiessling commented 1 year ago

Thank you for all of your help @YidaZhang0628

Looking at fitness, it is not entirely clear to me what the description of string ("A numeric vector containing the gene panel.") means, or how I can get it from my list of genes. Is this just an index of my genes?

Also, do you find 5 a good starting point for the number of cross-validations, like in your publication?

YidaZhang0628 commented 1 year ago

Sorry for not making it clear. You are right, string is just the index of your genes. Specifically, it is the position of each of your genes in rownames(full_count_table). I will clarify this in the documentation in the next update. Thank you for pointing this out.
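A short sketch of how such an index vector could be built in base R, assuming the panel is available as a character vector of gene names. The toy full_count_table and the gene names below are hypothetical stand-ins; index_list mirrors the name used later in this thread.

```r
# Toy stand-in for full_count_table: genes in rows, cells in columns.
full_count_table <- matrix(
  0L, nrow = 5, ncol = 3,
  dimnames = list(c("Gad1", "Slc17a7", "Pvalb", "Sst", "Vip"),
                  c("cell1", "cell2", "cell3"))
)

# Hypothetical panel to evaluate, given as gene names.
my_panel <- c("Sst", "Gad1", "Vip")

# Position of each panel gene among the row names of the count table.
index_list <- match(my_panel, rownames(full_count_table))

# Fail early if any panel gene is absent from the count table.
missing_genes <- my_panel[is.na(index_list)]
if (length(missing_genes) > 0) {
  stop("Genes not found in full_count_table: ",
       paste(missing_genes, collapse = ", "))
}

print(index_list)
```

match returns NA for names it cannot find, so the explicit check avoids silently passing an incomplete panel to fitness.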

5 should be a good starting point. If you are just evaluating one given panel, you can use more cross-validations, because you are not doing multiple rounds of optimization.

pakiessling commented 1 year ago

@YidaZhang0628 Perfect, thanks a lot.

pakiessling commented 1 year ago

Hi @YidaZhang0628, sorry to bother you again. This time my question is about the relative_prop parameter.

Am I right in assuming that Seurat's AverageExpression(dataset, group.by="cell_type") and AverageExpression(dataset) on a normalized and scaled dataset would return the right values? I am unfortunately quite inexperienced in the R single cell workflow.

YidaZhang0628 commented 1 year ago

I am not familiar with the functions you mentioned, but if they operate on normalized and scaled data, the result is probably different from what gpsFISH needs. The gene panel selection tutorial has a section on how to calculate relative_prop from sc_count. You can follow that to calculate relative_prop.
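To illustrate why raw counts matter here, the sketch below computes relative expression as each gene's share of the counts, per cell and per cluster, from a toy raw count matrix. This is an assumption about what a relative proportion means in general, not a reproduction of the tutorial's code; the object names cell_level, cluster_average, and cluster_of_cell are hypothetical, and the tutorial remains the reference for the exact structure gpsFISH expects.

```r
# Toy raw count matrix: genes in rows, cells in columns (hypothetical values).
sc_count <- matrix(
  c(2,  0, 4,
    6, 10, 4,
    2, 10, 8),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"),
                  c("cell1", "cell2", "cell3"))
)

# Hypothetical cluster label for each cell.
cluster_of_cell <- c(cell1 = "T", cell2 = "B", cell3 = "T")

# Per-cell relative expression: each column divided by its total count.
cell_level <- sweep(sc_count, 2, colSums(sc_count), "/")

# Per-cluster relative expression: pool raw counts within each cluster,
# then divide by the pooled total. (Assumed definition, see lead-in.)
cells_by_cluster <- split(colnames(sc_count),
                          cluster_of_cell[colnames(sc_count)])
pooled <- sapply(cells_by_cluster,
                 function(cells) rowSums(sc_count[, cells, drop = FALSE]))
cluster_average <- sweep(pooled, 2, colSums(pooled), "/")

print(round(cell_level, 3))
print(round(cluster_average, 3))
```

Every column sums to one, which is only meaningful when the input is non-negative raw counts. Scaled data is centered and can be negative, so it cannot be turned into proportions this way.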

pakiessling commented 1 year ago

Sorry @YidaZhang0628, but I once more need your help 😅

I am now trying out fitness on the tutorial dataset. When I run it in "Simulation" mode everything works perfectly, but "No_simulation" causes an error.

Code:

fitness(
    string = index_list,
    gene_list = marker_panel,
    cell_list = cell_list,
    cell_cluster_conversion = sc_cluster,
    nCV = 5,
    relative_prop = relative_prop,
    two_step_sampling_type = c("Subsampling_by_cluster", "No_simulation"),
    cluster_size_min = 20,
    # simulation_parameter = simulation_params,
    # sample_new_levels = "old_levels",
)

Error:

Error in base::colSums(spatial_sc_count): 'x' must be an array of at least two dimensions
Traceback:

1. fitness(string = index_list, gene_list = marker_panel, cell_list = cell_list, 
 .     cell_cluster_conversion = sc_cluster, nCV = 5, relative_prop = relative_prop, 
 .     two_step_sampling_type = c("Subsampling_by_cluster", "No_simulation"), 
 .     cluster_size_min = 20, )
2. lapply(cvround, classifier_per_cv, cvlabel = cvlabel, gene_list = candidate_gene_panel_loc, 
 .     cell_list = subsample_cell_loc, class_label_per_cell = class_label_per_cell, 
 .     metric = metric, method = method, RF_num_threads = RF_num_threads, 
 .     relative_prop = relative_prop, sample_new_levels = sample_new_levels, 
 .     use_average_cluster_profiles = use_average_cluster_profiles, 
 .     simulation_type = two_step_sampling_type[2], simulation_parameter = simulation_parameter, 
 .     simulation_model = simulation_model, cell_cluster_conversion = cell_cluster_conversion, 
 .     weight_penalty = weight_penalty)
3. FUN(X[[i]], ...)
4. base::colSums(spatial_sc_count)
5. stop("'x' must be an array of at least two dimensions")

YidaZhang0628 commented 1 year ago

Can you send me the data that I can use to reproduce this error?

pakiessling commented 1 year ago

@YidaZhang0628

I get this error with data(sc_count) as well as with my own data.

You can find the code I ran here (gpsFISH installed from GitHub, development version):

https://github.com/pakiessling/misc/blob/main/gpsfish_tutorial.ipynb

YidaZhang0628 commented 1 year ago

From the code, it seems that you are using the development version. Unfortunately, to make the code more efficient, the development version does not have a "no simulation" option for fitness. If you want to evaluate fitness without simulation, you can try the main version.
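For readers who hit the same message: the error text comes from base R rather than from gpsFISH. colSums raises it for any input with fewer than two dimensions, such as NULL or a plain vector. The sketch below reproduces the message in isolation; try_colsums and safe_col_sums are illustrative helpers, and nothing here is a claim about what the development version does internally.

```r
# colSums() needs a matrix-like input; NULL and plain vectors have no
# two-dimensional dim attribute, so the call stops with the message
# seen in the traceback.
try_colsums <- function(x) {
  tryCatch(
    {
      colSums(x)
      "ok"
    },
    error = function(e) conditionMessage(e)
  )
}

print(try_colsums(NULL))                    # error message
print(try_colsums(c(1, 2, 3)))              # error message
print(try_colsums(matrix(1:6, nrow = 2)))   # "ok"

# A guard that reports the problem more clearly before calling colSums().
safe_col_sums <- function(x) {
  if (is.null(dim(x)) || length(dim(x)) < 2) {
    stop("expected a matrix or data frame with two dimensions")
  }
  colSums(x)
}

print(safe_col_sums(matrix(1:6, nrow = 2)))
```

A common way to end up with a dimensionless object is subsetting a matrix down to a single row or column without drop = FALSE, or passing an object that was never assigned in the code path taken.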

pakiessling commented 1 year ago

@YidaZhang0628 Ok, good to know.

Does that mean gpsFISH will not support panel selection without simulation in the future? In other words, will I be unable to use it if I don't already have a spatial reference for the simulation?

YidaZhang0628 commented 1 year ago

We will implement a no-simulation option for gpsFISH in the future.