y variable has to contain at least one observation per class for estimation process

kharchenkolab / gpsFISH

Optimization of gene panels for targeted spatial transcriptomics


y variable has to contain at least one observation per class for estimation process #17

Closed pakiessling closed 1 year ago

pakiessling commented 1 year ago

Hi @YidaZhang0628 ,

thanks to all of your help I now have gpsFISH running quite nicely.

For the evaluation of a panel I tried to increase the number of cross-validations from 5 to 10. I then received the error: y variable has to contain at least one observation per class for estimation process

Thanks!

YidaZhang0628 commented 1 year ago

It is good to know that you can run gpsFISH on your dataset. For your question, what is the command you used and the size (number of cells) of your smallest cell type after subsampling?

pakiessling commented 1 year ago

This is the command:

result <- fitness(string = index_list,
                  full_count_table = as.data.frame(t(sc_count)),
                  cell_cluster_conversion = sc_cluster,
                  nCV = 10,
                  rate = 0.15,
                  cluster_size_min = 50,
                  relative_prop = relative_prop,
                  two_step_sampling_type = c('Subsampling_by_cluster', 'No_simulation'))

I am unsure how to retrieve the number of cells after fitness() subsamples. The smallest number before subsampling is 718.

pakiessling commented 1 year ago

I just noticed that fitness() wants genes as columns, exactly the other way around than gpsFISH_optimize(), my mistake. Edit: This results in

Error in fitness(string = index_list, full_count_table = as.data.frame(sc_count),  : 
  'full_count_table' should have the same row name with 'cell_cluster_conversion'

I guess the documentation on full_count_table must be wrong:

full_count_table | A data frame containing the expression level of each gene in each cell with gene name as row name and cell name as column name.
-- | --
cell_cluster_conversion | A data frame with each row representing information of one cell. First column contains the cell name. Second column contains the corresponding cell type name. Row name of the data frame should be the cell name.

YidaZhang0628 commented 1 year ago

Thank you for pointing this out. You are right that the documentation on full_count_table is wrong. You should have cells as rows and genes as columns. I have updated it. Sorry for the confusion.
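For readers hitting the same row-name error, here is a minimal sketch of how the orientation could be checked before calling fitness(). It uses toy data and base R only. All object names ending in `_toy` are made up for illustration, and whether fitness() requires the row names in the same order or only the same set is an assumption here (the sketch checks the stricter condition).

```r
# Toy data; every name below is illustrative and not taken from the thread.
genes <- c("geneA", "geneB", "geneC")
cells <- c("cell1", "cell2", "cell3", "cell4")

# Count matrix stored as genes x cells.
sc_count_toy <- matrix(1:12, nrow = 3, dimnames = list(genes, cells))

# One row per cell, with the cell names as row names.
sc_cluster_toy <- data.frame(
  cell_name = cells,
  cell_type = c("T", "T", "B", "B"),
  row.names = cells
)

# Transpose so that cells become rows and genes become columns.
full_count_table_toy <- as.data.frame(t(sc_count_toy))

# Assumption: identical order is checked here, the stricter of the two options.
rows_match <- identical(rownames(full_count_table_toy),
                        rownames(sc_cluster_toy))
print(rows_match)
```

Without the transpose, the row names would be gene names and the same check would fail.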

pakiessling commented 1 year ago

@YidaZhang0628 No problem. Any idea about the "y variable" error? Can I make gpsFISH print what it is doing during the subsampling somehow?

YidaZhang0628 commented 1 year ago

If you increase rate and cluster_size_min, is the error still there? Regarding the subsampling part, it is simply the original cell type size times the rate, adjusted by the lower and upper bound. In your case, the smallest cell type is subsampled to 718*0.15, which is about 108 cells. This should be enough for 10 cross-validations. If increasing rate and cluster_size_min doesn't solve the problem, can you share the files needed to reproduce this error? I can take a look.
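The subsampling rule described in this comment can be sketched in a few lines of base R. This is a rough illustration, not the gpsFISH implementation: `subsampled_size` is a hypothetical helper, and both the rounding and the handling of the upper bound are assumptions.

```r
# Hypothetical helper, not part of gpsFISH:
# size = original size * rate, clamped to [lower, upper].
subsampled_size <- function(original_size, rate, lower, upper = Inf) {
  target <- round(original_size * rate)  # rounding is an assumption
  # Clamp to the bounds, but never ask for more cells than exist.
  min(original_size, max(lower, min(upper, target)))
}

# Values from the thread: smallest cell type 718, rate 0.15, cluster_size_min 50.
smallest <- subsampled_size(718, rate = 0.15, lower = 50)
print(smallest)

# Rough number of cells per fold with nCV = 10.
print(smallest / 10)
```

With the numbers from the thread this gives roughly 108 cells for the smallest cell type, or about 10 cells per fold at nCV = 10, which is why cell type size alone was not expected to cause the error.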

pakiessling commented 1 year ago

@YidaZhang0628 thank you so much, I will try increasing the parameters first

pakiessling commented 1 year ago

@YidaZhang0628 even after doubling the subsample - same error :(

Here is the code I'm running: https://github.com/pakiessling/misc/blob/main/gpsfish.R

Here is the dataset (600 MB): https://rwth-aachen.sciebo.de/s/wpNMOYlxOqXiKtH

YidaZhang0628 commented 1 year ago

I took a look at the code and found that this is caused by a mismatch between cross-validation names when there are 10 or more cross-validations. I have fixed this issue. If you re-install the main version, you will be able to run your code without a problem.
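The thread does not show the actual fix. As a purely hypothetical illustration of how fold names can stop matching once there are 10 or more folds, the sketch below compares zero-padded names ("Fold01" to "Fold10") with unpadded ones ("Fold1" to "Fold10"). Both helper functions are invented for this example and are not taken from gpsFISH.

```r
# Hypothetical illustration; this is NOT the actual gpsFISH code.

# Fold names padded to a common width: with k >= 10, format() yields
# " 1", " 2", ..., "10", which become "01", "02", ..., "10".
padded_fold_names <- function(k) {
  paste0("Fold", gsub(" ", "0", format(seq_len(k))))
}

# Fold names built without padding: "Fold1", "Fold2", ...
plain_fold_names <- function(k) {
  paste0("Fold", seq_len(k))
}

# With 5 folds both schemes agree; with 10 folds only "Fold10" matches.
matches_5 <- plain_fold_names(5) %in% padded_fold_names(5)
matches_10 <- plain_fold_names(10) %in% padded_fold_names(10)
print(sum(matches_5))
print(sum(matches_10))
```

A lookup by a name that does not exist returns an empty fold, and a classifier trained on an empty or incomplete fold could plausibly complain that some class has no observations. Whether this is what happened in gpsFISH is an assumption.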

pakiessling commented 1 year ago

Perfect. Thank you!