diff_gene_cluster: forking not supported on Windows

kharchenkolab / gpsFISH

Optimization of gene panels for targeted spatial transcriptomics

diff_gene_cluster: forking not supported on Windows #20

Closed: simomounir closed this issue 6 months ago

simomounir commented 1 year ago

Hi again,

I am trying to run the diff_gene_cluster method in order to initialize the population of genes used for panel creation. Below is what I encountered:

diff_expr = suppressMessages(diff_gene_cluster(pagoda_object = adjust_variance$pagoda.object, cell_cluster_conversion = cell_cluster_conversiondf, n.core = 1))

Error in mclapply(..., mc.cores = n.cores, mc.preschedule = mc.preschedule) : 'mc.cores' > 1 is not supported on Windows

I tried forcing n.core = 1, but it still raises the same error.

Is there any work-around for Windows users?

Thanks in advance for your help.

Cheers

simomounir commented 1 year ago

Hi, me again

Is there any way to change the implementation so it does not forcibly call mclapply? Maybe an argument to only use n.core=1? I tried passing n.core=1 as an argument but it still triggers the same error.

This function causes some pitfalls for Windows users. Please do let me know :).

Thanks in advance,

Cheers.

YidaZhang0628 commented 1 year ago

Sorry for the delay. I am having a really busy week and will try to take a look at it early next week.

YidaZhang0628 commented 1 year ago

I took a look at the code. diff_gene_cluster doesn't use mclapply. The variance adjustment step before it, which is done by preprocess_normalize, calls a function from the Pagoda2 package, and that function uses parallel computing. As a quick fix, you can use a package other than Pagoda2 (e.g., Seurat) to perform normalization and differential expression. This way, you can skip both preprocess_normalize and diff_gene_cluster. All you need to do is make sure the output from Seurat has the same format as the diff_expr_result argument of initialize_population. I will work on an alternative for Windows users in the near future.
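For readers hitting the same error: mclapply relies on forking, so on Windows it only works with mc.cores = 1. A minimal sketch of a platform-aware wrapper is shown below. The name portable_mclapply and its n.cores argument are hypothetical and are not part of gpsFISH or Pagoda2; this only illustrates the kind of fallback a package could adopt, and it does not by itself change what Pagoda2 calls internally.

```r
library(parallel)

# Hypothetical wrapper (not from gpsFISH or Pagoda2):
# use the requested number of cores where forking is available,
# and fall back to a single core on Windows.
portable_mclapply <- function(X, FUN, ..., n.cores = 1) {
  if (.Platform$OS.type == "windows") {
    n.cores <- 1  # mclapply raises an error on Windows when mc.cores > 1
  }
  mclapply(X, FUN, ..., mc.cores = n.cores)
}

# Toy usage: square four numbers, asking for two cores.
squares <- portable_mclapply(1:4, function(i) i^2, n.cores = 2)
```

Running traceback() right after the error is a quick way to see which function in the call stack actually invokes mclapply.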

Boehmin commented 7 months ago

Follow-up question on diff_gene_cluster: I am running gpsFISH on our snRNA-seq data, and everything worked well up to this point. However, when I got to this line, it just kept computing and computing: diff_expr = suppressMessages(diff_gene_cluster(pagoda_object = adjust_variance$pagoda.object, cell_cluster_conversion = my_sc_cluster, n.core = 20))

I started using our large HPC configuration with 28 cores and 180 GB RAM, since I thought 90 GB might not be enough. There is the option to try an even larger configuration, but I am stumped by how long it takes (it has been running for over 30 min). Is this step also a bottleneck for you?

I should add that I want to predict probes based on snRNA-seq data. We don't have spatial data yet.

Cheers

Edit: with 270 GB RAM and 42 cores, I managed to get it to run in a reasonable time. I had to assign 40 cores, however.

YidaZhang0628 commented 7 months ago

Hi @Boehmin, if you have a relatively big or complex (many cell types) dataset, the diff_gene_cluster function can take a long time and a lot of RAM. In essence, this step finds marker genes for each cell type. One alternative is to try another marker gene identification method, such as FindAllMarkers from Seurat, which could be faster. Just make sure to organize the result in the same format as the output of diff_gene_cluster. One more note: save diff_expr so that you don't need to calculate it again if you run the optimization multiple times.
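The advice to save diff_expr can be sketched with base R's saveRDS and readRDS. The helper name cached_compute and the file name diff_expr.rds are hypothetical choices for illustration; only the diff_gene_cluster call in the commented usage comes from this thread.

```r
# Hypothetical helper: compute a result once, then reuse the saved copy.
cached_compute <- function(cache_file, compute) {
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))  # reuse the previously saved result
  }
  result <- compute()            # the expensive step runs only on a cache miss
  saveRDS(result, cache_file)
  result
}

# Usage with the call from this thread (needs the gpsFISH objects in scope):
# diff_expr <- cached_compute("diff_expr.rds", function() {
#   suppressMessages(diff_gene_cluster(
#     pagoda_object = adjust_variance$pagoda.object,
#     cell_cluster_conversion = my_sc_cluster,
#     n.core = 20
#   ))
# })
```

Delete the cache file whenever the input data or clustering changes, since the helper does not detect stale results.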