satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat
Other
2.28k stars 909 forks

FindMarkers() gets stuck when logfc.threshold and min.pct are set to less stringent values #5227

Closed l-cli closed 1 year ago

l-cli commented 3 years ago

Hi,

I am currently running the FindMarkers() function on a Seurat object that has 36601 genes and ~18000 cells with the following setup:

Idents(t) <- "time.ident"
zero_one <- FindMarkers(t, 
                          ident.1 = "1hr", 
                          logfc.threshold = 0.2, 
                          min.pct = 0,
                          ident.2 = "0hr", 
                          test.use = "LR",
                          verbose = TRUE)

Our goal is to compare the cells in the "1hr" group with those in the "0hr" group. When we ran the code above, the program stalled and produced no output after more than 48 hours. We are using future for parallelization, and the code above was run with 6 workers. When we tried a logfc.threshold of 0.25 and the default min.pct, we did get some output, but we want to include all genes rather than just a subset. We would appreciate any suggestions on how to solve this issue, thank you!

liu-xingliang commented 2 years ago

My understanding is that the progress bar does not work when running in parallel. It would be good to estimate the time required by first running the code in single-core mode, and then try parallel mode.

Also, lowering min.pct to 0 to keep all genes may not be good practice, because the default of 0.1 (expressed in at least 10% of cells in at least one group) ensures we are working with genes that are actually expressed. You may instead want to set logfc.threshold = 0 to keep all expressed genes without filtering them by fold change. That usually gives you more than 10k genes, which should be enough for any analysis that requires a "whole" gene list, for example GSEA.
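
To make the single-core timing suggestion concrete, here is a minimal sketch. It assumes the small demo object `pbmc_small` from SeuratObject as a stand-in for the real data (its `groups` column holds "g1" and "g2"), and the pilot subset size of 100 genes is arbitrary. The linear extrapolation is only a rough guide, not a guarantee of the real run time.

```r
library(Seurat)
library(future)

# Single-core mode, so the progress bar is shown and timing is unambiguous.
plan("sequential")

# Stand-in data (assumption): the small demo object shipped with SeuratObject.
data("pbmc_small", package = "SeuratObject")
obj <- pbmc_small
Idents(obj) <- "groups"  # two groups, "g1" and "g2"

# Pilot run on a random subset of genes (the subset size is arbitrary).
set.seed(1)
pilot_genes <- sample(rownames(obj), 100)

elapsed <- system.time(
  pilot <- FindMarkers(obj,
                       ident.1 = "g1",
                       ident.2 = "g2",
                       features = pilot_genes,
                       logfc.threshold = 0,  # no fold-change cut-off
                       min.pct = 0.1,        # expression filter from the advice above
                       test.use = "LR",
                       verbose = TRUE)
)[["elapsed"]]

# Rough linear extrapolation to the full gene set, in minutes.
est_minutes <- elapsed * nrow(obj) / length(pilot_genes) / 60
```

On a real object, the same pattern would use the object's own identities (for example "1hr" and "0hr") and a larger pilot subset.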

l-cli commented 2 years ago

Thank you for the advice! My follow-up question is whether FindMarkers() simply does not support parallel processing and fails to proceed when run that way. We are now running the code above with min.pct = 0.1 and without parallel processing, and the estimated time is 15 min, but when we used parallel processing with future, it ran overnight and did not finish.

mhkowalski commented 2 years ago

Hi, FindMarkers should support parallel processing, as shown here. Can you provide the code you are using to run future?

Apologies if you did this already but it's not quite clear to me. Did you try running with min.pct = 0.1 with future, in which case the run time should be less than 15 minutes?

l-cli commented 2 years ago

Hi! Here is the code we used to set up future:

# set threads and parallelization
library(future)
plan("multisession", workers = 6)
options(expressions = 20000)
options(future.globals.maxSize = 21474836480)  # 20 GiB
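
For anyone reproducing this setup, a hedged sketch of a quick dry run before launching the long job. With "multisession", each worker is a separate R process, so objects a future needs are exported to every worker, which may increase memory use. The worker count of 2 and the variable names are illustrative assumptions, not part of the original setup.

```r
library(future)

# 20 GiB written as an expression; same value as 21474836480 above.
max_globals_bytes <- 20 * 1024^3
old_opts <- options(future.globals.maxSize = max_globals_bytes)

# Small worker count for a dry run (assumption; the post used 6).
plan("multisession", workers = 2)

# Confirm the plan is active before launching a long job.
n_workers <- nbrOfWorkers()

# Trivial future to check that a worker starts and returns a result.
f <- future(Sys.getpid())
worker_pid <- value(f)

# Return to single-core mode and restore the previous options.
plan("sequential")
options(old_opts)
```

If the trivial future returns a process ID different from the main session's, the workers are at least starting correctly, which helps separate a setup problem from a slow test.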

We did run it with min.pct = 0.1 and without future, and it took about a day to finish.

jzhou88 commented 2 years ago

Hi Seurat people,

The same thing happens here. I have a FindMarkers job that took about 6 hours for Seurat v3 with 16 cores to complete. Now, using Seurat v4 with 16 cores, the same job takes forever to finish. If possible, would you please check this issue? Thanks.

Best, J

tilofrei commented 1 year ago

Hey, maybe check out presto: it significantly speeds up DEG calculation with the Wilcox test for me and is integrated through Seurat wrappers.

saketkc commented 1 year ago

Hi @tilofrei, you are using the "LR" test, which is expected to be slow (particularly with low thresholds such as the 0.2 you have). You could switch to the Wilcoxon test (which is also the default).
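
A small sketch of how the two tests could be compared on the same contrast. It again assumes the demo object `pbmc_small` from SeuratObject as a stand-in, and `time_test` is a hypothetical helper written for this example. Timings on such a tiny object will not reflect the difference on ~18000 cells.

```r
library(Seurat)

# Stand-in data (assumption): the small demo object shipped with SeuratObject.
data("pbmc_small", package = "SeuratObject")
obj <- pbmc_small
Idents(obj) <- "groups"  # two groups, "g1" and "g2"

# Hypothetical helper: elapsed seconds for one FindMarkers() run with a given test.
time_test <- function(test) {
  system.time(
    FindMarkers(obj,
                ident.1 = "g1",
                ident.2 = "g2",
                logfc.threshold = 0,
                min.pct = 0.1,
                test.use = test,
                verbose = FALSE)
  )[["elapsed"]]
}

# Same contrast and thresholds, only the test changes.
timings <- c(wilcox = time_test("wilcox"), LR = time_test("LR"))
print(timings)
```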

flde commented 1 year ago

I have the same problem on a 10.000 x 65.000 (genes x cells) matrix, running with 200G of memory and 42 CPUs. The job is killed by the out-of-memory handler while using the Wilcoxon test with ptc.1=0 (Seurat v4). Maybe I am ignorant, but is this expected?