tanaylab / metacells

Metacells - Single-cell RNA Sequencing Analysis
MIT License

Parallelization of some QC routines overloads the system #24

Closed tzeitim closed 1 year ago

tzeitim commented 2 years ago

I am struggling to run some QC routines like those found in compute_for_mcview.

In particular:

    - compute_inner_fold_factors
    - compute_deviant_fold_factors

I've noticed that you have approached parallelization in a non-naive manner, and I am wondering if there is any recommendation to prevent saturation of the system by what I guess is an issue of nested parallelization.

Perhaps there is a simple fix, like a global/environment variable or a flag that tells the function to suppress its urge to parallelize so aggressively?

A representative example of the errors produced by any of these functions follows:

...
OpenBLAS blas_thread_init: pthread_create failed for thread 20 of 56: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 2062711 max
OpenBLAS blas_thread_init: pthread_create failed for thread 21 of 56: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 2062711 max
OpenBLAS blas_thread_init: pthread_create failed for thread 22 of 56: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 4096 current, 2062711 max
...

When this happens the kernel dies, of course.

orenbenkiki commented 2 years ago

I'm confused. What is the scenario where you get the crash?

The parallel code in MC2 works hard so that BLAS will not go berserk and nested parallelism will not overwhelm the machine.

The root problem is that parallel mechanisms (Python, BLAS) are blissfully unaware of each other, especially when we are using the fork model (which Python does) instead of multi-threading (which Python doesn't "really" support in a meaningful way). So unless you jump through flaming hoops (which my code does), you'll get these crashes every time you run a parallel map and invoke some numpy method in each sub-process.
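
To make the failure mode concrete, here is a minimal sketch (not MC2 code; it assumes a Linux machine where the fork start method is available and a numpy backed by a multi-threaded BLAS): a fork-based parallel map where every worker calls into numpy, so each child may spin up its own machine-sized BLAS thread pool.

from multiprocessing import get_context

import numpy as np


def blas_worker(_index):
    # Each forked child calls into numpy/BLAS, which by default sizes its
    # thread pool to the whole machine - unaware of its sibling processes.
    a = np.random.rand(1000, 1000)
    return float((a @ a).sum())


if __name__ == "__main__":
    # 8 workers, each with a machine-sized BLAS pool = far more threads than cores.
    with get_context("fork").Pool(8) as pool:
        print(sum(pool.map(blas_worker, range(8))))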

I have discussed this with relevant people (including the TBB and OpenMP owners in Intel) and this seems inevitable. The only real alternative is to port everything to Julia which does multi-threading (almost) reasonably and has a single global task scheduler which works across all the nested parallel loops as long as you are within the same program. In general Julia would be much more efficient than Python, but switching to Julia would be a violent change.

tzeitim commented 2 years ago

Sorry if I was too general, and for my late reply - I wanted to perform some tests first. The workflow is the same as the vignette. All the steps work well - I manage to clean my single cells, select the dodgy gene modules to ignore, define forbidden genes and so forth.

My current workflow would then consist of saving the h5ad files and then moving to MCView in order to explore the results. Only recently I discovered that compute_for_mcview exists and wanted to try it out. A vanilla execution, like the one shown in the vignette, threw the BLAS overloading errors I showed before and killed the kernel.

outliers = mc.pl.compute_for_mcview(adata=clean, gdata=metacells, random_seed=123456, compute_var_var_similarity=dict(top=50, bottom=50))

To narrow down the source of the problem I ran the command above leaving all arguments as None except one. From this examination I think the problem was triggered by a couple of functions. This is when I wrote the issue.

I think I was faster at writing github issues than at reading code. Right after raising the issue (classic Murphy), I discovered mc.utilities.parallel.set_processors_count. After setting it to a reasonable number (e.g. 4) the BLAS clogging didn't happen any more and I could run compute_for_mcview with all steps enabled.
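
For reference, the call is along these lines (a sketch; 4 is just an example value):

import metacells as mc

# Cap the number of worker processes MC2 will spawn.
mc.utilities.parallel.set_processors_count(4)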

In case it helps, information about the server I am working on is shown below:

lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                112
On-line CPU(s) list:   0-111
Thread(s) per core:    2
Core(s) per socket:    14
Socket(s):             4
NUMA node(s):          4
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E7-4850 v3 @ 2.20GHz
Stepping:              4
CPU MHz:               1199.902
CPU max MHz:           2800.0000
CPU min MHz:           1200.0000
BogoMIPS:              4389.18
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              35840K
NUMA node0 CPU(s):     0,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76,80,84,88,92,96,100,104,108
NUMA node1 CPU(s):     1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61,65,69,73,77,81,85,89,93,97,101,105,109
NUMA node2 CPU(s):     2,6,10,14,18,22,26,30,34,38,42,46,50,54,58,62,66,70,74,78,82,86,90,94,98,102,106,110
NUMA node3 CPU(s):     3,7,11,15,19,23,27,31,35,39,43,47,51,55,59,63,67,71,75,79,83,87,91,95,99,103,107,111
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d

I would be happy to help you diagnose further the exact root of the problem if needed - it is a bit strange that these routines trigger the issue and not any other one earlier in the workflow.

orenbenkiki commented 2 years ago

Let me see if I got it right. You are running inside a Jupyter notebook, you CAN compute MCs (which does a lot of parallel work) and when doing so you do NOT see an issue, but when you run compute for MCView on the results you DO see an issue.

Is that correct?

If so, this is very weird. There is plenty of parallelism (including invoking BLAS in each parallel sub-process) when computing MCs. There's nothing special in the code computing the fold factors which should cause a problem.

tzeitim commented 2 years ago

You got it right and once I limit the CPUs for "compute for mcview" the issue is gone.

It is indeed very strange.

orenbenkiki commented 2 years ago

Ok. Can you run this (both the MC computation and the MCView computation) with debug logging?

Do this:

import metacells as mc
import logging
mc.ut.setup_logger(level=logging.DEBUG)

(Setting up the logger must be the 1st thing after importing the metacells module).

This will be a large log file but it should tell us something.

tzeitim commented 2 years ago

Thanks for the instructions.

You can find the log here

(edit: first link was incorrect)

orenbenkiki commented 2 years ago

Link https://gist.github.com/tzeitim/2e35fe1960ebe733ad46fafee472dc10 is still wrong? It shows a dump of a GDB log file, not the log messages emitted by the MC package.

tzeitim commented 2 years ago

Sorry for the confusion, it was a really bad coincidence that the copy+pasting mistake pointed to another MC2 log from an older issue.

For clarity: The correct one: https://gist.github.com/tzeitim/dce4bb9a17c9aa639ebe23f843c8ee5a

The wrong one is: https://gist.github.com/tzeitim/2e35fe1960ebe733ad46fafee472dc10

(I really shouldn't use the github app on the phone)

orenbenkiki commented 2 years ago

Ok, looked at the log. I note the job isn't actually that large, so the reason BLAS didn't crash during the MC computation is that it actually used only a few processes. If you run MC2 on a large data set it will probably crash there as well.

When you get to the MCView computation, it actually does start to use many processes. You can see in the log file that it also (tries to) throttle each one to use only a few threads (or just one thread) in internal calls like BLAS. This obviously failed.

This is done using threadpoolctl.threadpool_limits. Now, for us (and all users we have had so far), this works fine. However... maybe you have some different BLAS implementation that isn't affected by this?

You can check this by (1) ignoring MC2 altogether, (2) running on your 56-core machine, (3) running some BLAS function (via numpy) and seeing that it uses all 56 cores (run top or htop or something). Then (4) invoke threadpoolctl.threadpool_limits with a lower value and (5) re-run the operation. If I'm correct, you'll see BLAS still uses all 56 cores (again using top or htop or something like that).
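
A minimal sketch of steps (1)-(5), assuming an OpenBLAS/OpenMP-backed numpy; watch top/htop while each matrix product runs:

import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.rand(4000, 4000)
b = np.random.rand(4000, 4000)

# Step (3): unrestricted - the matrix product should fan out over all cores.
_ = a @ b

# Steps (4)+(5): restricted - if threadpoolctl can control your BLAS, only one
# core should now be busy; if all cores still light up, the limit is ignored.
with threadpool_limits(limits=1, user_api="blas"):
    _ = a @ b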

If this is the case, then there's an issue between BLAS and threadpoolctl.threadpool_limits - you'd need to take it up with either tool. Can you say something about your setup - are you using vanilla pandas/numpy and the BLAS they bring with them, or are you using some custom version of BLAS?

As a workaround, since your problem is small-ish, as you said, you can use set_processors_count (or METACELLS_PROCESSORS_COUNT) to force MC2 to use only a few cores (or even just one). Not much of a solution, I admit...

tzeitim commented 2 years ago

Thanks for the suggestions and the heads up that this can happen with large datasets. Just for the record, the MC computation doesn't break on slightly larger datasets, in the order of 1e5 cells.

I am not using any custom version of BLAS. I am running everything with the numpy that conda provides, while MC2 I install using pip.

I will dig deeper into the BLAS issue as you suggested, but in the meantime: one thing that I don't understand is the discrepancy between what the MC2 logs show (56 processors) and what lscpu shows (112) - which happens to be a two-fold difference.

orenbenkiki commented 2 years ago

The 56 vs. 112 processors count is because MC2 only spawns one process per physical core. Using hyper-threads is counter-productive as it doubles the pressure on the caches and gets no benefits in IPC in numeric (AVX/SSE) heavy code.

Using standard conda numpy/BLAS and still seeing the issue... that's scary, I thought I had this problem solved. Can you send me the version numbers of "everything"?

tzeitim commented 2 years ago

Thanks for all the suggestions. I am working on them...

While fetching the version of "everything" I noticed something peculiar. I have the suspicion that even though the conda environment did install a relatively recent OpenBLAS library version, numpy found the BLAS instance natively installed on the server first, which I expect is much older than anything conda provides. I will get back once I have all this covered; in the end it might be that MC2 is perfectly fine but my environment is just a bit crooked, and even though everything is installed, numpy is looking in the wrong place.

orenbenkiki commented 2 years ago

I looked at #25 and checked my setup. I'm using numpy 1.20.3 - I don't think the numpy version is the issue. But I checked my numpy configuration and what do you know, I'm not using OpenBLAS. So it is possible that threadpoolctl.threadpool_limits doesn't impact OpenBLAS at all.

I'm testing a workaround - setting the OMP_NUM_THREADS environment variable inside my code so the sub-processes will restrict themselves, whichever OpenMP library they are using. I'll report the results.

tzeitim commented 2 years ago

Thanks! Just for the record - I have no specific reason to be using OpenBLAS, it just happened to be the default from conda-forge if I am not wrong. What I learned is that MKL is the default for numpy and should not have the nesting problems, but I have not done any tests.

What are you using instead of OpenBLAS? I am curious to see whether those libraries are problematic for me too.

orenbenkiki commented 2 years ago

Seems like I wasn't really using any? :-(

MC2 does fall back to an internal parallel reproducible C++ implementation using AVX for cross correlation anyway, which is the main numeric CPU hog - this is less efficient than the hyper-optimized versions, but, hey, reproducible results. If you don't run the reproducible code (don't specify a random seed) then it uses "whatever numpy provides" for cross correlations. Other operations (which seem to be reproducible out of the box) go directly to numpy. Side rant: there's no documentation anywhere about what is and what isn't reproducible in numpy, but this system has worked for us so far - I'm starting to worry it might be very dependent on the underlying linear algebra implementation in numpy, and I do NOT want to reimplement everything like I had to do for cross correlation... it seems the numpy people (and the BLAS people) don't care about reproducibility at all, even when it is trivial to achieve even in parallel code, which is pretty amazing for such basic libraries.

At any rate, I'm having lots of (not) fun trying to set up a conda environment with OpenBLAS and numpy together; I will report results when I eventually get all this running and tested.

tzeitim commented 2 years ago

I had a lot of the same fun yesterday - you might want to have a look at the environment I made yesterday, where it was 'trivial' (after many attempts). My understanding is that to achieve that one would need to: 1) install OpenBLAS in the same conda environment as numpy, and 2) install numpy via conda (conda-forge channel) while avoiding any pip-driven numpy installation.

Those two conditions gave me the solution (in my system). Thought to myself: Probably time to switch to guix instead of conda?

orenbenkiki commented 2 years ago

Ok. After much wailing and gnashing of teeth I got to this point:

>>> import numpy as np
>>> np.__version__
'1.21.6'
>>> np.show_config()
blas_info:
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/home/obk/anaconda3/envs/openblas/lib']
    include_dirs = ['/home/obk/anaconda3/envs/openblas/include']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/home/obk/anaconda3/envs/openblas/lib']
    include_dirs = ['/home/obk/anaconda3/envs/openblas/include']
    language = c
lapack_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas']
    library_dirs = ['/home/obk/anaconda3/envs/openblas/lib']
    language = f77
lapack_opt_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/home/obk/anaconda3/envs/openblas/lib']
    language = c
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/obk/anaconda3/envs/openblas/include']
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
    not found = AVX512F,AVX512CD,AVX512_KNL,AVX512_KNM,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL

So it claims to be using the lapack libraries which are using blas in them. I think I can replicate how I got here but for now I'm placing it in the repressed traumatic memories bin.

So much for the good news. The bad news is that, well, things are weird (again).

To start with, this doesn't run noticeably faster than my version w/o BLAS. Which either means it isn't really using BLAS, or that the default BLAS-less version of Numpy is pretty well optimized, at least for the simple functions I invoke. Or something...

I'm running on a 48-physical-cores machine (96 logical cores) with a lot of memory (0.5TB). I'm using a reasonably large data set (~350K cells in the original data).

I do see that the OS allocates an unholy number of threads when the compute_for_mcview is running. But, at the same time, I don't see that they are actually used at any point in time. That is, it seems that threadpoolctl.threadpool_limits does successfully restrict the actual used threads? Maybe...

The memory usage is non-trivial, as expected for such a data set - I get up to 60GB when running compute_for_mcview on this data. General point: if you have a machine with a lot of CPUs and not a lot of memory (modern desktops tend to be that way), you should really consider setting METACELLS_PROCESSORS_COUNT (which controls parallelism in general) and/or METACELLS_PARALLEL_PILES (which controls only how many piles we compute in parallel, allowing other things to be more parallel) so that your memory usage will not exceed what the machine has. That has nothing to do with the squared-parallelism issue. Maybe that's the actual problem you ran into? I really hope so...
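
For example (values are illustrative only; the assumption here is that the environment variables are read when the metacells module is loaded, so set them before importing it, or export them in the shell before starting Python/Jupyter):

import os

# Example values - tune to your machine's memory and core count.
os.environ["METACELLS_PROCESSORS_COUNT"] = "8"  # overall process-level parallelism
os.environ["METACELLS_PARALLEL_PILES"] = "4"    # how many piles are computed in parallel

import metacells as mc  # imported only after the environment is set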

At any rate, I tried dynamically setting the OS environment variables (OMP_NUM_THREADS and also MKL_NUM_THREADS just for good measure) to the proper restricted value before invoking get_context("fork").Pool(PROCESSORS_COUNT) and as the first thing in each spawned process, in the hope that the libraries will re-read this in the forked processes.
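
Roughly this pattern (a sketch of the idea only, not the actual MC2 code; PROCESSORS_COUNT and the worker are placeholders):

import os
from multiprocessing import get_context

PROCESSORS_COUNT = 4  # placeholder value


def limit_library_threads():
    # Run as the first thing in each forked worker (and in the parent before
    # forking), hoping the linear-algebra libraries re-read these settings.
    os.environ["OMP_NUM_THREADS"] = "1"
    os.environ["MKL_NUM_THREADS"] = "1"


def work(index):
    return index * index  # stand-in for the real per-pile computation


if __name__ == "__main__":
    limit_library_threads()
    with get_context("fork").Pool(PROCESSORS_COUNT, initializer=limit_library_threads) as pool:
        print(pool.map(work, range(8)))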

That had no discernible effect. That is, I still see the OS allocating an unholy number of threads, I still see that it only actually runs the correct number, and the memory usage hasn't changed.

So... not sure what to tell you at this point. I'll leave the setting of the environment variables in the code for the next version(s), just in case it will do someone good for some implementation of the linear algebra in numpy - belts and suspenders and plate armour and shield and all that - but I can't claim it solved your problem.

tzeitim commented 2 years ago

Thanks for your time, Oren. Luckily the problem for me can be easily circumvented (so far) and perhaps this issue being googleable will save time to some people in the future.

tzeitim commented 1 year ago

Just a quick note that one should keep an eye on pip touching numpy. I had fallen victim to wrong library linkage after upgrading some other packages via pip, which in turn changed numpy.

Sanitizing the numpy installation by forcing a conda-based one fixed the issue, and now numpy points to the right place when invoking something like np.show_config().

orenbenkiki commented 1 year ago

On an unrelated note, I upgraded some R packages in my environment the other day and this seems to have caused Python's numpy to lose track of its proper installation, forcing me to completely uninstall and reinstall it. It seems numpy installations are fragile in general, nothing to do with the metacells package in itself. I'm marking this as closed for now...

tzeitim commented 1 year ago

Hi Oren, I am still struggling with this issue (on v0.9.0-dev.1), so I wanted to explore what its root could be.

I wondered if you could quickly instruct me on how you monitored the OS for allocated but unused threads, as when you said:

I do see that the OS allocates an unholy number of threads when the compute_for_mcview is running. But, at the same time, I don't see that they are actually used at any point in time. That is, it seems that threadpoolctl.threadpool_limits does successfully restrict the actual used threads? Maybe...

orenbenkiki commented 1 year ago

I'm just running top or htop (tree view is especially useful) or even ps, nothing fancy.

sophie-xhonneux commented 1 year ago

I also get this issue running mc.pl.divide_and_conquer_pipeline() on a reasonably beefy machine with 32 cores and 128GB of memory. I tried setting OPENBLAS_NUM_THREADS=1, but no luck - any idea as to what might help?

orenbenkiki commented 1 year ago

Try to reduce max parallel piles.

sophie-xhonneux commented 1 year ago

I still get the error even with max parallel piles being less than the number of cores.

orenbenkiki commented 1 year ago

There’s a guess max parallel piles function that tries to estimate a value for you. It isn’t guaranteed to work because the dependency on the dataset isn’t straightforward. If that guess is too high, manually reduce the parallelism until it works. 1 “should” always work but of course will be slow.

What are the sizes of your data? Are you using a sparse matrix to hold the UMIs (typically you should as there are many zeros in it)?
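
For concreteness, the knobs discussed here look roughly like this - the function names are from memory and should be treated as assumptions (check your installed version's docs), and the input path is hypothetical:

import anndata as ad
import metacells as mc

cells = ad.read_h5ad("cells.clean.h5ad")  # hypothetical path to your cleaned cells

# Assumed names: mc.pl.guess_max_parallel_piles / mc.ut.set_max_parallel_piles.
max_piles = mc.pl.guess_max_parallel_piles(cells)
print(max_piles)

# If the guess is too optimistic for your memory, step it down manually.
mc.ut.set_max_parallel_piles(8)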

sophie-xhonneux commented 1 year ago

The guess max parallel piles function guesses something like 2093, so I set it manually to 30 (with 32 cpu cores) and am now testing with 10. If it still fails, I guess I can try 1, but I fear it may be too slow given the size of the dataset.

The dataset is quite large at about 7.9 million cells and indeed the data is held in a sparse matrix.

orenbenkiki commented 1 year ago

Hmmm, it is surprising that you get such a high guess. At any rate it won’t use more than the number of cores. Try lower, 16, 8, …

7.9 million is a lot. AnnData doesn't do memory mapping, so it reads it all into memory for no good reason (this is also very slow to load). We are working on better solutions for this, but it will be several months at least before they are operational.

sophie-xhonneux commented 1 year ago

Yeah, the dataset size is quite large, hence I was quite glad when I found this library! Thank you for the hard work!

I now get a different error. Any idea what might be the cause? It's not super easy for me to deduce without going through the library line by line:

Traceback (most recent call last):
  File "/home/mila/s/sophie.xhonneux/.conda/envs/biomultiomics/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/mila/s/sophie.xhonneux/.conda/envs/biomultiomics/lib/python3.10/site-packages/metacells/utilities/parallel.py", line 246, in _invocation
    return index, PARALLEL_FUNCTION(index)
  File "/home/mila/s/sophie.xhonneux/.conda/envs/biomultiomics/lib/python3.10/site-packages/metacells/pipeline/divide_and_conquer.py", line 1800, in compute_pile_metacells
    results = _compute_pile_metacells(pile_index)
  File "/home/mila/s/sophie.xhonneux/.conda/envs/biomultiomics/lib/python3.10/site-packages/metacells/pipeline/divide_and_conquer.py", line 1749, in _compute_pile_metacells
    compute_direct_metacells(
  File "/home/mila/s/sophie.xhonneux/.conda/envs/biomultiomics/lib/python3.10/site-packages/metacells/utilities/logging.py", line 373, in wrapper
    return function(*args, **kwargs)
  File "/home/mila/s/sophie.xhonneux/.conda/envs/biomultiomics/lib/python3.10/site-packages/metacells/pipeline/direct.py", line 348, in compute_direct_metacells
    tl.compute_obs_obs_knn_graph(
  File "/home/mila/s/sophie.xhonneux/.conda/envs/biomultiomics/lib/python3.10/site-packages/metacells/utilities/logging.py", line 373, in wrapper
    return function(*args, **kwargs)
  File "/home/mila/s/sophie.xhonneux/.conda/envs/biomultiomics/lib/python3.10/site-packages/metacells/tools/knn_graph.py", line 98, in compute_obs_obs_knn_graph
    return _compute_elements_knn_graph(
  File "/home/mila/s/sophie.xhonneux/.conda/envs/biomultiomics/lib/python3.10/site-packages/metacells/tools/knn_graph.py", line 244, in _compute_elements_knn_graph
    outgoing_ranks = _rank_outgoing(similarity)
  File "/home/mila/s/sophie.xhonneux/.conda/envs/biomultiomics/lib/python3.10/site-packages/metacells/tools/knn_graph.py", line 278, in _rank_outgoing
    assert np.sum(np.diagonal(outgoing_ranks) == size) == size
AssertionError

orenbenkiki commented 1 year ago

Ok, that "shouldn't happen", unless the data is pathological in some sense.

Are you using version 0.8 (the latest published)? If so, it would be better if you tried the head version from github (which is now in "really just about to be released" as 0.9).

Also, if you could increase the log level to DEBUG (by mc.ut.get_logger().setLevel(logging.DEBUG)), this will generate a lot of logging messages, which may help pinpoint the problem.
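
Spelled out, that is:

import logging

import metacells as mc

# Raise the metacells logger to DEBUG before running the pipeline.
mc.ut.get_logger().setLevel(logging.DEBUG)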

sophie-xhonneux commented 1 year ago

Hello, I set the max parallel piles to 1 and upgraded to version 0.9, and I still get a system overload after processing about 45% of the data. I limited the number of threads in every way I could conceive before importing any other library:

os.environ["OMP_NUM_THREADS"] = "1" # export OMP_NUM_THREADS=1 os.environ["OPENBLAS_NUM_THREADS"] = "1" # export OPENBLAS_NUM_THREADS=1 os.environ["MKL_NUM_THREADS"] = "1" # export MKL_NUM_THREADS=1 os.environ["VECLIB_MAXIMUM_THREADS"] = "1" # export VECLIB_MAXIMUM_THREADS=1 os.environ["NUMEXPR_NUM_THREADS"] = "1" # export NUMEXPR_NUM_THREADS=1

I am out of ideas. Do you have any idea?

tzeitim commented 1 year ago

@lpxhonneux What works for me is:

import metacells as mc

cpus = 1 # I usually hover between 4-8

mc.utilities.parallel.set_processors_count(cpus)

sophie-xhonneux commented 1 year ago

I tried that, but it becomes unacceptably slow for the size of my dataset (7.8 million cells); also, not using 63 available CPUs feels terrible.

orenbenkiki commented 1 year ago

You can play independently with the number of parallel piles (which affects memory usage) and the number of processors (which, in theory, shouldn't). Of course using fewer parallel piles uses the processors "less", but it still (should) use them. Also, if you have 32 physical CPUs and 64 logical processors, you should only use 32 parallel processes, as hyper-threading doesn't gain you anything (in fact it actually hurts performance). So using very few parallel piles (1-4, say) together with a high processor count (32) "should" work.