rmitsch opened this issue 6 years ago (status: Open)
I agree that this will be useful. I'll have to check into how to actually make this an option -- ideally some sort of package level configuration option?
On Tue, Jul 17, 2018 at 11:59 AM Raphael Mitsch wrote:

> Since I'm interested in comparing multiple UMAP results, I tried to run one UMAP instance per thread, which led to all of them being deadlocked. After some digging I found that changing the Numba decorators in `umap_.py` from `@numba.njit(parallel=True, fastmath=True)` to `@numba.njit(parallel=False, fastmath=True)`, i.e. disabling the Numba parallelism, resolved the deadlock. Perhaps this is related to some open Numba thread coordination/deadlock issues (see e.g. numba/numba#2804, https://github.com/numba/numba/issues/2804).
>
> A configuration option to disable Numba parallelism would be nice, since it lets the user decide whether to let Numba parallelize, to parallelize manually (e.g. running one UMAP model per thread), or not to parallelize at all.
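A minimal sketch of the pattern described in the report above, in pure Python: `fit_one` is a hypothetical stand-in for `umap.UMAP(...).fit_transform`, since the actual deadlock lives inside Numba's parallel runtime and can't be reproduced without it.

```python
import threading

results = {}

def fit_one(name, dataset):
    # Hypothetical stand-in for umap.UMAP(...).fit_transform(dataset);
    # the real call runs Numba kernels compiled with parallel=True,
    # which is where the reported deadlock occurred.
    results[name] = sorted(dataset)

# One "UMAP instance" per thread, as in the report:
threads = [
    threading.Thread(target=fit_one, args=(i, data))
    for i, data in enumerate([[3, 1, 2], [6, 5, 4]])
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results.items()))  # [(0, [1, 2, 3]), (1, [4, 5, 6])]
```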
I'd consider an instance-level/constructor parameter more useful, like sklearn does with its `n_jobs` parameter, e.g. for `RandomForestClassifier`. I'm not familiar enough with Numba, though, to know whether the annotation framework is flexible enough for that.
On a side note: stuff like https://stackoverflow.com/questions/46009368/usage-of-parallel-option-in-numba-jit-decoratior-makes-function-give-wrong-resul makes me question how reliable Numba's automatic parallelism is. But as mentioned, I don't have a lot of Numba experience, so I can't really assess that.
That would definitely be preferable -- I'm not sure whether it is technically feasible. The package-level option is more likely to be possible. I'll look into what can actually be done.
https://github.com/numba/numba/pull/3202 is a WIP to address thread safety in the Numba parallel backend.
Thanks @stuartarchibald! Looking forward to the results of that.
@lmcinnes Numba 0.40.0 RC1 was published last week if you want to give it a try: https://groups.google.com/a/continuum.io/forum/#!topic/numba-users/pYAd-kT1mDM. The official release will be published shortly.
Thanks @stuartarchibald! I had been playing with some of its options, but mostly just little experiments. I haven't tried the thread safety yet, but I should. Thanks for the reminder.
@lmcinnes great, thanks for trying it out. The relevant docs from the dev builds are rendered here: http://numba.pydata.org/numba-doc/dev/user/threading-layer.html, these should provide some guidance on use. Any questions/problems, core devs are on gitter or feel free to open issues etc. Thanks again.
Running `pip install tbb` can help for thread safety. Or set https://numba.pydata.org/numba-doc/dev/reference/envvars.html#envvar-NUMBA_DISABLE_JIT to disable JIT compilation entirely.
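In script form, that configuration has to happen before Numba is imported anywhere in the process. A minimal sketch (variable names are from the Numba envvars docs linked above):

```python
import os

# Must run before `import numba` (or any module that imports it):
os.environ["NUMBA_THREADING_LAYER"] = "tbb"  # requires `pip install tbb`

# Heavier hammer, useful mainly for debugging: skip JIT entirely.
# os.environ["NUMBA_DISABLE_JIT"] = "1"

print(os.environ["NUMBA_THREADING_LAYER"])  # tbb
```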
@rmitsch Let me know if this helps you and if I can close the issue.
This would be a really useful feature. Did anyone find a natural way of parametrizing the parallel decorator? We could write our own decorator that calls the decorated or undecorated function based on an option in the constructor as suggested by @rmitsch.
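One way to sketch that idea: wrap the function twice up front and dispatch on a flag that an estimator's constructor stores, like sklearn's `n_jobs`. The `dummy_njit` and `Estimator` names below are hypothetical stand-ins so the sketch runs without Numba installed; with Numba, `dummy_njit` would be `numba.njit` itself.

```python
def dummy_njit(parallel=False, fastmath=False):
    """Stand-in for numba.njit; just returns the function unchanged."""
    def decorator(fn):
        return fn
    return decorator

def parallel_option(fn):
    """Build both compiled variants and dispatch on a boolean flag."""
    variants = {
        True: dummy_njit(parallel=True, fastmath=True)(fn),
        False: dummy_njit(parallel=False, fastmath=True)(fn),
    }
    def dispatch(use_parallel, *args):
        return variants[use_parallel](*args)
    return dispatch

@parallel_option
def dot(xs, ys):
    return sum(x * y for x, y in zip(xs, ys))

class Estimator:
    """Hypothetical estimator exposing the flag via its constructor."""
    def __init__(self, parallel=True):
        self.parallel = parallel

    def score(self, xs, ys):
        return dot(self.parallel, xs, ys)

print(Estimator(parallel=False).score([1.0, 2.0], [3.0, 4.0]))  # 11.0
```

With real Numba both variants get compiled at import time, so the per-call cost of the dispatch is just one dictionary lookup.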
> Running `pip install tbb` can help for thread safety. Or setting https://numba.pydata.org/numba-doc/dev/reference/envvars.html#envvar-NUMBA_DISABLE_JIT to disable jit entirely.
Just for reference there's extensive documentation on selecting an appropriate threading layer for your application here: http://numba.pydata.org/numba-doc/latest/user/threading-layer.html#selecting-a-threading-layer-for-safe-parallel-execution
If in doubt use TBB as it's safe in all parallel paradigms across all platforms.
Thanks @stuartarchibald. I'm trying to fit multiple UMAP objects using `joblib.Parallel` and the following env vars:

    NUMBA_NUM_THREADS=150
    NUMBA_THREADING_LAYER=tbb

I'm getting errors using both `prefer='threads'` and `prefer='processes'` with `joblib.Parallel` (below). I also noticed that with `'threads'` it seems that only one core is being used, but maybe that's an issue with `joblib`? I have `tbb==2020.0.133` installed.

Traceback from `joblib.Parallel(prefer='threads')`:

**EDIT**: I just noticed that this was because of a typo (I used `'tbb'` instead of `tbb`). I can use `prefer='threads'` without errors, but I still see that only one core is used and I get the following warning:

Traceback from `joblib.Parallel(prefer='processes')`:

Pinging @lmcinnes in case this helps polish up the various issues with parallelization.
> Thanks @stuartarchibald. I'm trying to fit multiple UMAP objects using `joblib.Parallel` and the following env vars:

Thanks for this.

> `NUMBA_NUM_THREADS=150` `NUMBA_THREADING_LAYER=tbb`

^ that's a lot of threads, have you got hardware with 150 physical cores?

> I'm getting errors using both `prefer='threads'` and `prefer='processes'` with `joblib.Parallel` (below). I also noticed that with `'threads'` it seems that only one core is being used, but maybe that's an issue with `joblib`? I have `tbb==2020.0.133` installed. Traceback from `joblib.Parallel(prefer='threads')`:
>
> EDIT: I just noticed that this was because of a typo (I used `'tbb'` instead of `tbb`). I can use `prefer='threads'` without errors, but I still see that only one core is used and I get the following warning:

I also noticed that, before the **EDIT**, this failed somewhat ungracefully: a `LoweringError` when the problem was a `ValueError` for `'tbb'` being invalid? Will try and replicate + fix.
    /home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/typed_passes.py:271: NumbaPerformanceWarning: The keyword argument 'parallel=True' was specified but no transformation for parallel execution was possible. To find out why, try turning on parallel diagnostics, see http://numba.pydata.org/numba-doc/latest/user/parallel.html#diagnostics for help.
    File "../../miniconda3/envs/openscope/lib/python3.7/site-packages/umap/nndescent.py", line 47:
    @numba.njit(parallel=True)
    def nn_descent(
    ^
    state.func_ir.loc))
The above suggests that no parallel transform was made, which may be why you only see 1 core in use?
> Traceback from `joblib.Parallel(prefer='processes')`

The `<details>` of the above suggest that `_launch_threads` failed badly, which I think is fixed in 0.47.0 (the context manager was basically broken on some platforms); the fix was in https://github.com/numba/numba/pull/4755/commits/15e75688362e97352cd2102e1f880a8f77dd7471 and was merged as part of https://github.com/numba/numba/pull/4755. Hopefully, if you update to 0.47.0, it'll permit the actual error message/problem to be discovered.
> ^ that's a lot of threads, have you got hardware with 150 physical cores?
The hardware I'm using has 160 logical CPUs: 4 sockets of 20 physical cores, each capable of running 2 threads. Should I limit numba's parallelism to just the 80 physical cores?
> I also noticed that, before the EDIT, this failed somewhat ungracefully: a `LoweringError` when the problem was a `ValueError` for `'tbb'` being invalid? Will try and replicate + fix.
Yes, it didn't fail gracefully; it looked like the error was the same as in the `prefer='processes'` case.
> Hopefully, if you update to 0.47.0 it'll permit the actual error message/problem to be discovered.
Thanks, I'll give this a shot!
> ^ that's a lot of threads, have you got hardware with 150 physical cores?
>
> The hardware I'm using has 160 logical CPUs: 4 sockets of 20 physical cores, each capable of running 2 threads. Should I limit numba's parallelism to just the 80 physical cores?
I'm not sure what the performance profile of your code is, but it's probably worth developing a performance testing harness that stresses a real-world case and then incrementally moving the thread count nearer to physical core count and seeing what happens.
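A minimal harness along those lines (sketch only: `work` is a hypothetical stand-in for one real fit, and in practice `NUMBA_NUM_THREADS` would have to be set in the environment before each run, since Numba reads it at import time):

```python
import time

def best_of(fn, repeats=3):
    """Best-of-N wall-clock time for one configuration, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

def work():
    # Stand-in for a real UMAP fit under the current thread count.
    return sum(i * i for i in range(100_000))

# Sweep thread counts down toward the physical core count:
for n_threads in (160, 120, 80, 40):
    # In a real harness, relaunch the process with
    # NUMBA_NUM_THREADS=n_threads set before this measurement.
    print(n_threads, f"{best_of(work):.4f}s")
```

Best-of-N is used rather than the mean so that one-off interference from other processes doesn't skew the comparison.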
> I also noticed that, before the EDIT, this failed somewhat ungracefully: a `LoweringError` when the problem was a `ValueError` for `'tbb'` being invalid? Will try and replicate + fix.
>
> Yes, it didn't fail gracefully; it looked like the error was the same as in the `prefer='processes'` case.
>
> Hopefully, if you update to 0.47.0 it'll permit the actual error message/problem to be discovered.
>
> Thanks, I'll give this a shot!
Great, thanks, let us know how you get on!
Is there any update on this? I'm running UMAP in a loop for hundreds of different datasets and for some reason it is not using more than one thread. It'd be great if I could disable Numba parallelism and use joblib instead.
@apcamargo Using `numba.config.THREADING_LAYER = 'tbb'` pretty much solved the problem for me. Same use case of generating 10^2 to 10^3 UMAP embeddings in parallel.
Thank you @rmitsch! I've tested your suggestion but the execution times remained more or less the same. In this case I think it's safer not to use joblib, because I don't want to add an additional dependency (`tbb`) and I suspect it is Intel-only. I'll try to figure out why the default parallel implementation is not working for me.
@apcamargo `tbb` is available on all platforms.
Didn't know that!
What I did was set `NUMBA_NUM_THREADS` to `1` and `THREADING_LAYER` to `'tbb'`, and put the loop inside a joblib helper class. For some reason I didn't observe any speed improvements.
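For concreteness, the shape of that setup looks roughly like this; `ThreadPoolExecutor` stands in for the joblib helper class and `embed_one` for a single-threaded UMAP fit, both hypothetical here:

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Keep Numba single-threaded so the outer pool supplies all the
# parallelism; these must be set before numba is first imported.
os.environ["NUMBA_NUM_THREADS"] = "1"
os.environ["NUMBA_THREADING_LAYER"] = "tbb"

def embed_one(dataset):
    # Hypothetical stand-in for umap.UMAP(...).fit_transform(dataset).
    return [x + 1 for x in dataset]

datasets = [[0, 1], [2, 3], [4, 5]]
with ThreadPoolExecutor(max_workers=3) as pool:
    # map preserves input order, so embeddings line up with datasets.
    embeddings = list(pool.map(embed_one, datasets))

print(embeddings)  # [[1, 2], [3, 4], [5, 6]]
```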
Currently, if you don't have pynndescent installed and fix a random seed, you should avoid all of the Numba-induced parallelism.
> Currently, if you don't have pynndescent installed and fix a random seed, you should avoid all of the Numba-induced parallelism.
I’m using a fixed seed. It’s strange that joblib is not improving the speed at all.
It could be that the cost of moving memory around is more than the CPU cost for some of the operations. I am honestly not sure. As others noted, `tbb` and `NUMBA_NUM_THREADS=1` should also make things work.
I tried it on a cluster with 64 threads and the execution time went from 102 min to 75 min. I expected a larger improvement in speed, but I guess that's better than nothing. @rmitsch is this more or less the same kind of improvement you observed?
Thank you all for the help!