lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

Feature request: Disable Numba parallelism #86

Open rmitsch opened 6 years ago

rmitsch commented 6 years ago

Since I'm interested in comparing multiple UMAP results, I tried to run one UMAP instance per thread - which led to all of them deadlocking. After some digging I found that changing the Numba decorators in umap_.py from @numba.njit(parallel=True, fastmath=True) to @numba.njit(parallel=False, fastmath=True), i.e. disabling Numba's parallelism, resolved the deadlock. Perhaps this is related to some open Numba thread coordination/deadlock issues (see e.g. https://github.com/numba/numba/issues/2804).

A configuration option to disable Numba parallelism would be nice, since it would let users decide whether to parallelize with Numba, to do it manually (e.g. by running one UMAP model per thread), or not at all.

lmcinnes commented 6 years ago

I agree that this would be useful. I'll have to look into how to actually make this an option -- ideally some sort of package-level configuration option?

rmitsch commented 6 years ago

I'd consider an instance-level/constructor parameter more useful - like sklearn does with its n_jobs parameter, e.g. for RandomForestClassifier. I'm not familiar enough with Numba, though, to know whether the annotation framework is flexible enough for that.

On a side note: issues like https://stackoverflow.com/questions/46009368/usage-of-parallel-option-in-numba-jit-decoratior-makes-function-give-wrong-resul make me question how reliable Numba's automatic parallelism is. But as already mentioned, I don't have a lot of Numba experience, so I can't really assess that.

lmcinnes commented 6 years ago

That would definitely be preferable -- I'm just not sure whether it is technically feasible. The package-level option is more likely to be possible. I'll look into what can actually be done.

stuartarchibald commented 6 years ago

https://github.com/numba/numba/pull/3202 is a WIP to address thread safety in the Numba parallel backend.

lmcinnes commented 6 years ago

Thanks @stuartarchibald! Looking forward to the results of that.

stuartarchibald commented 6 years ago

@lmcinnes Numba 0.40.0 RC1 was published last week if you want to give it a try (announcement: https://groups.google.com/a/continuum.io/forum/#!topic/numba-users/pYAd-kT1mDM). The official release will be published shortly.

lmcinnes commented 6 years ago

Thanks @stuartarchibald! I had been playing with some of its options, but mostly just in little experiments. I haven't tried the thread-safety features yet, but I should. Thanks for the reminder.

stuartarchibald commented 6 years ago

@lmcinnes great, thanks for trying it out. The relevant docs from the dev builds are rendered here: http://numba.pydata.org/numba-doc/dev/user/threading-layer.html; these should provide some guidance on use. For any questions/problems, the core devs are on Gitter, or feel free to open issues. Thanks again.

sleighsoft commented 5 years ago

Running pip install tbb can help with thread safety. Alternatively, setting NUMBA_DISABLE_JIT (https://numba.pydata.org/numba-doc/dev/reference/envvars.html#envvar-NUMBA_DISABLE_JIT) disables JIT compilation entirely.
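
For instance, a minimal sketch (assuming the env var is picked up before Numba is first imported):

```
# Disable Numba JIT entirely; NUMBA_DISABLE_JIT must be set before
# numba (and hence umap) is first imported.
import os

os.environ["NUMBA_DISABLE_JIT"] = "1"

import numpy as np
import umap  # imported only after the env var is set

embedding = umap.UMAP().fit_transform(np.random.rand(200, 10))
```

Note this trades away all of Numba's speedups, so it's mostly useful for debugging.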

@rmitsch Let me know if this helps you and if I can close the issue.

wmayner commented 4 years ago

This would be a really useful feature. Did anyone find a natural way of parametrizing the parallel decorator? We could write our own decorator that calls the decorated or undecorated function based on an option in the constructor as suggested by @rmitsch.
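
For example, a rough sketch of such a wrapper (hypothetical; configurable_njit is not part of UMAP or Numba):

```
import numba

def configurable_njit(parallel=False, **numba_kwargs):
    # Thin wrapper so the parallel flag can be driven by a configuration
    # option instead of being hard-coded at each decoration site.
    def decorator(func):
        return numba.njit(parallel=parallel, **numba_kwargs)(func)
    return decorator

@configurable_njit(parallel=False, fastmath=True)
def vector_sum(x):
    total = 0.0
    for i in range(x.shape[0]):
        total += x[i]
    return total
```

The catch is that decorators are applied at import time, so a per-instance option would mean compiling (and caching) both variants of each kernel; that's probably why a package-level switch is the easier target.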

stuartarchibald commented 4 years ago

Running pip install tbb can help with thread safety. Alternatively, setting NUMBA_DISABLE_JIT (https://numba.pydata.org/numba-doc/dev/reference/envvars.html#envvar-NUMBA_DISABLE_JIT) disables JIT compilation entirely.

Just for reference, there's extensive documentation on selecting an appropriate threading layer for your application here: http://numba.pydata.org/numba-doc/latest/user/threading-layer.html#selecting-a-threading-layer-for-safe-parallel-execution

If in doubt, use TBB, as it's safe in all parallel paradigms across all platforms.

wmayner commented 4 years ago

Thanks @stuartarchibald. I'm trying to fit multiple UMAP objects using joblib.Parallel and the following env vars:

NUMBA_NUM_THREADS=150
NUMBA_THREADING_LAYER=tbb

I'm getting errors using both prefer='threads' and prefer='processes' with joblib.Parallel (below). I also noticed that with 'threads' it seems that only one core is being used, but maybe that's an issue with joblib?

I have tbb==2020.0.133 installed.
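
For reference, a condensed sketch of the setup (placeholder data; fit_one stands in for my actual worker function):

```
import numpy as np
import umap
from joblib import Parallel, delayed

def fit_one(data):
    return umap.UMAP().fit(data)

datasets = [np.random.rand(500, 20) for _ in range(8)]  # placeholder data
models = Parallel(n_jobs=8, prefer="threads")(
    delayed(fit_one)(d) for d in datasets
)
```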

Traceback from joblib.Parallel(prefer='threads')

EDIT: I just noticed that this was because of a typo (I set the threading layer to the literal string 'tbb', quotes included, instead of tbb). I can use prefer='threads' without errors, but I still see that only one core is used, and I get the following warning:

```
/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/typed_passes.py:271: NumbaPerformanceWarning: 
The keyword argument 'parallel=True' was specified but no transformation for parallel execution was possible.

To find out why, try turning on parallel diagnostics, see http://numba.pydata.org/numba-doc/latest/user/parallel.html#diagnostics for help.

File "../../miniconda3/envs/openscope/lib/python3.7/site-packages/umap/nndescent.py", line 47:
    @numba.njit(parallel=True)
    def nn_descent(
    ^

  state.func_ir.loc))
/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/typed_passes.py:271: NumbaPerformanceWarning: 
The keyword argument 'parallel=True' was specified but no transformation for parallel execution was possible.

To find out why, try turning on parallel diagnostics, see http://numba.pydata.org/numba-doc/latest/user/parallel.html#diagnostics for help.

File "../../miniconda3/envs/openscope/lib/python3.7/site-packages/umap/nndescent.py", line 47:
    @numba.njit(parallel=True)
    def nn_descent(
    ^

  state.func_ir.loc))
```

Traceback from joblib.Parallel(prefer='processes')

```
---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/errors.py", line 717, in new_error_context
    yield
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/lowering.py", line 260, in lower_block
    self.lower_inst(inst)
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/lowering.py", line 414, in lower_inst
    func(self, inst)
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/npyufunc/parfor.py", line 283, in _lower_parfor_parallel
    parfor.races)
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/npyufunc/parfor.py", line 1196, in call_parallel_gufunc
    _launch_threads()
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/npyufunc/parallel.py", line 317, in _launch_threads
    with _backend_init_process_lock:
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/contextlib.py", line 110, in __enter__
    del self.args, self.kwds, self.func
AttributeError: args

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 418, in _process_worker
    r = call_item()
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 272, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 567, in __call__
    return self.func(*args, **kwargs)
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/joblib/parallel.py", line 225, in __call__
    for func, args, kwargs in self.items]
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/joblib/parallel.py", line 225, in <listcomp>
    for func, args, kwargs in self.items]
  File "<ipython-input>", line 15, in worker
  File "/home/wmayner/projects/openscope-production-analysis/differentiation_analysis/clustering.py", line 412, in perform_clustering
    cluster_embedding, cluster_embedder = embed(events, **params["umap"])
  File "/home/wmayner/projects/openscope-production-analysis/differentiation_analysis/clustering.py", line 376, in embed
    fitted = umap.UMAP(**params).fit(data, y=y)
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/umap/umap_.py", line 1443, in fit
    self.verbose,
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/umap/umap_.py", line 478, in fuzzy_simplicial_set
    knn_indices, knn_dists, sigmas, rhos
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/dispatcher.py", line 420, in _compile_for_args
    raise e
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/dispatcher.py", line 353, in _compile_for_args
    return self.compile(tuple(argtypes))
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/compiler_lock.py", line 32, in _acquire_compile_lock
    return func(*args, **kwargs)
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/dispatcher.py", line 768, in compile
    cres = self._compiler.compile(args, return_type)
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/dispatcher.py", line 77, in compile
    status, retval = self._compile_cached(args, return_type)
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/dispatcher.py", line 91, in _compile_cached
    retval = self._compile_core(args, return_type)
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/dispatcher.py", line 109, in _compile_core
    pipeline_class=self.pipeline_class)
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/compiler.py", line 528, in compile_extra
    return pipeline.compile_extra(func)
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/compiler.py", line 326, in compile_extra
    return self._compile_bytecode()
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/compiler.py", line 385, in _compile_bytecode
    return self._compile_core()
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/compiler.py", line 365, in _compile_core
    raise e
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/compiler.py", line 356, in _compile_core
    pm.run(self.state)
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/compiler_machinery.py", line 328, in run
    raise patched_exception
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/compiler_machinery.py", line 319, in run
    self._runPass(idx, pass_inst, state)
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/compiler_lock.py", line 32, in _acquire_compile_lock
    return func(*args, **kwargs)
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/compiler_machinery.py", line 281, in _runPass
    mutated |= check(pss.run_pass, internal_state)
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/compiler_machinery.py", line 268, in check
    mangled = func(compiler_state)
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/typed_passes.py", line 380, in run_pass
    NativeLowering().run_pass(state)  # TODO: Pull this out into the pipeline
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/typed_passes.py", line 325, in run_pass
    lower.lower()
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/lowering.py", line 179, in lower
    self.lower_normal_function(self.fndesc)
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/lowering.py", line 220, in lower_normal_function
    entry_block_tail = self.lower_function_body()
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/lowering.py", line 245, in lower_function_body
    self.lower_block(block)
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/lowering.py", line 260, in lower_block
    self.lower_inst(inst)
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/errors.py", line 725, in new_error_context
    six.reraise(type(newerr), newerr, tb)
  File "/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/six.py", line 669, in reraise
    raise value
numba.errors.LoweringError: Failed in nopython mode pipeline (step: nopython mode backend)
args

File "../../miniconda3/envs/openscope/lib/python3.7/site-packages/umap/umap_.py", line 331:
def compute_membership_strengths(knn_indices, knn_dists, sigmas, rhos):
    rows = np.zeros((n_samples * n_neighbors), dtype=np.int64)
    cols = np.zeros((n_samples * n_neighbors), dtype=np.int64)
    ^

[1] During: lowering "id=1[LoopNest(index_variable = parfor_index.294, range = (0, $0.22, 1))]{281: }Var(parfor_index.294, /home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/umap/umap_.py (331))" at /home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/umap/umap_.py (331)
-------------------------------------------------------------------------------
This should not have happened, a problem has occurred in Numba's internals.

You are currently using Numba version 0.46.0.

Please report the error message and traceback, along with a minimal reproducer at: https://github.com/numba/numba/issues/new

If more help is needed please feel free to speak to the Numba core developers directly at: https://gitter.im/numba/numba

Thanks in advance for your help in improving Numba!
"""

The above exception was the direct cause of the following exception:

LoweringError                             Traceback (most recent call last)
<ipython-input> in <module>
     19 # results = list(starmap(worker, tqdm(args)))
     20 results = Parallel(n_jobs=len(args))(
---> 21     delayed(worker)(session, df) for session, df in args
     22 )

~/miniconda3/envs/openscope/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
    932 
    933         with self._backend.retrieval_context():
--> 934             self.retrieve()
    935         # Make sure that we get a last message telling us we are done
    936         elapsed_time = time.time() - self._start_time

~/miniconda3/envs/openscope/lib/python3.7/site-packages/joblib/parallel.py in retrieve(self)
    831         try:
    832             if getattr(self._backend, 'supports_timeout', False):
--> 833                 self._output.extend(job.get(timeout=self.timeout))
    834             else:
    835                 self._output.extend(job.get())

~/miniconda3/envs/openscope/lib/python3.7/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
    519         AsyncResults.get from multiprocessing."""
    520         try:
--> 521             return future.result(timeout=timeout)
    522         except LokyTimeoutError:
    523             raise TimeoutError()

~/miniconda3/envs/openscope/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
    433                 raise CancelledError()
    434             elif self._state == FINISHED:
--> 435                 return self.__get_result()
    436             else:
    437                 raise TimeoutError()

~/miniconda3/envs/openscope/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

LoweringError: Failed in nopython mode pipeline (step: nopython mode backend)
args

File "../../miniconda3/envs/openscope/lib/python3.7/site-packages/umap/umap_.py", line 331:
def compute_membership_strengths(knn_indices, knn_dists, sigmas, rhos):
    rows = np.zeros((n_samples * n_neighbors), dtype=np.int64)
    cols = np.zeros((n_samples * n_neighbors), dtype=np.int64)
    ^

[1] During: lowering "id=1[LoopNest(index_variable = parfor_index.294, range = (0, $0.22, 1))]{281: }Var(parfor_index.294, /home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/umap/umap_.py (331))" at /home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/umap/umap_.py (331)
-------------------------------------------------------------------------------
This should not have happened, a problem has occurred in Numba's internals.

You are currently using Numba version 0.46.0.

Please report the error message and traceback, along with a minimal reproducer at: https://github.com/numba/numba/issues/new

If more help is needed please feel free to speak to the Numba core developers directly at: https://gitter.im/numba/numba

Thanks in advance for your help in improving Numba!
```

Pinging @lmcinnes in case this helps polish up the various issues with parallelization.

stuartarchibald commented 4 years ago

Thanks @stuartarchibald. I'm trying to fit multiple UMAP objects using joblib.Parallel and the following env vars:

Thanks for this.

NUMBA_NUM_THREADS=150
NUMBA_THREADING_LAYER=tbb

^ that's a lot of threads, have you got hardware with 150 physical cores?

I'm getting errors using both prefer='threads' and prefer='processes' with joblib.Parallel (below). I also noticed that with 'threads' it seems that only one core is being used, but maybe that's an issue with joblib?

I have tbb==2020.0.133 installed.

Traceback from joblib.Parallel(prefer='threads')

EDIT: I just noticed that this was because of a typo (I set the threading layer to the literal string 'tbb', quotes included, instead of tbb). I can use prefer='threads' without errors, but I still see that only one core is used, and I get the following warning:

I also noticed that, before the **EDIT**, this failed somewhat ungracefully - a LoweringError when the problem was actually a ValueError for 'tbb' being invalid? I'll try to replicate and fix.

/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/typed_passes.py:271: NumbaPerformanceWarning: 
The keyword argument 'parallel=True' was specified but no transformation for parallel execution was possible.

To find out why, try turning on parallel diagnostics, see http://numba.pydata.org/numba-doc/latest/user/parallel.html#diagnostics for help.

File "../../miniconda3/envs/openscope/lib/python3.7/site-packages/umap/nndescent.py", line 47:
    @numba.njit(parallel=True)
    def nn_descent(
    ^

  state.func_ir.loc))
/home/wmayner/miniconda3/envs/openscope/lib/python3.7/site-packages/numba/typed_passes.py:271: NumbaPerformanceWarning: 
The keyword argument 'parallel=True' was specified but no transformation for parallel execution was possible.

To find out why, try turning on parallel diagnostics, see http://numba.pydata.org/numba-doc/latest/user/parallel.html#diagnostics for help.

File "../../miniconda3/envs/openscope/lib/python3.7/site-packages/umap/nndescent.py", line 47:
    @numba.njit(parallel=True)
    def nn_descent(
    ^

  state.func_ir.loc))

The above suggests that no parallel transform was made, which may be why you only see 1 core in use?

Traceback from joblib.Parallel(prefer='processes')

The <details> of the above suggest that _launch_threads failed badly, which I think is fixed in 0.47.0 (the context manager was basically broken on some platforms); the fix was https://github.com/numba/numba/pull/4755/commits/15e75688362e97352cd2102e1f880a8f77dd7471, merged as part of https://github.com/numba/numba/pull/4755. Hopefully, if you update to 0.47.0 it'll permit the actual error message/problem to be discovered.

Pinging @lmcinnes in case this helps polish up the various issues with parallelization.

wmayner commented 4 years ago

^ that's a lot of threads, have you got hardware with 150 physical cores?

The hardware I'm using has 160 logical CPUs: 4 sockets of 20 physical cores, each capable of running 2 threads. Should I limit Numba's parallelism to just the 80 physical cores?

I also noticed that, before the EDIT, this failed somewhat ungracefully - a LoweringError when the problem was actually a ValueError for 'tbb' being invalid? I'll try to replicate and fix.

Yes, it didn't fail gracefully; it looked like the error was the same as in the prefer='processes' case.

Hopefully, if you update to 0.47.0 it'll permit the actual error message/problem to be discovered.

Thanks, I'll give this a shot!

stuartarchibald commented 4 years ago

^ that's a lot of threads, have you got hardware with 150 physical cores?

The hardware I'm using has 160 logical CPUs: 4 sockets of 20 physical cores, each capable of running 2 threads. Should I limit Numba's parallelism to just the 80 physical cores?

I'm not sure what the performance profile of your code is, but it's probably worth developing a performance-testing harness that stresses a real-world case, then incrementally moving the thread count nearer to the physical core count and seeing what happens.
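
As a minimal sketch of such a harness (placeholder data; run it as, say, NUMBA_NUM_THREADS=80 python bench.py, since the thread count is fixed once the threading layer initializes):

```
import time

import numpy as np
import umap

# Stand-in for a representative real-world dataset.
data = np.random.rand(5000, 50)

start = time.perf_counter()
umap.UMAP().fit(data)
print(f"fit took {time.perf_counter() - start:.1f} s")
```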

I also noticed that, before the EDIT, this failed somewhat ungracefully - a LoweringError when the problem was actually a ValueError for 'tbb' being invalid? I'll try to replicate and fix.

Yes, it didn't fail gracefully; it looked like the error was the same as in the prefer='processes' case.

Hopefully, if you update to 0.47.0 it'll permit the actual error message/problem to be discovered.

Thanks, I'll give this a shot!

Great, thanks, let us know how you get on!

apcamargo commented 4 years ago

Is there any update on this? I'm running UMAP in a loop for hundreds of different datasets, and for some reason it is not using more than one thread.

I’d be great if I could disable Numba parallelism to use joblib instead.

rmitsch commented 4 years ago

@apcamargo Using numba.config.THREADING_LAYER = 'tbb' pretty much solved the problem for me - same use case of generating 10^2 to 10^3 UMAP embeddings in parallel.
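
i.e. roughly the following (requires tbb to be installed; the assignment has to happen before the first parallel function is compiled):

```
import numba

# Select the TBB threading layer before any @njit(parallel=True)
# function compiles; see the threading-layer docs linked above.
numba.config.THREADING_LAYER = "tbb"

import umap

# ... fit UMAP instances from multiple threads as usual ...
```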

apcamargo commented 4 years ago

Thank you @rmitsch! I've tested your suggestion, but the execution times remained more or less the same. In this case I think it's safer not to use joblib, because I don't want to add an additional dependency (tbb), which I suspect is Intel-only.

I'll try to figure out why the default parallel implementation is not working for me.

stuartarchibald commented 4 years ago

@apcamargo tbb is available on all platforms.

apcamargo commented 4 years ago

Didn't know that!

What I did was set NUMBA_NUM_THREADS to 1 and THREADING_LAYER to 'tbb', and put the loop inside a joblib helper class. For some reason I didn't observe any speed improvement.
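
Concretely, something like this (a sketch with placeholder data; the env var has to be set before Numba is first imported):

```
import os

os.environ["NUMBA_NUM_THREADS"] = "1"  # keep each Numba kernel single-threaded

import numba

numba.config.THREADING_LAYER = "tbb"

import numpy as np
import umap
from joblib import Parallel, delayed

def embed_one(data):
    return umap.UMAP().fit_transform(data)

datasets = [np.random.rand(500, 20) for _ in range(16)]  # placeholder data
embeddings = Parallel(n_jobs=16)(delayed(embed_one)(d) for d in datasets)
```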

lmcinnes commented 4 years ago

Currently, if you don't have pynndescent installed and you fix a random seed, you should avoid all of the Numba-induced parallelism.
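
In other words (a sketch; relies on the behaviour described above, i.e. no pynndescent installed):

```
import numpy as np
import umap

# Fixing random_state selects the deterministic, single-threaded code paths
# when pynndescent is not installed, so the loop itself can be parallelized
# externally.
datasets = [np.random.rand(500, 20) for _ in range(4)]  # placeholder data
embeddings = [umap.UMAP(random_state=42).fit_transform(d) for d in datasets]
```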

apcamargo commented 4 years ago

Currently, if you don't have pynndescent installed and you fix a random seed, you should avoid all of the Numba-induced parallelism.

I’m using a fixed seed. It’s strange that joblib is not improving the speed at all.

lmcinnes commented 4 years ago

It could be that the cost of moving memory around is greater than the CPU cost for some of the operations - I'm honestly not sure. As others noted, tbb and NUMBA_NUM_THREADS=1 should also make things work.

apcamargo commented 4 years ago

I tried it on a cluster with 64 threads and the execution time went from 102 min to 75 min. I expected a larger improvement in speed, but I guess that's better than nothing. @rmitsch, is this more or less the same kind of improvement you observed?

Thank you all for the help!