per-caller-thread limits

joblib / threadpoolctl

Python helpers to limit the number of threads used in native libraries that handle their own internal threadpool (BLAS and OpenMP implementations)

BSD 3-Clause "New" or "Revised" License


per-caller-thread limits #76

Open orenbenkiki opened 4 years ago

orenbenkiki commented 4 years ago

The threadpool_limits are global. This makes it difficult to avoid oversubscription when invoking parallel operations (e.g., Numpy functions) from within a parallel divide-and-conquer algorithm.

Ideally, parallel multi-threading frameworks would be fully multi-threaded-aware, that is, have a limit on the total number of threads used, regardless of how many threads are generating requests. This however seems too much to ask for :-(

A simpler modification would be to set per-caller-thread limits. This way, a divide-and-conquer algorithm could, at each step, subdivide the total budget of threads. As a secondary upside, an odd budget of (2n+1) threads could be split into (n) threads for one sub-task and (n+1) threads for the other, fully utilizing all threads, rather than setting a global budget of (n) threads for each (leaving one thread unused) or (n+1) for each (oversubscribing by one).
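To make the budget arithmetic concrete, here is a minimal sketch in plain Python. The helper names `split_budget` and `leaf_budgets` are hypothetical and are not part of threadpoolctl; the sketch only computes the per-sub-task budgets and does not apply them to any native library.

```python
def split_budget(total, parts=2):
    """Split `total` threads into `parts` near-equal, positive shares."""
    if total < parts:
        raise ValueError("not enough threads to give each part at least one")
    base, extra = divmod(total, parts)
    # The first `extra` shares get one additional thread.
    return [base + 1 if i < extra else base for i in range(parts)]


def leaf_budgets(budget, depth):
    """Thread budgets at the leaves after up to `depth` binary splits."""
    if depth == 0 or budget < 2:
        return [budget]
    left, right = split_budget(budget)
    return leaf_budgets(left, depth - 1) + leaf_budgets(right, depth - 1)
```

With a budget of 7 threads, `split_budget(7)` yields shares of 4 and 3, and two levels of splitting still add up to the full 7 threads, so nothing is left idle or oversubscribed.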

Is such finer-grained control over thread limits possible? If so, I'd love to see support for it in threadpoolctl.

jeremiedbb commented 4 years ago

As far as I know, it's not possible in a multi-threaded setting, because we can only set the maximum number of threads of the native C libraries (BLAS, OpenMP) globally. I think MKL has a way to specify a thread-local maximum number of threads, but I'm not sure it solves this issue.

It might be different if your algorithm uses multi-processing. Then you should be able to set the number of threads in each subprocess. You'd still have to keep track of the budget at each step; the scheduling part is outside of the scope of threadpoolctl.

ogrisel commented 3 years ago

The standard OpenMP and BLAS APIs do not provide a generic way to do this as @jeremiedbb said above. It would be great to lobby BLAS implementation developers to provide a consistent API to set the parallelism budget on a per-BLAS-call thread-local basis. I think the BLIS developers intended to do so a while ago but I have not followed their development recently and as far as I know OpenBLAS does not provide anything like this.

orenbenkiki commented 3 years ago

@jeremiedbb seems correct in that, if one uses multi-processing, it is possible to use threadpoolctl to set a different "global" thread limit in each sub-process - at least, this seems to be working for me. That is, I use multiprocessing.Pool.map and wrap each invocation with a function that checks whether it is the first one running in the sub-process, and if so, it first asks threadpoolctl for a reduced number of threads and only then does the actual work.
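A rough sketch of this workaround might look like the following. The names `ensure_limited`, `limited_task`, `run` and `THREADS_PER_WORKER` are hypothetical, and the squaring stands in for the real work; the only threadpoolctl call used is `threadpool_limits(limits=...)`, which keeps the limit in place when called as a plain function rather than as a context manager. The guarded import is an assumption made only so that the sketch still runs where threadpoolctl is not installed.

```python
import os
from multiprocessing import Pool

try:
    from threadpoolctl import threadpool_limits
except ImportError:  # assumption: allow the sketch to run without threadpoolctl
    threadpool_limits = None

THREADS_PER_WORKER = 2  # hypothetical per-process thread budget
_limited_pid = None     # pid of the process where the limit was last applied


def ensure_limited(limit=THREADS_PER_WORKER):
    """Apply the limit on the first call in a process; return True if applied now."""
    global _limited_pid
    if _limited_pid == os.getpid():
        return False
    if threadpool_limits is not None:
        # Called as a function (not in a `with` block), so the limit persists.
        threadpool_limits(limits=limit)
    _limited_pid = os.getpid()
    return True


def limited_task(x):
    """Wrapper: limit native threads once per process, then do the work."""
    ensure_limited()
    return x * x  # placeholder for the real, BLAS-heavy computation


def run(items, workers=4):
    """Map the wrapped task over `items` using a pool of worker processes."""
    with Pool(workers) as pool:
        return pool.map(limited_task, items)
```

On platforms that start workers with `spawn`, `run(...)` should be called from under an `if __name__ == "__main__":` guard. Passing an `initializer` to `multiprocessing.Pool` would be an alternative way to run the limiting step once per worker.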

ogrisel commented 2 years ago

Based on the last reply, I have the feeling that we can close this issue.

orenbenkiki commented 2 years ago

Note that the workaround has significant disadvantages:

It only works using multiprocessing and not multithreading
It requires trickery to set the number of threads per process only once (in the 1st invocation of a task on the process)
It is inefficient since long-running task(s) on few sub-processes are limited to using few CPUs due to the static allocation of threads to sub-processes, even when most CPUs are idle (other sub-processes having completed execution)
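The last point can be illustrated with a small back-of-the-envelope calculation. The function `utilization` and the numbers used with it are hypothetical; the sketch assumes every running sub-process keeps all of its statically assigned threads busy.

```python
def utilization(total_cpus, threads_per_process, running_processes):
    """Fraction of CPUs in use under a static per-process thread limit."""
    busy = min(total_cpus, threads_per_process * running_processes)
    return busy / total_cpus
```

For example, with 8 CPUs and 4 sub-processes limited to 2 threads each, all CPUs are busy while every sub-process is running, but once three of them finish, the remaining long-running task uses only 2 of the 8 CPUs (25%).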

ogrisel commented 2 years ago

What you describe is something far beyond the scope of threadpoolctl. I think what you want is close to what TBB offers with a full-fledged task scheduler. However that would require all the threaded tools of the ecosystem (BLAS, machine learning libraries, signal processing libraries...) to use TBB instead of OpenMP... and currently in the Python world, for instance, Cython does not have syntactic support for interfacing with a TBB runtime (as far as I know).

Also note that TBB has its limitations w.r.t. over-subscription in practical deployment scenarios like Docker containers, see: https://github.com/oneapi-src/oneTBB/issues/190 . They might be fixable though.

ogrisel commented 2 years ago

I can reopen the issue with a more descriptive title, however it's unlikely to ever be solved because major BLAS implementations (e.g. OpenBLAS) do not offer such control (maybe BLIS does?) and this is not part of the OpenMP standard either (as far as I know).

orenbenkiki commented 2 years ago

It is indeed a tough problem and might not be solvable in general, and OpenMP/BLAS etc. don't make it easy.

It is much easier if "everyone" agrees on a single scheduler technology (e.g. TBB). This is a place where Julia has an advantage: being new, and having a multi-threaded scheduler within the language from an early stage, most packages tend to just use it, so there's automatic balancing across multi-threaded apps.

That said, if we don't have it as an open issue then things wouldn't ever get any better...

ogrisel commented 2 years ago

The thing is that this problem will not be solved in threadpoolctl itself. So it is better to open such issues on the issue trackers of open source BLAS/LAPACK implementations (starting with OpenBLAS and BLIS) and maybe OpenMP runtime implementations, although I am not sure their maintainers will be interested in maintaining a feature that is not part of the OpenMP specification.