If your preferred solution is that `ducc` should obey the value of `DUCC0_NUM_THREADS`, then I'm not really sure what the problem is: whenever `DUCC0_NUM_THREADS` is set, it will always override `OMP_NUM_THREADS`. If this is not the case, please let me know!
Since `ducc` 0.35 you can also use another way to set the thread pool size during run time: `ducc0.misc` now has the function `resize_thread_pool`, which allows adjusting the number of available threads independently of environment variables. Would this be an acceptable workaround?
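For reference, a minimal sketch of what that looks like from Python (assuming ducc0 >= 0.35; the pool size of 8 is just an illustrative value):

```python
import ducc0

# Grow (or shrink) ducc's global thread pool at run time, independently of
# whatever OMP_NUM_THREADS / DUCC0_NUM_THREADS said when the process started.
ducc0.misc.resize_thread_pool(8)
```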
Concerning this suggestion:

```cpp
auto evar=getenv("DUCC0_NUM_THREADS");
if (!evar)
  return omp_get_max_threads();
```

I cannot do that unfortunately, since `ducc` is usually compiled without OpenMP support. So there is no way for me to access `omp_get_max_threads()`.
Hi Martin, thanks for such a quick reply :smile:
To clarify this point:
> If your preferred solution is that `ducc` should obey the value of `DUCC0_NUM_THREADS`, then I'm not really sure what the problem is: whenever `DUCC0_NUM_THREADS` is set, it will always override `OMP_NUM_THREADS`. If this is not the case, please let me know!
That would run into the same problem: it's not possible to set an environment variable dynamically from inside a running process (in this case, the dask worker process). If we were to launch dask workers with `DUCC0_NUM_THREADS=N`, then all DUCC tasks would use N threads regardless of the `nthreads` argument passed to `ms2dirty()`. What I'd like is the ability to choose without having to tear down the whole dask cluster.
> `ducc0.misc` now has the function `resize_thread_pool`, which allows adjusting the number of available threads independently of environment variables
Oh, I was not aware of its existence; this could work nicely. I am going to give it a go and report back.
> If we were to launch dask workers with `DUCC0_NUM_THREADS=N` then all DUCC tasks would use N threads regardless of the `nthreads` argument passed to `ms2dirty()`.
They shouldn't. If N is larger than the `nthreads` argument, they should obey `nthreads`. So all you need to do is to set `DUCC0_NUM_THREADS` high enough (e.g. to the number of virtual cores on the node), correct?

To clarify: the thread pool would still have size N, and perhaps that's what is shown in some monitoring tools, but not all of those threads will be given work if `ms2dirty` is invoked with `nthreads < N`.
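In other words, the number of threads that actually do work is the smaller of the pool size and the per-call `nthreads`. A small sketch of that rule (not the actual implementation, just a restatement of the behaviour described above; `nthreads=0` is taken to keep its current meaning of "use the whole pool"):

```python
def threads_actually_used(pool_size: int, nthreads: int) -> int:
    """Effective thread count for a single call such as ms2dirty().

    pool_size -- size of ducc's global thread pool (from DUCC0_NUM_THREADS /
                 OMP_NUM_THREADS at start-up, or resize_thread_pool()).
    nthreads  -- the per-call argument; 0 means "use everything in the pool".
    """
    return pool_size if nthreads == 0 else min(pool_size, nthreads)
```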
> So all you need to do is to set `DUCC0_NUM_THREADS` high enough (e.g. to the number of virtual cores on the node), correct?
You're absolutely right, it would work; I missed that angle. I'm still not too keen on that solution in principle, because it means fixing a local problem (setting `nthreads` for a particular function call) at a higher scope (the whole cluster).

However, `resize_thread_pool()` appears to work as expected, is callable from Python and provides the "local" solution I'm looking for. I'm happy to upgrade to v0.35 and use just that. I still wonder whether exposing the decoupling between `nthreads` and thread pool size on the Python side is the best way to go in terms of "least surprise"; the average Python user in me would prefer `nthreads` to be what I ask it to be. There might be good reasons for the way things are (and users with the opposite opinion!), so I leave that to your appreciation.
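Concretely, the shape of the fix on our side would look roughly like this (a hypothetical sketch, not our actual wrapper code; `gridding_task` and the argument plumbing are placeholders):

```python
import ducc0

def gridding_task(nthreads, **ms2dirty_kwargs):
    # The dask worker was started with OMP_NUM_THREADS=1, so ducc's pool has
    # size 1 by default; grow it here so that nthreads can actually take effect.
    ducc0.misc.resize_thread_pool(nthreads)
    return ducc0.wgridder.ms2dirty(nthreads=nthreads, **ms2dirty_kwargs)
```

The task is then submitted to the dask cluster as usual, with `nthreads` chosen per call.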
Thanks a lot for your assistance, and please feel free to close this.
> I still wonder whether exposing the decoupling between `nthreads` and thread pool size on the Python side is the best way to go in terms of "least surprise"
That's a good, and difficult, question.
Originally I introduced the dependency on `OMP_NUM_THREADS` to minimize the surprise for users who normally use OpenMP and are not aware/don't care that `ducc` is using something else. Then the `wsclean` people pointed out the same problem as you did (https://gitlab.com/aroffringa/wsclean/-/issues/175), and the only halfway satisfying solution I could think of was to introduce the additional functions.

Generally I think that decoupling the overall thread pool size from the `nthreads` of individual calls is necessary, since changing the thread pool size is potentially slow, and sometimes the user knows a given call is so quick that `nthreads=1` is the best option.

I could, as you say, always respect the provided `nthreads` argument and increase the thread pool size whenever necessary ... but then the currently supported semantics of `nthreads=0` (which was originally requested by the `scipy` developers) becomes meaningless. There doesn't seem to be a way to get it right :-(
Hello,
I'm working on SKA radio imaging pipelines which we're planning to scale using dask. As part of the effort, we've made some Python wrappers for Wgridder to be distributed. Problem: when `ms2dirty()` is called from a dask worker, it effectively ignores its `nthreads` argument and always uses 1 thread. The fundamental reason is that dask workers are forcibly launched with `OMP_NUM_THREADS` set to 1, which dask does for rather good reasons, but it has an unexpected interplay with DUCC.

### Tracing the problem
1. DUCC has a `ducc0_default_num_threads` that reads the `OMP_NUM_THREADS` env variable directly: https://gitlab.mpcdf.mpg.de/mtr/ducc/-/blob/85e467cad7088c500f7b7b677f3f5c399a22766a/src/ducc0/infra/threading.cc#L128
2. DUCC then creates a thread pool singleton whose number of workers is equal to `ducc0_default_num_threads`: https://gitlab.mpcdf.mpg.de/mtr/ducc/-/blob/85e467cad7088c500f7b7b677f3f5c399a22766a/src/ducc0/infra/threading.cc#L398
3. Lastly, the actual number of threads used in the Wgridder is chosen to be `min(num_workers, requested_nthreads)`: https://gitlab.mpcdf.mpg.de/mtr/ducc/-/blob/85e467cad7088c500f7b7b677f3f5c399a22766a/src/ducc0/infra/threading.cc#L348

Therefore, if the `OMP_NUM_THREADS` env variable is set to 1, DUCC will always use 1 thread regardless of the `nthreads` argument to `ms2dirty()`. Unfortunately, Nanny processes in dask set `OMP_NUM_THREADS` to 1 before launching their managed worker: https://docs.dask.org/en/latest/array-best-practices.html#avoid-oversubscribing-threads

Hence the observed behaviour: the `nthreads` argument is ignored, and we have no available workaround from Python to tweak the actual number of threads used by DUCC.
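A quick way to confirm what the workers actually inherit (a check sketch only; the local `Client()` and the `report_omp_env` helper are illustrative, not part of our pipeline):

```python
import os
from distributed import Client

def report_omp_env():
    # dask's Nanny exports OMP_NUM_THREADS=1 before spawning the worker
    # process, and this is what ducc0_default_num_threads() ends up reading.
    return os.environ.get("OMP_NUM_THREADS", "<unset>")

client = Client()                  # local cluster with Nanny-managed workers
print(client.run(report_omp_env))  # typically {'tcp://...': '1', ...}
```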
### Proposed Solution (preferred)

I'd suggest not basing any decision on `OMP_NUM_THREADS`, because DUCC0 threads are fundamentally not OpenMP threads; they come from a custom thread pool. Having a mechanism to set the maximum number of threads via an env variable is nice to have though, and `DUCC0_NUM_THREADS` is aptly and uniquely named for that.
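If the environment-variable route is taken, the variable just needs to reach the worker processes. One way that should work for a local dask cluster (a sketch; the value 16 is arbitrary, and exporting the variable in the parent process before the cluster is created is an assumption about the deployment, not something prescribed by ducc or dask):

```python
import os
from distributed import Client

# Export DUCC0_NUM_THREADS before the cluster (and its Nanny processes) is
# created, so the spawned worker processes inherit it; the Nanny will still
# pin OMP_NUM_THREADS=1 as usual.
os.environ["DUCC0_NUM_THREADS"] = "16"

client = Client()  # workers start with DUCC0_NUM_THREADS=16 in their environment
```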
### Proposed Solution (alternative)

The fundamental problem from our side is that nothing can change an env variable from inside an already running process. The threadpoolctl Python library provides a context manager that can set the number of actual OpenMP threads by effectively calling `omp_set_num_threads()`. DUCC would react to `threadpoolctl` if the direct read of `OMP_NUM_THREADS` in `ducc0_default_num_threads` were replaced by a call to `omp_get_max_threads()` (guarded by `DUCC0_NUM_THREADS`, as in the snippet quoted earlier in this thread).
Noting that `omp_get_max_threads()` returns the same thing as `OMP_NUM_THREADS`, and if the env variable is not set, it will return the number of hardware threads as far as I understand: https://www.ibm.com/docs/en/xffbg/121.141?topic=openmp-omp-get-max-threads

Would either of these proposals be acceptable?