mreineck / ducc

Fork of https://gitlab.mpcdf.mpg.de/mtr/ducc to simplify external contributions
GNU General Public License v2.0

DUCC capping its number of threads to OMP_NUM_THREADS #42

Closed v-morello closed 7 hours ago

v-morello commented 9 hours ago

Hello,

I'm working on SKA radio imaging pipelines which we're planning to scale using dask. As part of that effort, we've made some Python wrappers for the Wgridder to be distributed. The problem: when ms2dirty() is called from a dask worker, it effectively ignores its nthreads argument and always uses 1 thread. The root cause is that dask workers are launched with OMP_NUM_THREADS forced to 1, which dask does for good reasons, but which has an unexpected interplay with DUCC.

Tracing the problem

DUCC has a function ducc0_default_num_threads() that reads the OMP_NUM_THREADS environment variable directly. https://gitlab.mpcdf.mpg.de/mtr/ducc/-/blob/85e467cad7088c500f7b7b677f3f5c399a22766a/src/ducc0/infra/threading.cc#L128

DUCC then creates a thread pool singleton whose number of workers equals the value returned by ducc0_default_num_threads(). https://gitlab.mpcdf.mpg.de/mtr/ducc/-/blob/85e467cad7088c500f7b7b677f3f5c399a22766a/src/ducc0/infra/threading.cc#L398

Lastly, the actual number of threads used in the Wgridder is chosen to be min(num_workers, requested_nthreads) https://gitlab.mpcdf.mpg.de/mtr/ducc/-/blob/85e467cad7088c500f7b7b677f3f5c399a22766a/src/ducc0/infra/threading.cc#L348

Therefore, if the OMP_NUM_THREADS environment variable is set to 1, DUCC will always use 1 thread, regardless of the nthreads argument to ms2dirty(). Unfortunately, Nanny processes in dask set OMP_NUM_THREADS to 1 before launching their managed worker: https://docs.dask.org/en/latest/array-best-practices.html#avoid-oversubscribing-threads

Hence the observed behaviour: the nthreads argument is ignored, and from Python we have no available workaround to tweak the number of threads DUCC actually uses.
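To make the interplay concrete, here is a small self-contained Python sketch of the selection logic traced above (the function names are illustrative, not ducc's actual symbols):

```python
import os

def default_num_threads(env):
    """Mimics ducc0_default_num_threads(): DUCC0_NUM_THREADS wins,
    then OMP_NUM_THREADS, then the hardware thread count."""
    for var in ("DUCC0_NUM_THREADS", "OMP_NUM_THREADS"):
        if var in env:
            return int(env[var])
    return os.cpu_count()

def threads_used(requested_nthreads, pool_size):
    """Mimics the Wgridder's choice: min(num_workers, requested_nthreads)."""
    return min(pool_size, requested_nthreads)

# A dask nanny launches its worker with OMP_NUM_THREADS=1, so the pool
# singleton is created with a single worker ...
pool_size = default_num_threads({"OMP_NUM_THREADS": "1"})
# ... and ms2dirty(..., nthreads=8) is then capped to 1 thread:
print(threads_used(8, pool_size))  # -> 1
```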

Proposed Solution (preferred)

I'd suggest not basing any decision on OMP_NUM_THREADS, because DUCC0 threads are fundamentally not OpenMP threads; they come from a custom thread pool. A mechanism to set the maximum number of threads via an environment variable is still nice to have, though, and DUCC0_NUM_THREADS is aptly and uniquely named for that purpose.

Proposed Solution (alternative)

The fundamental problem on our side is that the environment of an already running process cannot be changed from the outside. The threadpoolctl Python library provides a context manager that sets the actual number of OpenMP threads by effectively calling omp_set_num_threads(). DUCC would react to threadpoolctl if we could replace this:

    auto evar=getenv("DUCC0_NUM_THREADS");
    // fallback
    if (!evar)
      evar=getenv("OMP_NUM_THREADS");
    if (!evar)
      return res;

by this instead:

    auto evar=getenv("DUCC0_NUM_THREADS");
    if (!evar)
      return omp_get_max_threads();

Note that omp_get_max_threads() returns the same value as OMP_NUM_THREADS when that variable is set; if it is not set, it returns the number of hardware threads, as far as I understand. https://www.ibm.com/docs/en/xffbg/121.141?topic=openmp-omp-get-max-threads

Would either of these proposals be acceptable?

mreineck commented 9 hours ago

If your preferred solution is that ducc should obey the value of DUCC0_NUM_THREADS, then I'm not really sure what the problem is: whenever DUCC0_NUM_THREADS is set, it will always override OMP_NUM_THREADS. If this is not the case, please let me know!

Since ducc 0.35 there is also another way to set the thread pool size at run time: ducc0.misc now has the function resize_thread_pool, which allows adjusting the number of available threads independently of environment variables. Would this be an acceptable workaround?

mreineck commented 9 hours ago

Concerning this suggestion:

    auto evar=getenv("DUCC0_NUM_THREADS");
    if (!evar)
      return omp_get_max_threads();

I cannot do that unfortunately, since ducc is usually compiled without OpenMP support. So there is no way for me to access omp_get_max_threads().

v-morello commented 8 hours ago

Hi Martin, thanks for such a quick reply :smile:

To clarify this point:

If your preferred solution is that ducc should obey the value of DUCC0_NUM_THREADS, then I'm not really sure what the problem is: whenever DUCC0_NUM_THREADS is set, it will always override OMP_NUM_THREADS. If this is not the case, please let me know!

That would run into the same problem: the environment of an already running process (in this case, the dask worker process) cannot be changed from the outside. If we were to launch dask workers with DUCC0_NUM_THREADS=N then all DUCC tasks would use N threads regardless of the nthreads argument passed to ms2dirty(). What I'd like is the ability to choose without having to tear down the whole dask cluster.

ducc0.misc now has the function resize_thread_pool, which allows adjusting the number of available threads independently of environment variables

Oh, I was not aware of its existence, this could work nicely. I am going to give it a go and report back.

mreineck commented 8 hours ago

If we were to launch dask workers with DUCC0_NUM_THREADS=N then all DUCC tasks would use N threads regardless of the nthreads argument passed to ms2dirty().

They shouldn't. If N is larger than the nthreads argument, they should obey nthreads. So all you need to do is set DUCC0_NUM_THREADS high enough (e.g. to the number of virtual cores on the node), correct?

mreineck commented 8 hours ago

To clarify: the thread pool would still have size N, and perhaps that's what is shown in some monitoring tools, but not all of those threads will be given work if ms2dirty is invoked with nthreads<N.
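The decoupling can be sketched in Python (again purely illustrative, not ducc's code): the pool keeps N worker threads alive, but a call that requests nthreads=k only hands work to k of them.

```python
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 8  # e.g. DUCC0_NUM_THREADS=8, or a resize_thread_pool(8) call
pool = ThreadPoolExecutor(max_workers=POOL_SIZE)  # 8 threads exist ...

def parallel_sum(values, nthreads):
    """... but a given call only submits min(nthreads, POOL_SIZE) chunks,
    so at most that many pool threads get work."""
    n = min(nthreads, POOL_SIZE)
    chunks = [values[i::n] for i in range(n)]
    futures = [pool.submit(sum, chunk) for chunk in chunks]
    return sum(f.result() for f in futures)

# Monitoring would show 8 pool threads, yet this call keeps at most
# 2 of them busy:
print(parallel_sum(list(range(100)), nthreads=2))  # -> 4950
```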

v-morello commented 7 hours ago

So all you need to do is to set DUCC0_NUM_THREADS high enough (e.g. to the number of virtual cores on the node), correct?

You're absolutely right, that would work; I missed that angle. I'm still not too keen on the solution in principle, though, because it means fixing a local problem (setting nthreads for a particular function call) at a much wider scope (the whole cluster).

However, resize_thread_pool() appears to work as expected, is callable from Python, and provides the "local" solution I'm looking for. I'm happy to upgrade to v0.35 and use just that.

I still wonder whether exposing the nthreads / thread-pool-size decoupling on the Python side is the best way to go in terms of "least surprise"; the average Python user in me would prefer nthreads to be exactly what I ask for. There might be good reasons for the way things are (and users with the opposite opinion!), so I leave that to your judgement.

Thanks a lot for your assistance, and please feel free to close this.

mreineck commented 7 hours ago

I still wonder whether exposing the nthreads / thread-pool-size decoupling on the Python side is the best way to go in terms of "least surprise"

That's a good, and difficult, question. Originally I introduced the dependency on OMP_NUM_THREADS to minimize the surprise for users who normally use OpenMP and are not aware (or don't care) that ducc is using something else. Then the wsclean people pointed out the same problem as you did (https://gitlab.com/aroffringa/wsclean/-/issues/175), and the only halfway satisfying solution I could think of was to introduce the additional functions.

Generally I think that decoupling the overall thread pool size from the nthreads of individual calls is necessary, since changing the thread pool size is potentially slow, and sometimes the user knows a given call is so quick that nthreads=1 is the best option. I could, as you say, always respect the provided nthreads argument and increase the thread pool size whenever necessary ... but then the currently supported semantics of nthreads=0 (which was originally requested by the scipy developers) would become meaningless. There doesn't seem to be a way to get it right :-(