pyxem / pyxem-demos

Examples and tutorials of multi-dimensional diffraction microscopy workflows using pyxem.

Setting %env OMP_NUM_THREADS=1 speeds up the routines in example 11 tremendously #87

Open uellue opened 1 year ago

uellue commented 1 year ago

When matching a larger dataset following example 11, but using lazy arrays, running this as the very first cell, before any numerical library is loaded, speeds up the calculation on a machine with 24 cores:

%env OMP_NUM_THREADS=1

When @sk1p profiled the system under load without this setting, it spent most of its time in sched_yield instead of doing useful work. With this setting enabled (no OpenMP multithreading) it was mostly doing useful work. I didn't benchmark the difference precisely because I ran out of patience, but it is roughly a factor of 10.

Some routines in SciPy and NumPy are multi-threaded internally, for example through OpenBLAS. It seems that Dask's/pyxem's parallelism in combination with OpenMP/OpenBLAS threading leads to oversubscription of the CPU, or some other kind of scheduling issue. Restricting to only one level of parallelism resolves this.
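
In a plain Python script (outside a notebook), the equivalent is to pin the common BLAS/OpenMP thread pools before any numerical library is imported. A minimal sketch; depending on which BLAS your NumPy/SciPy build links against, OMP_NUM_THREADS alone may be enough:

```python
import os

# Must be set before numpy/scipy/dask/pyxem are imported,
# otherwise the thread pools are already initialised.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np  # noqa: E402  imported after setting the limits
import dask.array as da  # noqa: E402
```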

FYI we encountered a similar issue in LiberTEM. In order to avoid setting the environment variable and disabling threading altogether, we implemented a few context managers to set the thread count to 1 in code blocks that run in parallel: https://github.com/LiberTEM/LiberTEM/blob/master/src/libertem/common/threading.py
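
A minimal sketch of that idea using threadpoolctl; the linked LiberTEM module covers more cases (so treat this as an illustration, not its actual implementation):

```python
from contextlib import contextmanager

from threadpoolctl import threadpool_limits


@contextmanager
def single_threaded_blas():
    """Limit OpenMP/BLAS thread pools to one thread inside the block.

    Intended for code that already runs inside a Dask worker, so the
    outer level of parallelism is the only one left.
    """
    with threadpool_limits(limits=1):
        yield


# Usage inside a function that Dask maps over chunks:
# with single_threaded_blas():
#     result = np.linalg.svd(chunk)
```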

Maybe that could be useful in HyperSpy/pyxem? Perhaps this should actually be handled in Dask itself.

CSSFrancis commented 1 year ago

@uellue This is really useful information and a great help!

I've suspected something like this was happening but never got around to working out why. I imagine Dask would be very interested in this as well. Is this a problem with dask-distributed too? I usually get fairly good performance with 2-4 threads per process using the distributed backend, but the scheduling seems quite a bit slower than I feel it should be.

uellue commented 1 year ago

Yes, we had the same issue with dask-distributed. It is not so apparent on small machines, but a big machine will slow to a crawl. I'm not sure if it happens with native Dask array operations as well; that remains to be tested. I'll open an issue with Dask for discussion.
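
As a sketch of what keeping a single level of parallelism with dask-distributed could look like (the worker/thread split is an assumption for a 24-core machine like the one above, not a recommendation):

```python
import os

# Pin the BLAS/OpenMP pools before anything numerical is imported;
# freshly started worker processes inherit this environment.
os.environ["OMP_NUM_THREADS"] = "1"

from dask.distributed import Client, LocalCluster  # noqa: E402

# One worker process per core and one Dask thread per worker, so the
# only parallelism left is the one Dask schedules itself.
cluster = LocalCluster(n_workers=24, threads_per_worker=1)
client = Client(cluster)
```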

uellue commented 1 year ago

https://docs.dask.org/en/stable/array-best-practices.html#avoid-oversubscribing-threads

CSSFrancis commented 1 year ago

@uellue Sounds like some better context managers are in order for hyperspy/pyxem. Thanks for bringing this up!

By the way, I am planning to make a couple of changes to the orientation mapping code in the next week or two, mostly to simplify the method and let it use dask-distributed so it can run on multiple GPUs. Are there any changes you would be interested in seeing?