pyxem / pyxem-demos

Examples and tutorials of multi-dimensional diffraction microscopy workflows using pyxem.

Setting %env OMP_NUM_THREADS=1 speeds up the routines in example 11 tremendously #87

Open uellue opened 11 months ago

uellue commented 11 months ago

When matching a larger dataset following example 11, but using lazy arrays, running this as the very first cell before any numerical library is loaded speeds up the calculation on a machine with 24 cores:

%env OMP_NUM_THREADS=1
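
For a plain Python script (outside Jupyter), a sketch of the equivalent is to set the variable with os.environ; the key point is that the assignment has to happen before NumPy/SciPy initialise their thread pools:

```python
# Equivalent of the %env magic for a plain script: set the variable before
# NumPy/SciPy are imported, otherwise OpenMP/OpenBLAS have already sized
# their thread pools and the setting has no effect.
import os
os.environ["OMP_NUM_THREADS"] = "1"

import numpy as np   # only import numerical libraries after this point
import pyxem as pxm
```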

When @sk1p profiled the system under load without this setting, it spent most of its time in sched_yield instead of doing useful work. With this setting enabled (no OpenMP multithreading) it was mostly doing useful work. I didn't benchmark the difference precisely because I ran out of patience, but it is roughly a factor of 10.

Some routines in SciPy and NumPy are multi-threaded internally, for example through OpenBLAS. It seems that Dask's/pyxem's parallelism in combination with OpenMP/OpenBLAS threading leads to oversubscription of the CPU, or to some other kind of scheduling issue. Restricting the computation to a single level of parallelism resolves this.
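
As a quick way to check whether this is happening on a given machine, the threadpoolctl package (not used in the example, just a suggestion) can list the native thread pools a process has opened and how large they are:

```python
# Sketch: list the native thread pools (OpenBLAS, MKL, OpenMP, ...) active in
# the current process and how many threads each one uses.
# Requires the threadpoolctl package (pip install threadpoolctl).
import numpy as np                      # load the BLAS backend first
from threadpoolctl import threadpool_info

for pool in threadpool_info():
    print(pool["user_api"], pool["num_threads"], pool.get("internal_api"))
```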

FYI we encountered a similar issue in LiberTEM. In order to avoid setting the environment variable and disabling threading altogether, we implemented a few context managers to set the thread count to 1 in code blocks that run in parallel: https://github.com/LiberTEM/LiberTEM/blob/master/src/libertem/common/threading.py
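
For reference, a minimal sketch of the same idea (using threadpoolctl rather than the actual LiberTEM implementation linked above) could look like this:

```python
# Minimal sketch of a similar helper, assuming threadpoolctl is available.
# It caps the BLAS/OpenMP pools only inside code that is already parallelised
# by Dask, instead of disabling threading globally via the environment.
from contextlib import contextmanager
from threadpoolctl import threadpool_limits

@contextmanager
def single_threaded_blas(n_threads=1):
    with threadpool_limits(limits=n_threads):
        yield

# hypothetical per-chunk function that Dask maps over the data
def process_chunk(chunk):
    with single_threaded_blas():
        return chunk @ chunk.T   # stand-in for the actual matching work
```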

Maybe that could be useful in HyperSpy/pyxem? Perhaps this should actually be handled in Dask.

CSSFrancis commented 11 months ago

@uellue This is really useful information and a great help!

I've been suspicious of something like this happening, but never got around to tracking down the cause. I would imagine the Dask developers would be very interested in this as well. Is this a problem with dask-distributed too? I usually get fairly good performance with 2-4 threads per process using the distributed backend, but the scheduling seems quite a bit slower than I feel it should be.

uellue commented 11 months ago

Yes, we had the same issue with dask-distributed. It is not so apparent on small machines, but a big machine will slow to a crawl. I'm not sure whether it also happens with native Dask array operations; that remains to be tested. I'll open an issue in Dask for discussion.
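
As a sketch (not benchmarked here), one way to keep a single level of parallelism with dask-distributed is to start the workers with one compute thread each, with the environment variable exported before they are launched:

```python
# Sketch: one worker process per core, a single compute thread per worker,
# so Dask's process-level parallelism is the only level of parallelism.
# The OMP_NUM_THREADS=1 setting above should be in place before this runs.
from dask.distributed import Client

client = Client(n_workers=24, threads_per_worker=1)   # 24-core machine from the example above
```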

uellue commented 11 months ago

https://docs.dask.org/en/stable/array-best-practices.html#avoid-oversubscribing-threads
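
In short, that page recommends making sure the low-level libraries do not spawn their own pools on top of Dask's, for example (sketch following the page's advice):

```python
# Set the common threading variables to 1 before importing NumPy/SciPy/Dask,
# as recommended on the linked best-practices page.
import os
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ[var] = "1"
```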

CSSFrancis commented 11 months ago

@uellue Sounds like some better context managers are in order for HyperSpy/pyxem. Thanks for bringing this up!

By the way, I am planning on making a couple of changes to the orientation mapping code in the next week or two, mostly to simplify the method and let it use dask-distributed so it can run on multiple GPUs. Are there any changes you might be interested in seeing?