Closed sjperkins closed 2 years ago
I've marked this PR as a draft as it shouldn't be merged -- it exists to discuss profiling rather than to modify functionality.
Interesting. Have you tried profiling the non-dask versions of the functions? They actually run significantly faster than the dask versions
Yes, I can see this on my side. I probably should have been clearer. The kicker is here:
So while PyWavelets does drop the GIL, the quantity of work given to each thread may not be sufficient to fully exercise the cores.
There are 64 calls to _hdot_internal, which results in 320 calls to dwtn (320/64 == 5 levels). Then, within dwtn, there are further loops over the data. I don't think the Cython functions are being given enough work to do when the GIL is dropped. Therefore, everything ends up serialised, or worse.
A numba wavelet implementation may be required.
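Before reaching for numba, a quick way to test the per-call-work hypothesis is to time pywt.dwtn directly on blocks of different sizes; this is just a sketch on random data, not the PR's actual arrays:

```python
# Sketch only: time pywt.dwtn on random blocks of increasing size to see
# how much GIL-free work a single call provides relative to thread overhead.
import time

import numpy as np
import pywt

for n in (64, 256, 1024, 2048):
    x = np.random.standard_normal((n, n))
    t0 = time.perf_counter()
    for _ in range(10):
        pywt.dwtn(x, "db2", mode="symmetric")
    per_call = (time.perf_counter() - t0) / 10
    print(f"{n}x{n} ({x.nbytes / 2**20:.1f} MiB): {per_call * 1e3:.3f} ms per dwtn call")
```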
Ah man, that is not what I wanted to hear. If that was the case we should see the fraction
(time taken by dask implementation)/(time taken by serial implementation)
decrease with problem size right? I never tested this but I'll have a look. Thanks @sjperkins
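For what it's worth, a rough sketch of that scaling test, using pywt.dwtn on random blocks as a stand-in for the hdot functions in this repo and a thread pool as a stand-in for dask's threaded scheduler:

```python
# Sketch of the (threaded time) / (serial time) ratio as the block size grows.
# pywt.dwtn on random blocks stands in for the hdot/_hdot_internal calls,
# and a thread pool stands in for dask's threaded scheduler.
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pywt

nthreads = 8
for n in (128, 512, 2048):
    blocks = [np.random.standard_normal((n, n)) for _ in range(nthreads)]

    t0 = time.perf_counter()
    for b in blocks:
        pywt.dwtn(b, "db2")
    serial = time.perf_counter() - t0

    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        list(pool.map(lambda b: pywt.dwtn(b, "db2"), blocks))
    threaded = time.perf_counter() - t0

    print(f"{n}x{n}: threaded/serial = {threaded / serial:.2f}")
```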
Hmmmm, I'm not sure.
But just to confirm what I'm saying about the data sizes not being large enough, I put a print(subband, x.shape) here and it produces the following shapes:
Unfortunately, I think the Cython code is given at most 8MB of data to chew on, and a lot of the time it's much less than that. It's not really possible to exercise the cores if they don't have sufficient work to do, even if the GIL is dropped.
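To put numbers on that (my back-of-the-envelope arithmetic, assuming float64 coefficients): a square block only reaches 8 MiB at about 1024 x 1024, and each further decomposition level quarters it:

```python
# Back-of-the-envelope block sizes, assuming float64 (8 byte) coefficients.
for n in (128, 256, 512, 1024):
    print(f"{n}x{n}: {n * n * 8 / 2**20:.2f} MiB")
# 1024x1024 -> 8.00 MiB; the next level works on ~512x512 subbands -> 2.00 MiB
```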
That makes me very sad. It might be simpler to wrap existing libraries than start from scratch though. This http://wavelet2d.sourceforge.net/ also has the Daubechies filters. Not sure how difficult they would be to wrap though
It may be easier to wrap a pure C implementation. Any thoughts on the suitability of the following from a correctness POV?
https://github.com/rafat/wavelib
https://github.com/rafat/wavelib/wiki/DWT-Example-Code
http://www.wavelets.org/software.php
There's also the GNU Scientific Library (GSL)
https://www.gnu.org/software/gsl/doc/html/dwt.html
which appears to have Python wrappers:
pygsl would have been great but according to the docs:
The library provides functions to perform two-dimensional discrete wavelet transforms on square matrices. The matrix dimensions must be an integer power of two.
which is quite a severe limitation. I'll have a look at some of the other packages and get back to you. A quick glance at the first package looks promising. They have the fast discrete wavelet transforms we need, but I haven't checked if they also have some of the above limitations.
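If we did want to go the GSL route, the obvious (if wasteful) workaround would be to zero-pad each image up to the next power-of-two square before transforming; a numpy-only sketch of that padding step:

```python
# Sketch: zero-pad an arbitrary image up to the next power-of-two square,
# which is what GSL's 2-D DWT would require. Assumes zero padding is acceptable.
import numpy as np

def pad_to_pow2_square(x):
    n = 1 << (max(x.shape) - 1).bit_length()  # next power of two >= largest dim
    out = np.zeros((n, n), dtype=x.dtype)
    out[: x.shape[0], : x.shape[1]] = x
    return out

img = np.random.standard_normal((600, 487))
print(pad_to_pow2_square(img).shape)  # (1024, 1024)
```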
Also, see the demo here
https://github.com/PyWavelets/pywt/pull/230/commits/5c5d6d9b3a1ff8ce905e5d0be7430734cf0d0a85
It looks like they get some speed up using concurrent.futures so maybe we shouldn't throw the towel in just yet
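As I read that demo, the trick is coarse-grained threading: each thread gets a complete transform (e.g. one full decomposition per band) rather than a slice of one, so every task has a decent chunk of GIL-free work. Something along these lines (a sketch, not the demo code itself):

```python
# Sketch of coarse-grained threading in the spirit of that demo: one complete
# multilevel decomposition per band, so each thread has plenty of GIL-free work.
# Shapes, wavelet and level are illustrative only.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pywt

nband, nx, ny = 8, 1024, 1024
cube = np.random.standard_normal((nband, nx, ny))

def decompose(band):
    # PyWavelets releases the GIL inside the compiled transform code
    return pywt.wavedec2(band, "db2", level=5)

with ThreadPoolExecutor(max_workers=nband) as pool:
    coeffs = list(pool.map(decompose, cube))
```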
I see they've got a 3D wavelet transform. Do you think the current code could be modified to use wave2recn? That would increase the amount of work given to Cython.
Yes, I don't see why not. I'm also wondering how much the wavelet decomposition level has to do with it, since the individual blocks get smaller with increasing decomposition level, leaving less work per thread.
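To make the level point concrete, here is a quick illustration (my own, on random data) of how the coefficient blocks shrink per level:

```python
# Quick look at how coefficient blocks shrink with decomposition level:
# every level roughly halves the block along each axis, i.e. quarters the work.
import numpy as np
import pywt

x = np.random.standard_normal((2048, 2048))
coeffs = pywt.wavedec2(x, "db2", level=5)
print("approximation:", coeffs[0].shape)
for detail in coeffs[1:]:          # coarsest detail level first
    print("detail subbands:", detail[0].shape)
```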
@landmanbester I did some profiling with yappi. See below for the first 10 or so functions that take up the most time:
Selecting out the _hdot_internal and wavelet calls:
Out of 6.94s total time,
The next most expensive call is concatenate, which takes up 0.27 seconds.
I interpret this as the majority of the time being spent in dwtn, especially here:
https://github.com/PyWavelets/pywt/blob/db0172a8ea261064bbc2f0a7b26759c6a8f71d76/pywt/_multidim.py#L185-L191
So while PyWavelets does drop the GIL, the quantity of work given to each thread may not be sufficient to fully exercise the cores.
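For completeness, the yappi scaffolding looks roughly like the following; in the actual profiling run the threaded workload is the dask hdot pipeline rather than this toy loop:

```python
# Rough shape of the yappi run: wall-clock timing around a threaded workload.
# Here a toy pywt.dwtn loop stands in for the dask hdot call being profiled.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pywt
import yappi

blocks = [np.random.standard_normal((512, 512)) for _ in range(8)]

yappi.set_clock_type("wall")   # threads spend time waiting, so use wall time
yappi.start()
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(lambda b: pywt.dwtn(b, "db2"), blocks))
yappi.stop()

stats = yappi.get_func_stats()
stats.sort("ttot")
stats.print_all()
```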