styfenschaer / rocket-fft

Rocket-FFT makes Numba aware of numpy.fft and scipy.fft. Rocket-FFT takes its name from the PocketFFT Fast Fourier Transformation library that powers it, and Numba's goal of making your scientific Python code blazingly fast - like a rocket. 🚀
BSD 3-Clause "New" or "Revised" License
67 stars 2 forks source link

pyinstaller hook for rocket-fft #10

Closed GeorgWa closed 8 months ago

GeorgWa commented 8 months ago

Hi,

I'm currently trying to ship a python application which uses rocket-fft and numba using pyinstaller. Unfortunately, pyinstaller fails to recognize rocket-fft and won't include it in the static build.

Is there a way how to explicitly import rocketfft and trigger the numba plugin import?

best, Georg

styfenschaer commented 8 months ago

Hi @GeorgWa

You can explicitly trigger the initialization of rocket-fft as a Numba extension by doing the following:

import rocket_fft
rocket_fft._init_extension()

Hope that works!

GeorgWa commented 8 months ago

Hi @styfenschaer ,

thanks, I will try this! I might also call into the pocketfft module directly as I have a very narrow usecase.

Not related at all, but do you have any obvious recommendation how to speed up calculations? After reading your packages code I suspect you are somewhat of an expert, also for higher dimensional fft.

What is the tradeoff in applying a 2d fft axis wise vs, applying it once in 3d?

@nb.njit
def convolve_fourier_a0(dense, kernel):
    """
    Numba helper function to apply a gaussian filter to a dense stack.

    Parameters
    ----------

    dense : np.ndarray
        Array of shape (n_tofs, n_scans, n_frames)

    kernel : np.ndarray
        Array of shape (n_scans, n_frames)

    Returns
    -------

    np.ndarray
        Array of shape (n_tofs, n_scans, n_frames) containing the filtered dense stack.

    """

    k0, k1 = kernel.shape

    out = np.zeros_like(dense)

    fourier_filter = np.fft.rfft2(kernel, dense.shape[1:])
    for i in range(dense.shape[0]):
        out[i] = roll(
            np.fft.irfft2(np.fft.rfft2(dense[i]) * fourier_filter), -k0 // 2, -k1 // 2
        )

    return out
styfenschaer commented 8 months ago

If possible, always carry out the multidimensional transformation of a single array. Multidimensional transformations are explicitly implemented and also allow for parallelism.

It's hard to give general tips on how to speed up your function. There are certainly changes that will improve performance. However, whether it is worth implementing them depends on the size of your data. If your arrays are large, the FFT is what dominates the runtime of this function. If the arrays are small, there may be some things to change. For example, roll (I assume it's np.roll?) makes a copy of your data. You might want to write directly to out[i]. But as I said, I would carefully consider what is worth changing.

GeorgWa commented 8 months ago

I have to apply the fft on a huge number of small arrays. They all are rather small, so most of the time below (12, 2, 256), and also optimized fft-friendly shape dimensions. Looking at allocations is a good recommendation. I think I can improve the roll to be more in-place.

Regarding the multithreading, the function is already being called from within a Threadpool without GIL and I would like to have strictly no threading on the level of the FFT. At the moment, when using the low-level interface, I set nthreads in numba_c2r to 1, is there another way how to enforce no threading?

styfenschaer commented 8 months ago

Avoiding memory allocation in tight loops is definitely desirable. You may be interested in a post I wrote about some of the internals of rocket-fft. It shows that avoiding copying small arrays can make a difference! Unfortunately, I've already gotten rid of most allocations/copies. In the high level interface, there may still be some allocations (depending on the arguments used to call the function). So using the low-level interface can bring small advantages, as you have already guessed.

Regarding the number of threads: No, there is no other way to set the number of threads at runtime.

PS: Also be careful with broadcasting and lines like np.fft.rfft2(dense[i]) * fourier_filter. They likely result in new allocations. You may want to use explicit loops instead.

GeorgWa commented 8 months ago

Thanks for sharing your thoughts and rocketfft! You can close the issue if you like.