gregmoille / pyLLE

Lugiato Lefever Equation Solver in Python/Julia

Comparison of Julia with pure-python LLE implementation #39

Closed: DanHickstein closed this issue 1 year ago

DanHickstein commented 2 years ago

According to this page, the Julia implementation of the LLE is more than 4x faster than the pure-Python implementation. I am not a specialist in the LLE, but looking at the code, I would assume that the FFTs are the rate-limiting step of the process. If that is the case, then I would expect minimal difference between Python and Julia, since both languages should be calling similar FFT routines written in C or Fortran.

Python can have very different FFT speeds depending on which backend is used. If I recall correctly, np.fft uses a C implementation of the FFT, and the speed can depend dramatically on which underlying library NumPy was built against. On Intel processors, if the Intel MKL is used as the backend, good performance is obtained. However, if OpenBLAS is used (common on Linux, I believe), the results can be a factor of a few slower. I think that scipy.fftpack uses a Fortran implementation that has more consistent performance, but is not always the fastest. We encountered this issue with the PyNLO package, and a similar issue with the PyAbel package: https://pyabel.readthedocs.io/en/latest/transform_methods/comparison.html#low-level-computational-efficiency
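For anyone who wants to check this on their own machine, here is a minimal timing sketch (array size and repeat count are arbitrary); it only compares whichever libraries NumPy and SciPy happen to dispatch to on that particular install:

```python
import timeit

import numpy as np
import scipy.fft
import scipy.fftpack

# A complex test signal; FFT timings scale with size, so pick something realistic.
x = np.random.randn(2**13) + 1j * np.random.randn(2**13)

for name, fft in [
    ("numpy.fft.fft", np.fft.fft),
    ("scipy.fft.fft", scipy.fft.fft),
    ("scipy.fftpack.fft", scipy.fftpack.fft),
]:
    t = timeit.timeit(lambda: fft(x), number=1000)
    print(f"{name:18s}: {t * 1e3:.1f} ms for 1000 transforms")
```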

So, I'm curious about a few things.

1) Have you done some profiling to show what steps limit the speed of the LLE solver?

2) Could the speedup that you saw going from Python to Julia be accounted for by a switch in the underlying FFT or linear algebra implementations?

3) Do you still have the pure-python LLE implementation?

gregmoille commented 2 years ago

I agree with you that one of the main throttles is the FFT. The main gain in the core of pyLLE's Julia code is pre-planning: the FFT plan is allocated once up front, so inside the loop the transform is applied like a precomputed matrix multiplication instead of a full FFT setup at every iteration. I think this feature comes from the underlying FFTW back-end library, and it should be available through Python as well, but I never actually checked it in detail. For sure, implementing it with MKL would allow a speed-up of the Python core (although at UMD I only bought Ryzen 9 or Threadripper CPUs, so I can't use it). In the comparison I drew, I simply used regular scipy, which is not the fastest. So the "45 min" claim for the Python code is really for code that has not been optimized in any way, and it could definitely be brought down.
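For reference, the same pre-planning trick is available from Python via pyFFTW (assuming that package is installed); a minimal sketch, not pyLLE's actual code:

```python
import numpy as np
import pyfftw

N = 2**12

# Build the FFTW plan once, outside the propagation loop. pyFFTW reuses the
# same aligned buffer and the precomputed plan on every call.
buf = pyfftw.empty_aligned(N, dtype="complex128")
plan = pyfftw.builders.fft(buf)

for _ in range(1000):  # stand-in for the LLE time-stepping loop
    buf[:] = np.random.randn(N) + 1j * np.random.randn(N)
    spectrum = plan()  # executes the precomputed plan; no re-planning
```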

One thing to note, other than the FFT, is the for loops, which are generally inefficient in Python. Although you could use numba or another C wrapper in Python, Julia still seems to be faster, simply through better allocation and handling of the loops. I don't know exactly where that comes from; they claim that Julia is as fast as C and that its pre-allocation is really efficient, so maybe it comes from that. At the end of the day, Python is not known to be the fastest framework.
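As a toy illustration of that numba route (the loop body here is a placeholder reduction, not the LLE kernel):

```python
import numpy as np
from numba import njit

@njit(cache=True)
def total_power(psi):
    # The same loop the Python interpreter would execute slowly;
    # numba compiles it to machine code on the first call.
    total = 0.0
    for i in range(psi.size):
        total += abs(psi[i]) ** 2
    return total

psi = np.random.randn(10**6) + 1j * np.random.randn(10**6)
print(total_power(psi))
```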

Overall, I thought that coding in Julia would be a good exercise for me to learn a new language. pyLLE turned into a bigger thing than I originally planned, and some of the original sins are still around.

I can try to dig through my folders and backups to check whether I still have the original Python code. I can't guarantee anything, but I'll let you know if I find it.

DanHickstein commented 1 year ago

Thanks for the detailed information @gregmoille! I'd be curious to see what the speed difference is between this Julia approach and just throwing the equation to scipy.integrate.complex_ode, as is done by pyGLLE (see here and here).
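For context, a minimal sketch of what that could look like for a normalized LLE (placeholder detuning and pump values, spectral second derivative via FFT; this is not pyGLLE's actual code):

```python
import numpy as np
from scipy.integrate import complex_ode

# Normalized LLE: dpsi/dt = -(1 + 1j*zeta0)*psi + 1j*|psi|^2*psi
#                 + 1j*d2(psi)/dtheta2 + f
N = 256
theta = np.linspace(-np.pi, np.pi, N, endpoint=False)
k = 2 * np.pi * np.fft.fftfreq(N, d=theta[1] - theta[0])  # angular wavenumbers
zeta0, f = 2.0, 1.5  # placeholder detuning and pump

def rhs(t, psi):
    d2psi = np.fft.ifft(-(k**2) * np.fft.fft(psi))  # second derivative via FFT
    return -(1 + 1j * zeta0) * psi + 1j * np.abs(psi) ** 2 * psi + 1j * d2psi + f

solver = complex_ode(rhs).set_integrator("dopri5")
solver.set_initial_value(np.full(N, 0.1 + 0j), 0.0)
psi_end = solver.integrate(10.0)  # field after 10 normalized time units
```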

> At the end of the day, Python is not known to be the fastest framework.

I'd completely agree: as a non-compiled language, Python itself is slow. But generally the functions you call from NumPy/SciPy are pretty darn optimized. So, as long as you can avoid doing a bunch of loops in Python itself, Python code can be nearly as fast as anything else.
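A quick sketch of that point (the quantity computed here is arbitrary; only the loop-versus-vectorized contrast matters):

```python
import timeit

import numpy as np

psi = np.random.randn(10**6) + 1j * np.random.randn(10**6)

def python_loop():
    total = 0.0
    for v in psi:  # every iteration goes through the interpreter
        total += abs(v) ** 2
    return total

def vectorized():
    return np.sum(np.abs(psi) ** 2)  # one call into optimized C

print("loop:      ", timeit.timeit(python_loop, number=1))
print("vectorized:", timeit.timeit(vectorized, number=1))
```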

Anyway, I'll close this issue, but might circle back to this when I have time. Thanks again for the information!