happycube / ld-decode

Software defined LaserDisc decoder
GNU General Public License v3.0

ld-decode Possible performance optimizations #802

Open oyvindln opened 1 year ago

oyvindln commented 1 year ago

Master issue tracking various performance bottlenecks that could be improved on.

Memory bandwidth/use between threads

As identified by several people, a fair bit of time is spent shuffling data to and from the demod threads and concatenating the data afterwards. Just removing the completely unused data in the shared recarray in #796 gave a notable improvement in performance, but there is more that could be improved.

FFT

The real-input rfft functions should be used rather than fft wherever we don't need the imaginary part (which is only needed for the hilbert/demod function, afaik), as they are faster, and we wouldn't need to store as much data for the FFT filters either.
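A minimal sketch of the point above (illustrative block size, not ld-decode's actual one): for real-valued input, `rfft` computes and stores only the non-negative-frequency half of the spectrum, since the other half is just its complex conjugate.

```python
import numpy as np

# Illustrative block size only; not ld-decode's real blocklen.
blocklen = 32768
x = np.random.default_rng(0).standard_normal(blocklen).astype(np.float32)

full = np.fft.fft(x)    # blocklen complex bins, redundant for real input
half = np.fft.rfft(x)   # blocklen // 2 + 1 bins; the rest are conjugates

# The two agree where they overlap, at roughly half the compute/storage.
assert np.allclose(full[: blocklen // 2 + 1], half, atol=1e-3)
print(len(full), len(half))  # 32768 16385
```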

We're currently using pyFFTW rather than numpy's fft for speed. It has a bunch of settings/caching options one could play around with to improve things further. It's currently not used on Windows, as it seems to conflict with using Thread instead of Process (and Process doesn't work on Windows with the current code).

numba/native code optimization

Some of the TBC/sync code could benefit a ton from numba (or alternatively cython or similar), as a lot of its logic runs in plain Python loops, which is slow: dropout_detect_demod, refine_linelocs_pilot and refine_linelocs_hsync in particular, but probably more. (I've implemented the last one partially in cython in vhs-decode.)

Any runs involving EFM will have a fair bit of extra startup time, as the EFM decoding uses numba classes, whose compilation can't be cached, so they have to be re-compiled on every run. If we start using cython or similar in ld-decode, it might be worth using that for this purpose instead.
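To make the two numba points above concrete, here is a hedged sketch (the function is a made-up stand-in, not actual ld-decode code) of the kind of per-sample loop that functions like refine_linelocs_hsync contain. With plain `@njit` the compilation can be cached to disk via `cache=True`, which is exactly what numba's jitclass (used by the EFM code) cannot do, hence the startup cost.

```python
import numpy as np

try:
    from numba import njit
except ImportError:              # fallback so the sketch runs without numba
    def njit(**kwargs):
        return lambda f: f

# Hypothetical stand-in for the kind of per-sample loop in
# refine_linelocs_hsync: slow as Python bytecode, fast once compiled.
# cache=True persists the compiled code between runs, which plain
# functions support but numba jitclasses currently do not.
@njit(cache=True)
def count_rising_crossings(data, level):
    n = 0
    for i in range(1, data.shape[0]):
        if data[i - 1] < level and data[i] >= level:
            n += 1
    return n

sig = np.array([-1.0, 1.0, 0.5, -1.0, 1.0, -1.0], dtype=np.float32)
print(count_rising_crossings(sig, 0.0))  # 2
```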

JSON

I don't know if this has a large performance hit in practice, but as of now we rewrite the whole JSON file rather than appending to it, and it can get pretty large on long runs. It might be worth looking into whether it's feasible to just append to the file and modify the needed parts at the start instead.
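One possible shape of the append idea, as a hypothetical sketch (the helper name and layout are made up, not how ld-decode stores its JSON): if the growing part of the file is a JSON array, new records can be appended by seeking back over the trailing `]` and overwriting it, instead of re-serializing everything.

```python
import json
import os
import tempfile

def append_record(path, record):
    """Hypothetical sketch: append one record to a JSON array file
    in place by overwriting the trailing ']' rather than rewriting
    the whole file."""
    with open(path, "r+b") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell() - 1
        while pos >= 0:              # scan back to the closing ']'
            f.seek(pos)
            if f.read(1) == b"]":
                break
            pos -= 1
        if pos < 0:
            raise ValueError("not a JSON array file")
        prev = pos - 1               # is the array currently empty?
        while prev >= 0:
            f.seek(prev)
            if not f.read(1).isspace():
                break
            prev -= 1
        f.seek(prev)
        empty = f.read(1) == b"["
        f.seek(pos)
        sep = b"" if empty else b","
        f.write(sep + json.dumps(record).encode() + b"]")
        f.truncate()

# Demo with a throwaway file:
path = os.path.join(tempfile.gettempdir(), "append_demo.json")
with open(path, "w") as f:
    f.write("[]")
append_record(path, {"field": 1})
append_record(path, {"field": 2})
with open(path) as f:
    print(json.load(f))  # [{'field': 1}, {'field': 2}]
```

The fixed-size metadata "at the start" would still need an in-place overwrite (e.g. a reserved, padded header region), which is the fiddlier half of the idea.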

atsampson commented 1 year ago

efm_pll.py was originally written in C++, and the C++ version was about 4x faster than the first numba version. I think it would be worth investigating calling into C++ helper functions from the Python code for performance-sensitive things like this (especially given the recent massive increases in electricity prices in the UK!). It shouldn't be very difficult to build a C++-based Python module with CMake.

It may be worth checking whether pyFFTW is still faster than scipy.fft - it was a bit faster at the time, but scipy may have improved since.
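A rough way to check this, sketched below with made-up sizes (and an import fallback so it still runs where pyFFTW isn't installed): time the same real-input transform through both libraries. pyFFTW's interfaces layer mimics the scipy/numpy API, and enabling its cache lets FFTW plans be reused between calls, which matters a lot for a fair comparison.

```python
import timeit

import numpy as np
import scipy.fft

# Illustrative size only, not an ld-decode blocklen.
x = np.random.default_rng(0).standard_normal(2**16).astype(np.float32)

def bench(name, fn, number=200):
    t = timeit.timeit(lambda: fn(x), number=number)
    print(f"{name}: {t / number * 1e6:.1f} us/call")

bench("scipy.fft.rfft", scipy.fft.rfft)
try:
    import pyfftw
    import pyfftw.interfaces.scipy_fft as pfft
    pyfftw.interfaces.cache.enable()  # reuse FFTW plans between calls
    bench("pyfftw rfft", pfft.rfft)
except ImportError:
    bench("numpy.fft.rfft (pyFFTW not installed)", np.fft.rfft)
```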

atsampson commented 1 year ago

Re 32-bit vs. 64-bit float: the same is true in some of the tools, since they only need ~16 bits of precision. ld-chroma-decoder uses doubles throughout at the moment, but I expect it would work fine with floats (~24 bits of significand), roughly halving the memory bandwidth needed.
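A quick sanity check of that trade-off (generic numpy arithmetic, not ld-chroma-decoder code): float32 keeps a 24-bit significand, comfortably above the ~16 bits of precision the sampled signal actually carries, while using half the bytes per sample.

```python
import numpy as np

# Significand widths: 24 bits for float32, 53 for float64.
print(np.finfo(np.float32).nmant + 1)  # 24
print(np.finfo(np.float64).nmant + 1)  # 53

# Same buffer at half the memory footprint (illustrative size).
frame64 = np.zeros(1_000_000, dtype=np.float64)
frame32 = frame64.astype(np.float32)
print(frame64.nbytes // frame32.nbytes)  # 2
```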

oyvindln commented 1 year ago

Another thing that seems to take up a bit of time is concatenating the input/output arrays in demodcache.

When reading from file, it reads into a list of blocks (arrays) and then concatenates them afterwards. Maybe it would be possible to read into the larger array in one go, and/or read more at a time, to avoid spending as much time on that.

The output side might be more complicated.
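One possible shape of the input-side fix, as a hedged sketch (the function, sizes, and file are made up, not demodcache's real layout): preallocate the full output array and read each block directly into a view of it with `readinto()`, so there is no block list and no `np.concatenate` copy at the end.

```python
import os
import tempfile

import numpy as np

# Illustrative sizes only.
NBLOCKS, BLOCKLEN = 4, 1024

def read_blocks_into(path, nblocks, blocklen, dtype=np.int16):
    """Hypothetical sketch: fill one preallocated array in place
    instead of concatenating a list of per-block arrays."""
    out = np.empty(nblocks * blocklen, dtype=dtype)
    with open(path, "rb") as f:
        for i in range(nblocks):
            view = out[i * blocklen : (i + 1) * blocklen]
            if f.readinto(view) != view.nbytes:  # writes into the slice
                raise EOFError("short read")
    return out

# Demo: write known samples, read them back with no intermediate copies.
path = os.path.join(tempfile.gettempdir(), "readinto_demo.bin")
data = np.arange(NBLOCKS * BLOCKLEN, dtype=np.int16)
data.tofile(path)
result = read_blocks_into(path, NBLOCKS, BLOCKLEN)
print(np.array_equal(result, data))  # True
```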

putnam commented 1 year ago

Just noticed I got a mention here.

Another thing I had started to work on but didn't have the time to see through: I think it's not impossible, maybe even easy(?), to migrate the code to work with CuPy (i.e., run on the GPU). Some fundamental structures in ld-decode need tweaking in order to port the code over, but for the most part CuPy is a drop-in replacement for scipy/numpy. This would also have the benefit of staying backward compatible with CPU-only machines.
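The usual drop-in pattern looks something like the sketch below (a generic illustration, not ld-decode code): pick the array module once at import time, then write the rest of the pipeline against that name, so the same code runs on GPU when CuPy is present and on plain numpy otherwise.

```python
import numpy as np

try:
    import cupy as xp          # GPU path, if CuPy is installed
    on_gpu = True
except ImportError:
    xp = np                    # CPU fallback: same API surface
    on_gpu = False

# Toy pipeline: a 4-cycle sine, so the rfft peak lands in bin 4.
sig = xp.sin(xp.linspace(0, 8 * np.pi, 4096, dtype=xp.float32))
spec = xp.fft.rfft(sig)
peak = int(xp.argmax(xp.abs(spec)))
print("dominant bin:", peak, "(gpu)" if on_gpu else "(cpu)")
```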

happycube commented 1 year ago

I converted the filters and FFT processing to float32/complex64 (which should convert much of the TBC code further down the line as well), and performance is 15% higher on my AVX1 Sandy Bridge Xeon. (I haven't benchmarked Haswell yet.)