happycube / ld-decode

Software defined LaserDisc decoder
GNU General Public License v3.0

ld-decode Possible performance optimizations #802

Open oyvindln opened 1 year ago

oyvindln commented 1 year ago

Master issue tracking various performance bottlenecks that could be improved on.

Memory bandwidth/use between threads

As identified by several people, a fair bit of time is spent shuffling data to and from the demod threads and concatenating the data afterwards. Just removing the completely unused data in the shared recarray in #796 gave a notable improvement in performance, but there is more that could be improved.

FFT

The real-input rfft functions should be used rather than fft wherever we don't need the imaginary part (which is only needed for the hilbert/demod function, afaik), as they are faster, and we wouldn't need to store as much data for the FFT filters either.
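A minimal sketch of the point above (illustrative block size, not ld-decode's actual one): for real-valued input, `rfft` computes and stores only the non-negative-frequency half of the spectrum, since the other half is just its complex conjugate.

```python
import numpy as np

# Illustrative block size only; not ld-decode's real blocklen.
blocklen = 32768
x = np.random.default_rng(0).standard_normal(blocklen).astype(np.float32)

full = np.fft.fft(x)    # blocklen complex bins, redundant for real input
half = np.fft.rfft(x)   # blocklen // 2 + 1 bins; the rest are conjugates

# The two agree where they overlap, at roughly half the compute/storage.
assert np.allclose(full[: blocklen // 2 + 1], half, atol=1e-3)
print(len(full), len(half))  # 32768 16385
```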

We're currently using pyFFTW rather than numpy's fft for speed. It has a bunch of settings/caching options one could play around with to improve things further. It's currently not used on Windows, as it seems to conflict with using Thread instead of Process (and Process doesn't work on Windows with the current code).

numba/native code optimization

Some of the TBC/sync code could benefit a ton from numba (or alternatively cython or similar), as a lot of its logic runs in plain Python loops, which is slow: dropout_detect_demod, refine_linelocs_pilot and refine_linelocs_hsync in particular, but probably more. (I've implemented the last one partially in cython in vhs-decode.)

Any runs involving EFM will have a fair bit of extra startup time, as the EFM decoding uses numba classes, whose compilation can't be cached, so they have to be re-compiled on every run. If we start using cython or similar in ld-decode, it might be worth using that for this purpose instead.
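To make the two numba points above concrete, here is a hedged sketch (the function is a made-up stand-in, not actual ld-decode code) of the kind of per-sample loop that functions like refine_linelocs_hsync contain. With plain `@njit` the compilation can be cached to disk via `cache=True`, which is exactly what numba's jitclass (used by the EFM code) cannot do, hence the startup cost.

```python
import numpy as np

try:
    from numba import njit
except ImportError:              # fallback so the sketch runs without numba
    def njit(**kwargs):
        return lambda f: f

# Hypothetical stand-in for the kind of per-sample loop in
# refine_linelocs_hsync: slow as Python bytecode, fast once compiled.
# cache=True persists the compiled code between runs, which plain
# functions support but numba jitclasses currently do not.
@njit(cache=True)
def count_rising_crossings(data, level):
    n = 0
    for i in range(1, data.shape[0]):
        if data[i - 1] < level and data[i] >= level:
            n += 1
    return n

sig = np.array([-1.0, 1.0, 0.5, -1.0, 1.0, -1.0], dtype=np.float32)
print(count_rising_crossings(sig, 0.0))  # 2
```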

JSON

I don't know if this has a large performance hit in practice, but as of now we rewrite the whole JSON file rather than appending to it, and it can get pretty large on long runs. It might be worth looking into whether it's feasible to just append to the file and modify the needed parts at the start instead.
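One possible shape of the append idea, as a hypothetical sketch (the helper name and layout are made up, not how ld-decode stores its JSON): if the growing part of the file is a JSON array, new records can be appended by seeking back over the trailing `]` and overwriting it, instead of re-serializing everything.

```python
import json
import os
import tempfile

def append_record(path, record):
    """Hypothetical sketch: append one record to a JSON array file
    in place by overwriting the trailing ']' rather than rewriting
    the whole file."""
    with open(path, "r+b") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell() - 1
        while pos >= 0:              # scan back to the closing ']'
            f.seek(pos)
            if f.read(1) == b"]":
                break
            pos -= 1
        if pos < 0:
            raise ValueError("not a JSON array file")
        prev = pos - 1               # is the array currently empty?
        while prev >= 0:
            f.seek(prev)
            if not f.read(1).isspace():
                break
            prev -= 1
        f.seek(prev)
        empty = f.read(1) == b"["
        f.seek(pos)
        sep = b"" if empty else b","
        f.write(sep + json.dumps(record).encode() + b"]")
        f.truncate()

# Demo with a throwaway file:
path = os.path.join(tempfile.gettempdir(), "append_demo.json")
with open(path, "w") as f:
    f.write("[]")
append_record(path, {"field": 1})
append_record(path, {"field": 2})
with open(path) as f:
    print(json.load(f))  # [{'field': 1}, {'field': 2}]
```

The fixed-size metadata "at the start" would still need an in-place overwrite (e.g. a reserved, padded header region), which is the fiddlier half of the idea.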

atsampson commented 1 year ago

efm_pll.py was originally written in C++, and the C++ version was about 4x faster than the first numba version. I think it would be worth investigating calling into C++ helper functions from the Python code for performance-sensitive things like this (especially given the recent massive increases in electricity prices in the UK!). It shouldn't be very difficult to build a C++-based Python module with CMake.

It may be worth checking whether pyFFTW is still faster than scipy.fft - it was a bit faster at the time, but scipy may have improved since.
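A rough way to check this, sketched below with made-up sizes (and an import fallback so it still runs where pyFFTW isn't installed): time the same real-input transform through both libraries. pyFFTW's interfaces layer mimics the scipy/numpy API, and enabling its cache lets FFTW plans be reused between calls, which matters a lot for a fair comparison.

```python
import timeit

import numpy as np
import scipy.fft

# Illustrative size only, not an ld-decode blocklen.
x = np.random.default_rng(0).standard_normal(2**16).astype(np.float32)

def bench(name, fn, number=200):
    t = timeit.timeit(lambda: fn(x), number=number)
    print(f"{name}: {t / number * 1e6:.1f} us/call")

bench("scipy.fft.rfft", scipy.fft.rfft)
try:
    import pyfftw
    import pyfftw.interfaces.scipy_fft as pfft
    pyfftw.interfaces.cache.enable()  # reuse FFTW plans between calls
    bench("pyfftw rfft", pfft.rfft)
except ImportError:
    bench("numpy.fft.rfft (pyFFTW not installed)", np.fft.rfft)
```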

atsampson commented 1 year ago

Re 32-bit vs. 64-bit float: the same is true in some of the tools, since they only need ~16 bits of precision. ld-chroma-decoder uses doubles throughout at the moment, but I expect it would work fine with floats (~24 bits of significand), roughly halving the memory bandwidth needed.
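A quick sanity check of that trade-off (generic numpy arithmetic, not ld-chroma-decoder code): float32 keeps a 24-bit significand, comfortably above the ~16 bits of precision the sampled signal actually carries, while using half the bytes per sample.

```python
import numpy as np

# Significand widths: 24 bits for float32, 53 for float64.
print(np.finfo(np.float32).nmant + 1)  # 24
print(np.finfo(np.float64).nmant + 1)  # 53

# Same buffer at half the memory footprint (illustrative size).
frame64 = np.zeros(1_000_000, dtype=np.float64)
frame32 = frame64.astype(np.float32)
print(frame64.nbytes // frame32.nbytes)  # 2
```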

oyvindln commented 1 year ago

Another thing that seems to take up a bit of time is concatenating the input/output arrays in demodcache.

When reading from file, it reads into a list of blocks (arrays) and then concatenates them afterwards. Maybe it would be possible to read into the larger array in one go, and/or read more at a time, to avoid spending as much time on that.

The output side might be more complicated.
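One possible shape of the input-side fix, as a hedged sketch (the function, sizes, and file are made up, not demodcache's real layout): preallocate the full output array and read each block directly into a view of it with `readinto()`, so there is no block list and no `np.concatenate` copy at the end.

```python
import os
import tempfile

import numpy as np

# Illustrative sizes only.
NBLOCKS, BLOCKLEN = 4, 1024

def read_blocks_into(path, nblocks, blocklen, dtype=np.int16):
    """Hypothetical sketch: fill one preallocated array in place
    instead of concatenating a list of per-block arrays."""
    out = np.empty(nblocks * blocklen, dtype=dtype)
    with open(path, "rb") as f:
        for i in range(nblocks):
            view = out[i * blocklen : (i + 1) * blocklen]
            if f.readinto(view) != view.nbytes:  # writes into the slice
                raise EOFError("short read")
    return out

# Demo: write known samples, read them back with no intermediate copies.
path = os.path.join(tempfile.gettempdir(), "readinto_demo.bin")
data = np.arange(NBLOCKS * BLOCKLEN, dtype=np.int16)
data.tofile(path)
result = read_blocks_into(path, NBLOCKS, BLOCKLEN)
print(np.array_equal(result, data))  # True
```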

putnam commented 1 year ago

Just noticed I got a mention here.

Another thing I had started to work on but didn't have the time to see through: I think it's not impossible, maybe even easy(?), to migrate the code to work with CuPy (i.e., run on the GPU). Some fundamental structures in ld-decode need tweaking in order to port the code over, but for the most part CuPy is a drop-in replacement for scipy/numpy. This would also have the benefit of staying backward compatible with CPU-only machines.
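The usual drop-in pattern looks something like the sketch below (a generic illustration, not ld-decode code): pick the array module once at import time, then write the rest of the pipeline against that name, so the same code runs on GPU when CuPy is present and on plain numpy otherwise.

```python
import numpy as np

try:
    import cupy as xp          # GPU path, if CuPy is installed
    on_gpu = True
except ImportError:
    xp = np                    # CPU fallback: same API surface
    on_gpu = False

# Toy pipeline: a 4-cycle sine, so the rfft peak lands in bin 4.
sig = xp.sin(xp.linspace(0, 8 * np.pi, 4096, dtype=xp.float32))
spec = xp.fft.rfft(sig)
peak = int(xp.argmax(xp.abs(spec)))
print("dominant bin:", peak, "(gpu)" if on_gpu else "(cpu)")
```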

happycube commented 1 year ago

I converted the filters and FFT processing to float32/complex64 (which should convert much of the TBC code further down the line as well), and performance is 15% higher on my AVX1 Sandy Bridge Xeon. (I haven't benchmarked Haswell yet.)