happycube / ld-decode

Software defined LaserDisc decoder
GNU General Public License v3.0

Investigate OpenCL acceleration using pyopencl/pyvkfft #855

Open · happycube opened this issue 1 year ago

happycube commented 1 year ago

I finally decided to look into GPU acceleration after playing with whisper.cpp and realizing that OpenCL was still Actually Useful(tm). (Seriously, someone should've nudged me a while ago. Maybe I'm stubborner than I think I am... ;) )

This would involve a bit of refactoring, but if it gets a 2x performance boost it'd be worth it.

I'm still in the testing phase. On my main test platform (a Dell T3600 with a 6-core Sandy Bridge and a GeForce 3060 12GB), pyvkfft is 150% faster at the standard blocksize (64K samples) and ~15x faster at 1MB. So this will probably shift the bottleneck even further toward the TBC unless things can be kept on the GPU side most of the time.
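A minimal benchmark sketch along these lines, assuming pyvkfft's high-level transform interface over pyopencl and falling back to numpy-only numbers when no OpenCL stack is installed (the function names and iteration counts here are illustrative, not ld-decode's actual code):

```python
# Rough sketch of the comparison above: FFT time at ld-decode's standard
# blocksize (64K samples) vs a much larger 1MB buffer. GPU names/paths
# are assumptions; only the numpy path is guaranteed to run everywhere.
import time
import numpy as np

def bench_numpy_fft(blocksize, iters=50):
    """Average seconds per forward FFT of a complex64 block on the CPU."""
    data = (np.random.standard_normal(blocksize)
            + 1j * np.random.standard_normal(blocksize)).astype(np.complex64)
    start = time.perf_counter()
    for _ in range(iters):
        np.fft.fft(data)
    return (time.perf_counter() - start) / iters

if __name__ == "__main__":
    for size in (64 * 1024, 1024 * 1024):
        print(f"{size:>8} samples: {bench_numpy_fft(size) * 1e3:.3f} ms/FFT (numpy)")
    try:
        # GPU path (sketch): pyvkfft's numpy-like fftn over a pyopencl array.
        import pyopencl as cl
        import pyopencl.array as cla
        from pyvkfft.fft import fftn
        ctx = cl.create_some_context(interactive=False)
        queue = cl.CommandQueue(ctx)
        for size in (64 * 1024, 1024 * 1024):
            gpu = cla.to_device(queue, np.ones(size, dtype=np.complex64))
            start = time.perf_counter()
            for _ in range(50):
                fftn(gpu)
            queue.finish()  # wait for queued transforms before timing
            ms = (time.perf_counter() - start) / 50 * 1e3
            print(f"{size:>8} samples: {ms:.3f} ms/FFT (pyvkfft)")
    except Exception:
        print("pyopencl/pyvkfft not usable here; CPU numbers only")
```

The large-buffer case is where the GPU should pull ahead, since per-call launch and transfer overhead gets amortized.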

I'm also going to look at a secondary test potato^Wplatform, a Mele Quieter3C, which has a Celeron N5105 and its integrated GPU. The latter runs the pyvkfft benchmarks at about 4-5% the speed of the 3060, but since the CPU doesn't support AVX(2), the GPU might still come out ahead. (By the way, the newer Nxxx series does have AVX2 and would only lag behind a Haswell i5 because it has just one memory channel. Not bad.)

At a later point, I'm planning on getting my hands on an rk3588 board - if OpenCL is running there with the free drivers I'll try that too, but the A76 has enough SIMD that it might not help.

happycube commented 1 year ago

N5105 notes: not nearly as slow as I expected. It looks like ~1fps on ld-decode, and most of the OpenCL slowness is data transfer, so it's impressive the GPU results are even close.

An Alder Lake-N PC would probably do quite well for ld-decode if you put a nice NVMe drive in it. These are not your father's Atoms.

happycube commented 1 year ago

I played around with doing int16->complex64 conversion on the GPU side, and it's now 50x faster at 1MB buffers and ~2x with 32K buffers on my main system, if I'm running things right.

(The N3050 is 7.2x/1.87x respectively; I apparently finally got the 3060 properly in play.)

So the overall speedup will be limited by how much I can use the GPU-side buffers to help with TBC/scaling.
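The win from device-side int16 -> complex64 conversion comes down to transfer size: complex64 is 8 bytes per sample vs 2 for int16, so casting on the GPU cuts the PCIe traffic to a quarter. A sketch of both sides, assuming a plain pyopencl elementwise kernel (this is an illustration, not ld-decode's actual code):

```python
# CPU reference for the int16 -> complex64 widening, plus a hedged sketch
# of doing the same cast device-side so only int16 crosses the bus.
import numpy as np

def int16_to_complex64_host(raw):
    """CPU reference: widen int16 samples to complex64 (imag = 0)."""
    return raw.astype(np.float32).astype(np.complex64)

raw = np.array([0, 1, -1, 32767, -32768], dtype=np.int16)
out = int16_to_complex64_host(raw)
assert out.dtype == np.complex64 and out[3] == 32767 + 0j

try:
    # Device-side version (assumed pyopencl ElementwiseKernel; cfloat_t and
    # cfloat_new come from pyopencl's bundled complex-number header).
    import pyopencl as cl
    import pyopencl.array as cla
    from pyopencl.elementwise import ElementwiseKernel
    ctx = cl.create_some_context(interactive=False)
    queue = cl.CommandQueue(ctx)
    widen = ElementwiseKernel(
        ctx,
        "short *src, cfloat_t *dst",
        "dst[i] = cfloat_new((float) src[i], 0.0f)",
        preamble="#include <pyopencl-complex.h>",
    )
    src = cla.to_device(queue, raw)                  # 2 bytes/sample over the bus
    dst = cla.empty(queue, raw.shape, np.complex64)  # allocated on the device
    widen(src, dst)
    assert np.allclose(dst.get(), out)
except Exception:
    pass  # no working OpenCL stack here; the CPU reference above still holds
```

Keeping `dst` on the device and feeding it straight into the FFT (rather than reading it back) is what would let later stages avoid the transfer bottleneck entirely.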

happycube commented 1 year ago

I OpenCL'ified the RF stage, but the performance gains are slight for now because pyopencl doesn't release the GIL much, on top of the switch to the threading model.

https://github.com/happycube/ld-decode/tree/chad-2023.06.11-opencl2

I hear PyCUDA isn't as bad about this, but since it's locked to NVIDIA I'd have to make sure the fallback always works.
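One way to guarantee that kind of fallback is to pick the backend at import time and also catch runtime device failures, so the CPU path is exercised whenever the GPU stack is missing or broken. A sketch, assuming pyvkfft/pyopencl names; the structure is hypothetical, not how ld-decode is actually wired:

```python
# Backend-selection sketch: prefer the GPU FFT, but always be able to
# fall back to numpy so non-NVIDIA (or GPU-less) machines keep working.
import numpy as np

try:
    import pyopencl as cl
    import pyopencl.array as cla
    from pyvkfft.fft import fftn as _vk_fftn
    _ctx = cl.create_some_context(interactive=False)
    _queue = cl.CommandQueue(_ctx)
    HAVE_GPU = True
except Exception:       # missing module OR no usable device
    HAVE_GPU = False

def fft_block(block):
    """FFT one block of samples, on the GPU when possible, else numpy."""
    if HAVE_GPU:
        try:
            dev = cla.to_device(_queue, np.ascontiguousarray(block, np.complex64))
            return _vk_fftn(dev).get()
        except Exception:
            pass  # any device-side failure drops to the CPU path below
    return np.fft.fft(np.asarray(block, dtype=np.complex64))

# Both paths agree on a simple impulse: FFT of [1, 0, ..., 0] is all ones.
x = np.zeros(8, dtype=np.complex64)
x[0] = 1.0
assert np.allclose(fft_block(x), np.ones(8))
```

The broad `except Exception` around the GPU call is deliberate here: a flaky driver should degrade to slow-but-correct, not crash a multi-hour decode.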

typedrat commented 10 months ago

Obviously I'm not telling you to rewrite your whole project in another language, but I think this is really edging into territory that Python is bad at. I don't know if it's ready yet, but in the long term this seems like exactly the sort of thing that Mojo is going to be great for.