Per-buffer SIMD kernels - The case of LDPC

Hello! First of all this is not an issue - more of a question / discussion about something I may consider PRing into Volk.

Short version : I've worked on LDPC decoding to implement in SatDump, and I'm thinking this could be a decent addition to Volk / GNU Radio but this comes with a few challenges differing from how things are usually done in Volk.

Long version : In SatDump & for my own needs, a proper and decently performant LDPC implementation has been something I had been in need of for quite a while. While there are already quite a few projects able to decode generic (and CCSDS) LDPC codes such as AFF3CT, gr-ccsds un-merged code and more, all of them are pretty slow especially for the usecase of processing large amount of satellite data possibly in realtime. Hence, none of them were suitable as-is. I'll skip the details of the "story" (you can see some of them here https://github.com/daniestevez/ldpc-toolbox/issues/1), but I've ended up porting gr-ccsds' code over and writing SIMD (SSE, AVX2 and NEON) versions.

If anyone wants to see the code : Generic : https://github.com/altillimity/SatDump/blob/master/src-core/common/codings/ldpc/ldpc_decoder_generic.cpp SSE : https://github.com/altillimity/SatDump/blob/master/plugins/simd_extensions/simd_sse41/ldpc_decoder/ldpc_decoder_sse.cpp AVX2 : https://github.com/altillimity/SatDump/blob/master/plugins/simd_extensions/simd_avx2/ldpc_decoder/ldpc_decoder_avx.cpp NEON : https://github.com/altillimity/SatDump/blob/master/plugins/simd_extensions/simd_neon/ldpc_decoder/ldpc_decoder_neon.cpp SIMD Selection : https://github.com/altillimity/SatDump/blob/master/src-core/common/codings/ldpc/ldpc_decoder.cpp and https://github.com/altillimity/SatDump/blob/master/src-core/common/cpu_features.cpp

However, handling those SIMD versions of the decoder led me to implementing a Volk-like system of my own, which obviously led me to thinking about adding those implementation as Volk kernels which could also benefit other projects and remove the need to handle this on my own as I am currently doing.

But as what I said previous implies this comes with major challenges : All kernels currently take an aligned buffer of a size which does NOT change depending on the underlying implementation... But in this instance, it would need to not be the case. The way SIMD is done above (and usually is for LDPC) is by processing "x" frames at once instead of a single one using SIMD operators. Hence, a plain C generic implementation will process a single frame at once, my SSE version 8 and AVX2 16. That means the underlying implementation's "simd factor" needs to be known by the user beforehand. (so that "x" frames can be buffered and packed accordingly beforehand)

An easy fix could be simply setting all kernels to process the maximum number of frames the largest SIMD can process at once (so, in this insteance taking AVX512, it would be 32), but that's a pretty awful solution in my opinion.

What do you all think? And about the idea of adding this sort of kernels to Volk?

@Aang23 first off, a fast LDPC decoder implementation is pretty cool. We have some convolutional and polar decoder code in VOLK. LDPC might fit as well. Speaking of those other implementations, I see a problem: reliable tests. Personally, I started to re-write VOLK tests with gtest. This system requires more manual implementation than the old system but does not try to do a "one-size fits all".

I had a look at your AVX implementation. It looks nice. However, it deviates from how things are done in VOLK at the moment. VOLK kernels usually don't have state. i.e. every buffer is passed in as a parameter. Also, VOLK kernels are implemented in C. Only the test system etc. are implemented in C++. It would be beneficial to be able to separate the implementation into multiple functions. This is smth VOLK does not support well. I assume, we can improve VOLK in that regard. It would be beneficial in more places.

The inter frame parallelism is another challenge. We might have a system where 1, 8, 32, etc. are available options. e.g. if you want to use 8x parallelism, you will use the SSE kernel even if AVX is available.

It'd be a good idea to discuss this further. As already pointed out, there are some areas where VOLK would need improvements and we need to figure out how we would want to fit your implementation into the VOLK structure.

@jdemel yes, that's part of the reason I've thought of this, as Volk already has similar FEC kernels. As for tests, with everything FEC I'd suggest doing both a synthetic "output test" as it was done before and an AWGN channel simulation and perhaps generate a BER curve. Not especially for automated tests there, but it's a quick way for someone running the tests to tell if they are performing as expected.

I was initially thinking only the generic_cn_kernel would be provided as a Volk kernel, ported over to C (it's already C pretty much either way). It'd take all parameters as initialized by the user (but the parallelism will need to be known beforehand still), leaving preparing / handling buffers to the user.

Perhaps another way could be for Volk to provide a "ldpc_decoder_state_t" struct or similar, which could be initialized by a first kernel to then provide a more generic, less SIMD-sensitive "volk_decode_ldpc(ldpc_decoder_state_tstate, int8_t input, uint8_t *output)" etc. That'd keep it more akin to the C++ implementation without as much having to be handled by the user. That would make more flexible too, as entirely different decoding implementations could rely on the same struct abstracting any differences to the user (which would pretty much only need to initialize 3 buffers).

Yes, if the user wants a specify parallelism... That'd be yet another challenge, as it would have to be selected by the frame count when calling the kernel. That's not something Volk support either, but perhaps in that case manual override by the user could be a solution. I'd assume in most cases only the fastest available will be utilized.

So which functions do you suggest to port to VOLK? The inner loop with SIMD instructions? The whole decoder? Oh wait: This function specifically: https://github.com/altillimity/SatDump/blob/df2fbdf67540a1feb5fa04f5907a846f7b9a456c/plugins/simd_extensions/simd_avx2/ldpc_decoder/ldpc_decoder_avx.cpp#L121

At the moment, we'd have multiplicties: 1 (generic), 8 (SSE, NEON), 16 (AVX), 32 (AVX512) OR did I mix smth up? Is 1, 16, 32, 64 correct? My idea would be to have functions that specifically say: I do 16 inter-frame parallelism etc. Under the hood, we can sort things out, i.e. select the best available kernel. Though, I don't know how to do this exactly at the moment.

I started to work on a new test infrastructure: https://github.com/jdemel/volk/tree/newtest One reason to start that effort were our problems to fix the convolutional decoder. I assume your LDPC code port would benefit from that too.

Do you suggest a specific way forward?

Yes, generic_cn_kernel, I should have been more precise :-) That is the simplest way to keep it similar to how volk currently works.

At the moment, we'd have multiplicties: 1 (generic), 8 (SSE, NEON), 16 (AVX), 32 (AVX512) OR did I mix smth up? Is 1, 16, 32, 64 correct? My idea would be to have functions that specifically say: I do 16 inter-frame parallelism etc. Under the hood, we can sort things out, i.e. select the best available kernel. Though, I don't know how to do this exactly at the moment.

With the current kernels, that is correct (1, 8, 16, 32). Using 16-bits values. In the future I'd like to try and use only 8-bits which could be another variant. Depending on how selection would be done, my worry would it having too large of an impact on performance though.

I started to work on a new test infrastructure: https://github.com/jdemel/volk/tree/newtest One reason to start that effort were our problems to fix the convolutional decoder. I assume your LDPC code port would benefit from that too.

All kernels of the sort will definitely benefit from that!

Do you suggest a specific way forward?

Personally, from my experience with Volk I'd maybe suggest deviating from the "kernel-only" system for some things. It's great for most DSP functions but it does make Volk a bit more challenging to utilize than other (often slower) solutions. Currently, all boilerplate is provided by the user, so taking my experience with the convolutional kernel, using it was a bit... Convoluted :-), and I ended up copying & modifying GNU Radio's implementation (https://github.com/altillimity/SatDump/blob/df2fbdf67540a1feb5fa04f5907a846f7b9a456c/src-core/common/codings/viterbi/cc_decoder.cpp).

In the case of adding the generic_cn_kernel alone, this would also be the case. I'd feel like Volk providing, eg :

struct ldpc_decoder_state_t; // Opaque pointer, with specifics such as ldpc_decoder_state_s16_t also available to the user (in case different decoder implementations need different states, even though very unlikely

void init_ldpc_decoder_state_s16(ldpc_decoder_state_t** state, int num_cn, int num_vn, int max_cn_deg, int num_edge, int* vn_positions, int desired_simd);

void ldpc_decoder_decode_s16_minsum(ldpc_decoder_state_t* state, int8_t* input, uint8_t *output, int num_iters);

void free_ldpc_decoder_state_s16(ldpc_decoder_state_t* state);

This would allow the user to setup a int* vn_positions arrays from a matrix (on their own), then feed that and other parameters to Volk taking care of initializing the struct (abstracting the SIMD aspect), and then being able to call a trivial decoding function without any of the boilerplate. Depending on "desired_simd" and available options, either SSE/AVX could be selected, or the generic one in a for loop etc.

That's probably, in rough lines how I would suggest doing it, as it'd make a it a lot easier for the "average developer" to use it. Perhaps also providing something similar for other similar FEC kernels such as the convolutional one.

However, this is drastically different from Volk's usual ways :-)

My biggest worry with this approach would be that VOLK currently tries to not have state. At least none of the kernels do. This would drastically change with this kernel.

The benefit of VOLK is that it provides a common interface for different implementations. Also, it comes with a system to select the best kernel automatically. I guess these features are the most important things you're looking for.

If we kept the state implementation etc. separate where would you put it? FEC tends to be way more complex than the other kernels that we usually see.

gnuradio / volk

Per-buffer SIMD kernels - The case of LDPC #611