glenvt18 / libdvbcsa

GNU General Public License v2.0
11 stars 16 forks source link

Possible transpose optimisation #6

Open richs27 opened 3 years ago

richs27 commented 3 years ago

Hi,

This isn't an "issue", but am new to Github and don't know how to contact you directly.

Thanks for the improvements you've made to this library and I thought the following may be of use, assuming it's relevant or not already considered.

Looking at the "technical_background.txt" file in docs of Eli Bihan's original ffdecsa, in "TRICK NUMBER 6: efficient normal<->slice conversion", he discusses his prefered transpose option, which I believe to be the method used in libdvbcsa.

I wondered if you'd come across Henry Warren's take on the same basic 2x2 transform for transposing 8 x 8 bits, which instead of treating each byte individually, packs 4 bytes each into 2 ints and transforms those. The Bihan method above uses 24 shifts, 48 and, 24 or for this operation and the Warren one, 22 shift, 8 and, 8 or, 12 xor including the packing and unpacking. Details can be found here and in Figure 7-2:

https://books.google.co.nz/books?id=iBNKMspIlqEC&pg=RA1-SL20-PA8&lpg=PP1&focus=viewport

Apologies if this is a waste of time. I'm still quite near the bottom of the C learning curve, but it doesn't seem to be what is currently in use and on the surface would appear to be a tidy way of achieving 8 x dvbcsa_bs_word_t transposes.

Kind regards.

glenvt18 commented 3 years ago

Hi @richs27 Nice to see someone digging into it)) libdvbcsa currently uses hardware matrix transpose on ARM/NEON (see https://github.com/glenvt18/libdvbcsa/blob/master/src/dvbcsa_bs_neon.h#L53) and a XOR-based algorithm shown in Figure 7-3 of the Henry's book on other platforms (https://github.com/glenvt18/libdvbcsa/blob/master/src/dvbcsa_bs_transpose.h). This algorithm works on dvbcsa_bs_word_t words which can be 32-256 bits (AVX2). The transpose macros also do endianess conversion at no extra cost.

richs27 commented 3 years ago

Interesting Arm has hardware capability. I had not spent much time looking at the Figure 7-3 version, as I came across the method needing to do an 8 x 32 transpose.

Given the detailed enhancements you've made elsewhere, I half expected you had this already covered. :)