animetosho opened this issue 2 years ago
Firstly, thanks for your work and for releasing it to the public.
Regarding the benchmark, what CPU was this tested on?
I also note that region multiplication wasn't tested for the other libraries - was this not a goal, or something intended to be looked at later?
Hi, I actually forgot what CPU I used. I think it was one of my servers (an Intel CPU), but I am not sure. Yes, gf-complete had region functions and I could have included them. However, gf-complete had already been removed for patent reasons, and my goal was to show the performance of gf-nishida-16 without hardware acceleration, so I intentionally did not include it, as it might surpass gf-nishida-region-16 (haha). That said, gf-nishida-region-16 still has a hidden technique I didn't describe in the paper, and it may be even faster than them.
Except for gf-complete, I didn't see region functions in any other library. BTW, gf-ff-32 and gf-ff-64 use the wrong programs. I'll update them soon.
I see, thanks for the response.
In terms of performance for region multiplies, SIMD techniques are much faster than a log-table implementation (I'm aware that there are concerns over an almost certainly invalid patent, but I won't discuss that here), but even without SIMD, a split table lookup (for GF16) is likely better for region multiply. The main issue with log/exp tables is that they blow past a CPU's L2 cache (many Intel CPUs have a 256KB L2), which hurts performance.
I've actually been interested in the log technique for GPU compute, as CPU SIMD techniques don't work there. The classic implementation (128KB log table + 128KB exp table) takes up too much space to fit in the GPU's local/const memory, so I've been looking at ways to reduce the size of these without slowing compute too much.
Most optimisation effort seems to be focused on GF8, so I'm glad to see some investigation into GF16.
Hi, if you are interested in using only 128KB, please take a look at some new files in gf-bench/common and gf-bench/multiplication/gf-nishida-region-16 that I have just uploaded. The one using GF16crtRegionTbl() requires only 128KB per static coefficient, but it might be pretty useless in the real world. My distributed data system RNCDDS (https://www.researchgate.net/publication/319588764_RNCDDS_-_Random_Network_coded_distributed_data_system) uses 9 static coefficients alternately and so consumes 128KB * 9, which won't fit in a 256KB L2 cache. Well, please let me know if you find a good solution. I'd love to use it for my system. Thanks.
Thanks for the update. Computing a full product table is interesting, as it reduces the lookup operations, but it does require computing 128KB of data for each coefficient used.
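Concretely, the full-table approach amounts to something like this (a sketch of the idea as I understand it, assuming gf16_mul() is any scalar GF(2^16) multiply - not necessarily what GF16crtRegionTbl() does internally):
/* one 65536-entry (128KB) product table for coefficient a */
uint16_t tbl[65536];
for (uint32_t w = 0; w < 65536; w++)
    tbl[w] = gf16_mul(a, w);
/* the region loop then needs only a single lookup per word: */
for (j = 0; j < SPACE / sizeof(uint16_t); j++)
    d[j] = tbl[b[j]];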
This 128KB table can be size-reduced (which also reduces the time to compute it) by splitting up the lookup operation. This can be achieved via the distributive nature of multiplication, and breaking each 16-bit input word into high/low byte halves:
let b = 256x + y # where 0 <= x <= 255, 0 <= y <= 255
# i.e. x is the high byte, y is the low byte
a * b = a * (256x + y) = (256a)x + (a)y
We can then compute two 256-entry tables: a 256a table (indexed by x) and an a table (indexed by y).
The main loop will need to do two lookups (one for each byte) and add the halves:
for (j = 0; j < SPACE / sizeof(uint16_t); j++) {
    uint16_t bj = b[j];
    d[j] = gf_256a[bj >> 8] ^ gf_a[bj & 0xff];
}
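For reference, the two tables themselves could be built with something like this (again just a sketch, with gf16_mul() as an assumed scalar GF(2^16) multiply):
uint16_t gf_a[256], gf_256a[256];
for (int i = 0; i < 256; i++) {
    gf_a[i]    = gf16_mul(a, i);        /* products against the low byte */
    gf_256a[i] = gf16_mul(a, i << 8);   /* products against the high byte */
}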
The compute here is slightly heavier, but the lookup table now consumes only 1KB, which not only fits in an L2 cache, but also an L1 cache. This often ends up being faster because L1 cache is much faster than L2.
I wouldn't call this a log table approach any more (since you're not doing log/exp operations in the main loop), but it's perhaps the fastest non-SIMD technique I know of.
I've also experimented with a variant of this technique, which uses larger tables (12KB) to reduce lookups by 1/4. In practice, it seems to work better on some CPUs, worse on others.
A GPU may have 16-64KB of const memory, so for a log-table implementation, 128+128KB is too big. I have found some tricks to reduce this though.
Otherwise, if you're interested, for CPU, I've listed the fastest multiplication techniques I know of here.
I don't know enough about RNC, but if there's only a few coefficients ever used, there may be some tricks available.
Thanks for sharing this very interesting information. I will try the method you mentioned in the near future. I'm currently working on a certain algorithm which also uses GF. If the proposed method works, it will be a great contribution to our project.
I implemented the method and got pretty good results. In particular, it seems to work well on an ARM-based CPU. Take a look at the latest gf.[ch] and gf-bench/multiplication/gf-nishida-region-16/gf-bench.c if you are interested. Thank you.
Yep, being more cache friendly should help across the board.
You can take the approach one step further: ARM's TBL instruction or x86's PSHUFB instruction can be used to perform multiple lookups at once, but only across a 16-entry table. We can use these instructions, but need to break each input word into 4x 4-bit parts.
Actually, since these instructions only give you a one-byte result (as opposed to two bytes), you'll need to repeat the above twice (once for the low byte of the result, and again for the high byte), so a total of 8 'lookups'. However, you can apply this to 16 bytes at a time, which provides a significant speed boost.
The downside is that it's more complex to implement. Most high-performance implementations choose this algorithm, however.
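To sketch the idea with SSE intrinsics (my own untested illustration, not any repo's actual code): assume the input words have already been de-interleaved into a low-byte vector and a high-byte vector, and that eight 16-entry tables were precomputed from the coefficient a, where tbl_l[i]/tbl_h[i] hold the low/high bytes of gf16_mul(a, n << 4*i) for n = 0..15:
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */

/* Multiply 16 packed GF(2^16) words by a fixed coefficient. */
static void gf16_shuffle_mul(const __m128i tbl_l[4], const __m128i tbl_h[4],
                             __m128i in_l, __m128i in_h,
                             __m128i *out_l, __m128i *out_h)
{
    const __m128i mask = _mm_set1_epi8(0x0f);
    /* split each 16-bit word into its four 4-bit parts */
    __m128i n0 = _mm_and_si128(in_l, mask);                    /* bits 0-3   */
    __m128i n1 = _mm_and_si128(_mm_srli_epi16(in_l, 4), mask); /* bits 4-7   */
    __m128i n2 = _mm_and_si128(in_h, mask);                    /* bits 8-11  */
    __m128i n3 = _mm_and_si128(_mm_srli_epi16(in_h, 4), mask); /* bits 12-15 */
    /* 8 'lookups': low and high result bytes per nibble, combined with XOR (GF addition) */
    *out_l = _mm_xor_si128(
        _mm_xor_si128(_mm_shuffle_epi8(tbl_l[0], n0), _mm_shuffle_epi8(tbl_l[1], n1)),
        _mm_xor_si128(_mm_shuffle_epi8(tbl_l[2], n2), _mm_shuffle_epi8(tbl_l[3], n3)));
    *out_h = _mm_xor_si128(
        _mm_xor_si128(_mm_shuffle_epi8(tbl_h[0], n0), _mm_shuffle_epi8(tbl_h[1], n1)),
        _mm_xor_si128(_mm_shuffle_epi8(tbl_h[2], n2), _mm_shuffle_epi8(tbl_h[3], n3)));
}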
Oh, thank you. I'll look at them. I really appreciate your advice.
I have uploaded new benchmark results and have updated the English technical paper. The SIMD version shows really, really good performance. Thanks.
Nice, and thanks for the extra details!
Did you also test GF-Complete's 16-bit region multiply, by any chance?
No - I didn't even know they had 16-bit functions. I'll check it. Thanks.
Hi. I checked gf_complete.h but could not find multiply_region.w16. Does it exist anywhere? I'm using the library downloaded from https://github.com/ceph/gf-complete. Thanks.
8/16-bit region support is under multiply_region.w32; just initialize the 'gf' object with the correct parameters.
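For example, something like this should work (a sketch based on my reading of GF-Complete's manual - double-check the signatures):
#include <stdint.h>
#include <gf_complete.h>

void region_mul16(void *src, void *dest, uint16_t coeff, int bytes)
{
    gf_t gf;
    gf_init_easy(&gf, 16);                       /* w = 16 */
    /* last argument: 0 = overwrite dest, 1 = XOR into dest */
    gf.multiply_region.w32(&gf, src, dest, coeff, bytes, 0);
    gf_free(&gf, 0);
}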
From my reading of their paper, I guess
gf_init_easy(&gf, 16); gf.multiply_region.w32
is OK, but I'm not sure. Anyway, I have added it to the repo and have added its benchmark results to the technical paper. Their 16-bit region seems a little slower than gf-nishida-region-16's SSE/NEON versions. Also, I have added an AVX version of gf-nishida-region-16. It's even faster. Thanks.
Awesome!
Is the code for gf-complete-region-16 available? I see the benchmark results, but not the test code.
The algorithms and implementations between the two are largely the same, so I'd mostly expect similar performance.
I did notice that you're only measuring the inner loop of your implementation, whilst GF-Complete includes the coefficient set-up time. Also GF-Complete is designed for aligned memory, so could suffer a bit if the supplied buffer isn't aligned.
I suspect this won't have that much of an effect, given you're testing with a 5000KB buffer, but something to consider.
Their NEON implementation is less efficient though, so should be beatable.
Regarding the 5000KB buffer, you might find it starts limiting performance on the AVX2 kernel; a benchmark of my GF16 code illustrates this (throughput chart omitted).
Thank you for your reply.
Is the code for gf-complete-region-16 available? I see the benchmark results, but not the test code.
I forgot to git add it. Now it's available.
I did notice that you're only measuring the inner loop of your implementation,
I know that's unfair to gf-complete, but since it's closer to our use case, I purposely did so. That said, as I don't think it's a good idea to leave it like that for comparison, I moved the loading part into the loop. I think it would be a little too much to move the table creation part into the loop, so I left it outside. The benchmark results were not noticeably different. Please look at the updated technical paper.
Also GF-Complete is designed for aligned memory, so could suffer a bit if the supplied buffer isn't aligned.
OK, I have replaced malloc() with aligned_alloc() but haven't started the benchmark with it yet. It will finish by tomorrow.
Regarding the 5000KB buffer, you might find it starts limiting performance on the AVX2 kernel;
Thank you for the important information. Ummm, we use 5 to 16MB buffers in our system, but we will consider using different sizes.
I always appreciate your suggestions and information.
The benchmark results with aligned_alloc() have been uploaded. There were no differences, except for a slight speedup with multiplication/gf-complete-64.
Thanks for adding the test!
How did you install/compile GF-Complete?
I noticed your benchmark always compiles with AVX, which means your 128-bit code will be using VEX-encoded instructions. If GF-Complete was built without -march=native (like if you installed it via your OS' repository), it'll be using SSE-encoded instructions.
Testing on a Xeon L3426 (no AVX support), after removing the -mavx2 compile switch and adding #ifdef __AVX2__ around your 256-bit blocks, I get:
gf-nishida-region-16
Two step table lookup : 1339121
One step table lookup : 575995
One step lookup /w small tbls: 477370
One step lookup by SSE : 156774
gf-complete-region-16
156726
So I get roughly the same speed between the two, which is expected.
I moved the loading part into the loop.
It's entirely possible that this actually does nothing. Those are loop constants, so an optimising compiler is free to just move them outside the loop - and checking the assembly I get, that's exactly what the compiler does for me.
But I checked with the table creation in the loop, and it didn't make any noticeable difference, which isn't surprising, so it's mostly a moot point.
Ummm, we use 5 to 16MB buffers in our system, but we will consider using different sizes.
I'm not sure how you're using it exactly, but for Reed-Solomon style matrix multiplication, it basically means you want to chunk the iteration to maximise cache usage.
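As a hypothetical sketch of what I mean (region_mul_xor() stands in for any GF16 region multiply-accumulate routine - it's not a real API here):
#include <stddef.h>
#include <stdint.h>

/* stand-in for any GF16 region multiply-accumulate: dst ^= src * coeff */
extern void region_mul_xor(uint8_t *dst, const uint8_t *src, uint16_t coeff, size_t len);

/* Process the source in cache-sized chunks so each chunk stays hot
 * while every output row consumes it. */
void encode_chunked(uint8_t **dst, const uint8_t *src, size_t bytes,
                    const uint16_t *coeff, int rows)
{
    const size_t chunk = 32 * 1024;   /* tune to fit L1/L2 */
    for (size_t off = 0; off < bytes; off += chunk) {
        size_t len = (bytes - off < chunk) ? bytes - off : chunk;
        for (int r = 0; r < rows; r++)
            region_mul_xor(dst[r] + off, src + off, coeff[r], len);
    }
}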
By the way, there's a slightly faster way to interleave the bytes
// (untested, but it's something like this)
// The shuffle gathers each input's low bytes into the bottom 64 bits and its
// high bytes into the top 64 bits; the unpacks then merge the two vectors
// into one all-low-bytes vector and one all-high-bytes vector.
v_0 = _mm_shuffle_epi8(input_0, _mm_set_epi32(0x0f0d0b09, 0x07050301, 0x0e0c0a08, 0x06040200));
v_1 = _mm_shuffle_epi8(input_1, _mm_set_epi32(0x0f0d0b09, 0x07050301, 0x0e0c0a08, 0x06040200));
input_l = _mm_unpacklo_epi64(v_0, v_1);
input_h = _mm_unpackhi_epi64(v_0, v_1);
If possible though, it's better to do the de/interleaving outside of the main processing loop - this is mostly useful if you can re-use the inputs.
Thanks for your response.
How did you install/compile GF-Complete? I noticed your benchmark always compiles with AVX, which means your 128-bit code will be using VEX-encoded instructions. If GF-Complete was built without -march=native (like if you installed it via your OS' repository), it'll be using SSE-encoded instructions.
I git cloned https://github.com/ceph/gf-complete and probably used autogen.sh. The Makefile didn't include -march=native, so gf-complete was not built with it. OK, I didn't know about VEX-encoded instructions and will test without -mavx2. I'll update the benchmark results soon.
It's entirely possible that this actually does nothing.
Yes, I noticed that right after git push. I'll test with the table creation part inside the loop, too.
I'm not sure how you're using it exactly, but for Reed-Solomon style matrix multiplication, it basically means you want to chunk the iteration to maximise cache usage.
This is a slightly annoying problem for us.... To avoid the overhead of copying data too frequently from the OS kernel's network buffers (mbuf, mbuf cluster), we use pretty large buffers (there are other reasons too, but I forget them). However, we may get better encoding/decoding speeds with small buffers, as you say. We'll test this in the future. Thanks for the important information.
// (untested, but it's something like this)
v_0 = _mm_shuffle_epi8(input_0, _mm_set_epi32(0x0f0d0b09, 0x07050301, 0x0e0c0a08, 0x06040200));
v_1 = _mm_shuffle_epi8(input_1, _mm_set_epi32(0x0f0d0b09, 0x07050301, 0x0e0c0a08, 0x06040200));
input_l = _mm_unpacklo_epi64(v_0, v_1);
input_h = _mm_unpackhi_epi64(v_0, v_1);
Unfortunately, I got a slower speed with gf-bench/multiplication/gf-nishida-region-16's SSE on a Core i7-10710U. The results (elapsed time) are:
(Original) One step lookup by SSE : 56952
(New) One step lookup by SSE : 63579
The above code seems like it should be faster, but I'm not very familiar with SSE and don't know why it isn't.
Anyway, thank you for your advice and information.
I have uploaded new benchmark results with the above changes.
Well, there was no big difference with gf-nishida-region-16-SSE, and I saw a slowdown with gf-nishida-region-16-2 (one-step table lookup), which is reasonable because the table has 65536 entries.
Thanks.
To avoid the overhead of copying data too frequently from the OS kernel's network buffers (mbuf, mbuf cluster), we use pretty large buffers (there are other reasons too, but I forget them). However, we may get better encoding/decoding speeds with small buffers, as you say. We'll test this in the future.
Yes, continue using large buffers to reduce context switching overheads and similar.
Note that the receive buffer does not have to be the same size as the processing buffer (in ParPar, the default read size is 4MB, but it often gets processed using ~32KB chunks).
Unfortunately, I got a slower speed with gf-bench/multiplication/gf-nishida-region-16's SSE on a Core i7-10710U.
Actually that's probably right. I recall that Skylake/Haswell are port 5 bound on this problem, so putting more ops onto port 5 likely slows things down, even though you've reduced the number of instructions. The effect may be different on a CPU with more shuffle capacity (e.g. AMD Zen).
Well, there was no big difference with gf-nishida-region-16-SSE
That's interesting - my benchmarks showed a gain with AVX over SSSE3 on Skylake, so it's surprising that you see none. Are you sure you're running the -sse version?
I'm guessing you forgot to commit some file, as your code doesn't compile.
Thank you for your response.
I'm guessing you forgot to commit some file, as your code doesn't compile.
Yes, you are right. I forgot to git add gf-bench/*/gf-nishida-region-16/gf-bench-sse.c. They are now available, so please try them.
Running:
cd gf-bench/multiplication/gf-nishida-region-16
make bench
will output the results. Let me know if you find something weird.
Thanks for updating that.
Unfortunately, I don't have access to a Skylake CPU at the moment, but testing on a Zen1 VM (not the most reliable) I get:
Two step table lookup : 871197
One step table lookup : 437740
One step lookup /w small tbls: 559330
One step lookup by AVX : 128992
One step lookup by SSE : 195634
gf-complete-region-16 : 169936
Or via bench-all:
Multiplication
Algorithm, Speed (MB/s)
gf-complete-region-16, 2609.105984
gf-nishida-region-16-1, 550.524138
gf-nishida-region-16-2, 1094.335106
gf-nishida-region-16-3, 978.054893
gf-nishida-region-16-4-SSE, 2582.543018
gf-nishida-region-16-4-AVX, 3316.981650
Looks like your tests are fine - I even checked the assembly (compiled with GCC 7) and they look similar (listing omitted).
So I'm not sure why you're getting such a difference - maybe it's just like that on your CPU, or I'm doing something wrong.
Oh well.
I use FreeBSD 13 and tested with both clang-13 and gcc-10, but I still get no difference with gf-nishida-region-16-4-SSE. I also tested with an i5-6600K and the result was almost the same (elapsed time):
One step lookup by SSE: 59632 (No difference from -march=native)
gf-complete-region-16: 62218 (No difference from -march=native for libgf_complete)
even with libgf_complete built with the -march=avx2 option. Well, I don't know why, but there's a reason somewhere.
Thanks.
I also tested with an i5-6600K and the result was almost the same
i7-10710U and i5-6600K are both Intel Skylake CPUs, so they should demonstrate the same behaviour.
On the above Zen1 system:
-mssse3
One step lookup by SSE : 197322
-mssse3 -mavx
One step lookup by SSE : 186569
GF-Complete (default)
162982
GF-Complete with `./configure --enable-avx`
139320
Maybe AVX doesn't show any gain on Skylake for this test, though I don't know where the difference comes from.
OK, thanks. I tried with i7-9700k + FreeBSD 13 and Ryzen 5800X + Ubuntu 20.04 and got the following results:
i7-9700k + FreeBSD 13
-mssse3
One step lookup by SSE : 42002
-mssse3 -mavx
One step lookup by SSE : 41997
Ryzen 5800X + Ubuntu 20.04
-mssse3
One step lookup by SSE : 30212
-mssse3 -mavx
One step lookup by SSE : 28891
Ryzen 5800X + Ubuntu 20.04 shows a slight speedup, but i7-9700k + FreeBSD 13 does not. This may be due to OS-specific default make options, but I'm not sure. I'm not familiar with assembly, but the binaries with/without -mavx are different. I could test on another Linux machine, but I need to get permission to log in.
Intel 6600K, 9700K and 10710U are all Skylake (almost all Intel Core 6000 - 10999 CPUs are Skylake), so you're essentially just re-testing the same config. The Ryzen 5800X is a different uArch though (Zen3).
The OS shouldn't affect the speed in theory (unless it's doing something weird with AVX). I don't have any FreeBSD setups to test though.
-mavx makes all SSE instructions use VEX encoding, which helps eliminate a few move instructions, so it's slightly more efficient.
Okay. I thought Coffee/Comet Lake were different, but I see they are all based on the Skylake architecture. There's one more thing to share. I changed the buffer size from 5120000 to 4194304 and increased the repeat count from 100 to 500, as you mentioned that 5MB starts limiting the AVX2 performance. Though I don't know if it is related to the performance (and it also affects AVX), I got different results with the 10710U. The following are typical results:
SSE : 230936
SSE + AVX: 206284
On average, SSE + AVX seems approximately 10% faster.
Thanks.
That's interesting, thanks for sharing that result.
Unfortunately it doesn't explain the difference with GF-Complete, but it's at least in line with what I've generally found when enabling AVX!
This is just a thank-you message to you, animetosho. I was able to write a new research paper using the technique you told me about, and I presented the paper last week. It is about a new communication algorithm and won the best paper award in two categories at UEMCON 2023 (https://ieee-uemcon.org/). It is available at https://www.researchgate.net/publication/374776143_Secure_Fast_and_Loss-Tolerant_Communication_with_Hill_Cipher_and_Network_Coding - please take a look!
Thank you.
PS The paper says "We developed a new technique for GF......" but 'we' means mostly you :)
Congratulations!
The core idea of the shuffle-based method came from here - it's not something I came up with (though I did do a bunch of investigation into other techniques) :)
OK, GF-Complete was a great library after all. Thanks anyway!
Sorry, I have a question. It's my fault that I haven't read the paper carefully, but as long as I use the 4-bit table lookup technique, it's in fact the same technique as GF-Complete's, so I guess it has the same potential patent infringement, doesn't it? I found the patent https://patents.google.com/patent/US8683296 and it mentions the 4-bit nibble PSHUFB approach, which seems to be the same as gf-nishida-16's. What do you think?
I can't advise you on patents, but I imagine it's likely invalid due to prior art.
Though I would point out what the claims actually cover:
The system of claim 1, wherein the processing core comprises 16 data registers, each of the data registers comprises 16 bytes; and the parallel multiplier is configured to process the data in units of at least 64 bytes spread over at least four of the data registers at a time.
The system of claim 19, wherein the parallel multiplier comprises two lookup tables for doing concurrent multiplication of 4-bit quantities across 16 byte-sized entries using the PSHUFB (Packed Shuffle Bytes) instruction.
The way I interpret this is that the patent doesn't apply if your implementation doesn't match those specifics (for example, if it doesn't process at least 64 bytes spread over at least four 16-byte registers at a time).
Also worth pointing out that Intel implements the technique in their isa-l library, and it doesn't look like they have been granted a license.
Again, I can't advise you here, so you'll need to seek someone qualified on the matter.
Thank you very much. I really appreciate your opinion. I'll contact a professional. Thank you again.