animetosho opened this issue 2 years ago
Firstly, thanks for your work and for releasing it to the public.
Regarding the benchmark, what CPU was this tested on?
I also note that region multiplication wasn't tested for the other libraries - was this not a goal, or something intended to be looked at later?
Hi, I actually forgot what CPU I used. I think it was one of my servers (an Intel CPU), but I am not sure. Yes, gf-complete had region functions and I could have included them. However, gf-complete had already been removed for patent reasons, and my goal was to show the performance of gf-nishida-16 without hardware acceleration, so I intentionally did not include it, as it might surpass gf-nishida-region-16 (haha). That said, gf-nishida-region-16 still has a hidden technique I didn't describe in the paper, and it may be even faster than them.
Except for gf-complete, I didn't see region functions in any other library. BTW, gf-ff-32 and gf-ff-64 use the wrong programs. I'll update them soon.
I see, thanks for the response.
In terms of performance for region multiplies, SIMD techniques are much faster than a log-table implementation (I'm aware that there are concerns over an almost certainly invalid patent, but I won't discuss that here), but even without SIMD, a split table lookup (for GF16) is likely better for region multiply. The main issue with log/exp tables is that they blow past a CPU's L2 cache (many Intel CPUs have a 256KB L2), which hurts performance.
I've actually been interested in the log technique for GPU compute, as CPU SIMD techniques don't work there. The classic implementation (128KB log table + 128KB exp table) takes up too much space to fit in the GPU's local/const memory, so I've been looking at ways to reduce the size of these without slowing compute too much.
Most optimisation effort seems to be focused on GF8, so I'm glad to see some investigation into GF16.
Hi, if you are interested in using only 128KB, please take a look at some new files in gf-bench/common and gf-bench/multiplication/gf-nishida-region-16 that I have just uploaded. The one using GF16crtRegionTbl() requires only 128KB per static coefficient, but it might be pretty useless in the real world. My distributed data system RNCDDS (https://www.researchgate.net/publication/319588764_RNCDDS_-_Random_Network_coded_distributed_data_system) uses 9 static coefficients alternately and so consumes 128KB * 9, which won't fit in a 256KB L2 cache. Well, please let me know if you find a good solution. I'd love to use it for my system. Thanks.
Thanks for the update. Computing a full product table is interesting, as it reduces the lookup operations, but it does require computing 128KB of data for each coefficient used.
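Concretely, the full-table approach amounts to something like this (a sketch of the idea as I understand it, assuming gf16_mul() is any scalar GF(2^16) multiply - not necessarily what GF16crtRegionTbl() does internally):
/* one 65536-entry (128KB) product table for coefficient a */
uint16_t tbl[65536];
for (uint32_t w = 0; w < 65536; w++)
    tbl[w] = gf16_mul(a, w);
/* the region loop then needs only a single lookup per word: */
for (j = 0; j < SPACE / sizeof(uint16_t); j++)
    d[j] = tbl[b[j]];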
This 128KB table can be size-reduced (which also reduces the time to compute it) by splitting up the lookup operation. This can be achieved via the distributive nature of multiplication, and breaking each 16-bit input word into high/low byte halves:
let b = 256x + y # where 0 <= x <= 255, 0 <= y <= 255
# i.e. x is the high byte, y is the low byte
a * b = a * (256x + y) = (256a)x + (a)y
We can then compute two 256-entry tables: a 256a table (indexed by x) and an a table (indexed by y).
The main loop will need to do two lookups (one for each byte) and add the halves:
for (j = 0; j < SPACE / sizeof(uint16_t); j++) {
    uint16_t bj = b[j];
    d[j] = gf_256a[bj >> 8] ^ gf_a[bj & 0xff];
}
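For reference, the two tables themselves could be built with something like this (again just a sketch, with gf16_mul() as an assumed scalar GF(2^16) multiply):
uint16_t gf_a[256], gf_256a[256];
for (int i = 0; i < 256; i++) {
    gf_a[i]    = gf16_mul(a, i);        /* products against the low byte */
    gf_256a[i] = gf16_mul(a, i << 8);   /* products against the high byte */
}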
The compute here is slightly heavier, but the lookup table now consumes only 1KB, which not only fits in an L2 cache, but also an L1 cache. This often ends up being faster because L1 cache is much faster than L2.
I wouldn't call this a log table approach any more (since you're not doing log/exp operations in the main loop), but it's perhaps the fastest non-SIMD technique I know of.
I've also experimented with a variant of this technique, which uses larger tables (12KB) to reduce lookups by 1/4. In practice, it seems to work better on some CPUs, worse on others.
A GPU may have 16-64KB of const memory, so for a log-table implementation, 128+128KB is too big. I have found some tricks to reduce this though.
Otherwise, if you're interested, for CPU, I've listed the fastest multiplication techniques I know of here.
I don't know enough about RNC, but if there's only a few coefficients ever used, there may be some tricks available.
Thanks for sharing this very interesting information. I will try the method you mentioned in the near future. I'm currently working on a certain algorithm which also uses GF. If the proposed method works, it will be a great contribution to our project.
I implemented the method and got pretty good results. In particular, it seems to work well on an ARM-based CPU. Take a look at the latest gf.[ch] and gf-bench/multiplication/gf-nishida-region-16/gf-bench.c if you are interested. Thank you.
Yep, being more cache friendly should help across the board.
You can take the approach one step further: ARM's TBL instruction or x86's PSHUFB instruction can be used to perform multiple lookups at once, but only across a 16-entry table. We can use these instructions, but need to break each input word into 4x 4-bit parts.
Actually, since these instructions only give you a one-byte result (as opposed to two bytes), you'll need to repeat the above twice (once for the low byte of the result, and again for the high byte), so a total of 8 'lookups'. However, you can apply this to 16 bytes at a time, which provides a significant speed boost.
The downside is that it's more complex to implement. Most high-performance implementations choose this algorithm, however.
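To sketch the idea with SSE intrinsics (my own untested illustration, not any repo's actual code): assume the input words have already been de-interleaved into a low-byte vector and a high-byte vector, and that eight 16-entry tables were precomputed from the coefficient a, where tbl_l[i]/tbl_h[i] hold the low/high bytes of gf16_mul(a, n << 4*i) for n = 0..15:
#include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */

/* Multiply 16 packed GF(2^16) words by a fixed coefficient. */
static void gf16_shuffle_mul(const __m128i tbl_l[4], const __m128i tbl_h[4],
                             __m128i in_l, __m128i in_h,
                             __m128i *out_l, __m128i *out_h)
{
    const __m128i mask = _mm_set1_epi8(0x0f);
    /* split each 16-bit word into its four 4-bit parts */
    __m128i n0 = _mm_and_si128(in_l, mask);                    /* bits 0-3   */
    __m128i n1 = _mm_and_si128(_mm_srli_epi16(in_l, 4), mask); /* bits 4-7   */
    __m128i n2 = _mm_and_si128(in_h, mask);                    /* bits 8-11  */
    __m128i n3 = _mm_and_si128(_mm_srli_epi16(in_h, 4), mask); /* bits 12-15 */
    /* 8 'lookups': low and high result bytes per nibble, combined with XOR (GF addition) */
    *out_l = _mm_xor_si128(
        _mm_xor_si128(_mm_shuffle_epi8(tbl_l[0], n0), _mm_shuffle_epi8(tbl_l[1], n1)),
        _mm_xor_si128(_mm_shuffle_epi8(tbl_l[2], n2), _mm_shuffle_epi8(tbl_l[3], n3)));
    *out_h = _mm_xor_si128(
        _mm_xor_si128(_mm_shuffle_epi8(tbl_h[0], n0), _mm_shuffle_epi8(tbl_h[1], n1)),
        _mm_xor_si128(_mm_shuffle_epi8(tbl_h[2], n2), _mm_shuffle_epi8(tbl_h[3], n3)));
}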
Oh, thank you. I'll look at them. I really appreciate your advice.
I have uploaded new benchmark results and have updated the English technical paper. The SIMD version shows really, really good performance. Thanks.
Nice, and thanks for the extra details!
Did you also test GF-Complete's 16-bit region multiply, by any chance?
No - I didn't even know they had 16-bit functions. I'll check it. Thanks.
Hi. I checked gf_complete.h but could not find multiply_region.w16. Does it exist anywhere? I'm using the library downloaded from https://github.com/ceph/gf-complete. Thanks.
8/16-bit region support is under multiply_region.w32; just initialize the 'gf' object with the correct parameters.
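For example, something like this should work (a sketch based on my reading of GF-Complete's manual - double-check the signatures):
#include <stdint.h>
#include <gf_complete.h>

void region_mul16(void *src, void *dest, uint16_t coeff, int bytes)
{
    gf_t gf;
    gf_init_easy(&gf, 16);                       /* w = 16 */
    /* last argument: 0 = overwrite dest, 1 = XOR into dest */
    gf.multiply_region.w32(&gf, src, dest, coeff, bytes, 0);
    gf_free(&gf, 0);
}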
From my reading of their paper, I guess
gf_init_easy(&gf, 16); gf.multiply_region.w32
is OK, but I'm not sure. Anyway, I have added it to the repo and have added its benchmark results to the technical paper. Their 16-bit region seems a little slower than gf-nishida-region-16's SSE/NEON versions. Also, I have added an AVX version of gf-nishida-region-16. It's even faster. Thanks.
Awesome!
Is the code for gf-complete-region-16 available? I see the benchmark results, but not the test code.
The algorithms and implementations between the two are largely the same, so I'd mostly expect similar performance.
I did notice that you're only measuring the inner loop of your implementation, whilst GF-Complete includes the coefficient set-up time. Also GF-Complete is designed for aligned memory, so could suffer a bit if the supplied buffer isn't aligned.
I suspect this won't have that much of an effect, given you're testing with a 5000KB buffer, but something to consider.
Their NEON implementation is less efficient though, so should be beatable.
Regarding the 5000KB buffer, you might find it starts limiting performance on the AVX2 kernel; a benchmark of my GF16 code illustrates this (throughput chart omitted).
Thank you for your reply.
Is the code for gf-complete-region-16 available? I see the benchmark results, but not the test code.
I forgot to git add it. Now it's available.
I did notice that you're only measuring the inner loop of your implementation,
I know that's unfair to gf-complete, but since it's closer to our use case, I purposely did so. That said, as I don't think it's a good idea to leave it like that for comparison, I moved the loading part into the loop. I think it would be a little too much to move the table creation part into the loop, so I left it outside. The benchmark results were not noticeably different. Please look at the updated technical paper.
Also GF-Complete is designed for aligned memory, so could suffer a bit if the supplied buffer isn't aligned.
OK, I have replaced malloc() with aligned_alloc() but haven't started the benchmark with it yet. It will finish by tomorrow.
Regarding the 5000KB buffer, you might find it starts limiting performance on the AVX2 kernel;
Thank you for the important information. Ummm, we use 5 to 16MB buffers in our system, but we will consider using different sizes.
I always appreciate your suggestions and information.
The benchmark results with aligned_alloc() have been uploaded. There were no differences, except for a slight speedup with multiplication/gf-complete-64.
Thanks for adding the test!
How did you install/compile GF-Complete?
I noticed your benchmark always compiles with AVX, which means your 128-bit code will be using VEX-encoded instructions. If GF-Complete was built without -march=native (like if you installed it via your OS' repository), it'll be using SSE-encoded instructions.
Testing on a Xeon L3426 (no AVX support), after removing the -mavx2 compile switch and adding #ifdef __AVX2__ around your 256-bit blocks, I get:
gf-nishida-region-16
Two step table lookup : 1339121
One step table lookup : 575995
One step lookup /w small tbls: 477370
One step lookup by SSE : 156774
gf-complete-region-16
156726
So I get roughly the same speed between the two, which is expected.
I moved the loading part into the loop.
It's entirely possible that this actually does nothing. Those are loop constants, so an optimising compiler is free to just move them outside the loop - and checking the assembly I get, that's exactly what the compiler does for me.
But I checked with the table creation in the loop, and it didn't make any noticeable difference, which isn't surprising, so it's mostly a moot point.
Ummm, we use 5 to 16MB buffers in our system, but we will consider using different sizes.
I'm not sure how you're using it exactly, but for Reed-Solomon style matrix multiplication, it basically means you want to chunk the iteration to maximise cache usage.
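As a hypothetical sketch of what I mean (region_mul_xor() stands in for any GF16 region multiply-accumulate routine - it's not a real API here):
#include <stddef.h>
#include <stdint.h>

/* stand-in for any GF16 region multiply-accumulate: dst ^= src * coeff */
extern void region_mul_xor(uint8_t *dst, const uint8_t *src, uint16_t coeff, size_t len);

/* Process the source in cache-sized chunks so each chunk stays hot
 * while every output row consumes it. */
void encode_chunked(uint8_t **dst, const uint8_t *src, size_t bytes,
                    const uint16_t *coeff, int rows)
{
    const size_t chunk = 32 * 1024;   /* tune to fit L1/L2 */
    for (size_t off = 0; off < bytes; off += chunk) {
        size_t len = (bytes - off < chunk) ? bytes - off : chunk;
        for (int r = 0; r < rows; r++)
            region_mul_xor(dst[r] + off, src + off, coeff[r], len);
    }
}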
By the way, there's a slightly faster way to interleave the bytes
// (untested, but it's something like this)
// The shuffle gathers each input's low bytes into the bottom 64 bits and its
// high bytes into the top 64 bits; the unpacks then merge the two vectors
// into one all-low-bytes vector and one all-high-bytes vector.
v_0 = _mm_shuffle_epi8(input_0, _mm_set_epi32(0x0f0d0b09, 0x07050301, 0x0e0c0a08, 0x06040200));
v_1 = _mm_shuffle_epi8(input_1, _mm_set_epi32(0x0f0d0b09, 0x07050301, 0x0e0c0a08, 0x06040200));
input_l = _mm_unpacklo_epi64(v_0, v_1);
input_h = _mm_unpackhi_epi64(v_0, v_1);
If possible though, it's better to do the de/interleaving outside of the main processing loop - this is mostly useful if you can re-use the inputs.
Thanks for your response.
How did you install/compile GF-Complete? I noticed your benchmark always compiles with AVX, which means your 128-bit code will be using VEX-encoded instructions. If GF-Complete was built without -march=native (like if you installed it via your OS' repository), it'll be using SSE-encoded instructions.
I git cloned https://github.com/ceph/gf-complete and probably used autogen.sh. The Makefile didn't include -march=native, so gf-complete was not built with it. OK, I didn't know about VEX-encoded instructions and will test without -mavx2. I'll update the benchmark results soon.
It's entirely possible that this actually does nothing.
Yes, I noticed that right after git push. I'll test with the table creation part inside the loop, too.
I'm not sure how you're using it exactly, but for Reed-Solomon style matrix multiplication, it basically means you want to chunk the iteration to maximise cache usage.
This is a slightly annoying problem for us.... To avoid the overhead of copying data too frequently from the OS kernel's network buffers (mbuf, mbuf cluster), we use pretty large buffers (there are other reasons too, but I forget them). However, we may get better encoding/decoding speeds with small buffers, as you say. We'll test this in the future. Thanks for the important information.
// (untested, but it's something like this)
v_0 = _mm_shuffle_epi8(input_0, _mm_set_epi32(0x0f0d0b09, 0x07050301, 0x0e0c0a08, 0x06040200));
v_1 = _mm_shuffle_epi8(input_1, _mm_set_epi32(0x0f0d0b09, 0x07050301, 0x0e0c0a08, 0x06040200));
input_l = _mm_unpacklo_epi64(v_0, v_1);
input_h = _mm_unpackhi_epi64(v_0, v_1);
Unfortunately, I got a slower speed with gf-bench/multiplication/gf-nishida-region-16's SSE on a Core i7-10710U. The results (elapsed time) are:
(Original) One step lookup by SSE : 56952
(New) One step lookup by SSE : 63579
The above code seems like it should be faster, but I'm not very familiar with SSE and don't know why it isn't.
Anyway, thank you for your advice and information.
I have uploaded new benchmark results with the above changes.
Well, there was no big difference with gf-nishida-region-16-SSE, and I saw a slowdown with gf-nishida-region-16-2 (one-step table lookup), which is reasonable because the table has 65536 entries.
Thanks.
To avoid the overhead of copying data too frequently from the OS kernel's network buffers (mbuf, mbuf cluster), we use pretty large buffers (there are other reasons too, but I forget them). However, we may get better encoding/decoding speeds with small buffers, as you say. We'll test this in the future.
Yes, continue using large buffers to reduce context switching overheads and similar.
Note that the receive buffer does not have to be the same size as the processing buffer (in ParPar, the default read size is 4MB, but it often gets processed using ~32KB chunks).
Unfortunately, I got a slower speed with gf-bench/multiplication/gf-nishida-region-16's SSE on a Core i7-10710U.
Actually that's probably right. I recall that Skylake/Haswell are port 5 bound on this problem, so putting more ops onto port 5 likely slows things down, even though you've reduced the number of instructions. The effect may be different on a CPU with more shuffle capacity (e.g. AMD Zen).
Well, there was no big difference with gf-nishida-region-16-SSE
That's interesting - my benchmarks showed a gain with AVX over SSSE3 on Skylake, so it's surprising that you see none. Are you sure you're running the -sse version?
I'm guessing you forgot to commit some file, as your code doesn't compile.
Thank you for your response.
I'm guessing you forgot to commit some file, as your code doesn't compile.
Yes, you are right. I forgot to git add gf-bench/*/gf-nishida-region-16/gf-bench-sse.c. They are now available, so please try them.
Running:
cd gf-bench/multiplication/gf-nishida-region-16
make bench
will output the results. Let me know if you find something weird.
Thanks for updating that.
Unfortunately, I don't have access to a Skylake CPU at the moment, but testing on a Zen1 VM (not the most reliable) I get:
Two step table lookup : 871197
One step table lookup : 437740
One step lookup /w small tbls: 559330
One step lookup by AVX : 128992
One step lookup by SSE : 195634
gf-complete-region-16 : 169936
Or via bench-all:
Multiplication
Algorithm, Speed (MB/s)
gf-complete-region-16, 2609.105984
gf-nishida-region-16-1, 550.524138
gf-nishida-region-16-2, 1094.335106
gf-nishida-region-16-3, 978.054893
gf-nishida-region-16-4-SSE, 2582.543018
gf-nishida-region-16-4-AVX, 3316.981650
Looks like your tests are fine - I even checked the assembly (compiled with GCC 7) and they look similar (listing omitted).
So I'm not sure why you're getting such a difference - maybe it's just like that on your CPU, or I'm doing something wrong.
Oh well.
I use FreeBSD 13 and tested with both clang-13 and gcc-10, but I still get no difference with gf-nishida-region-16-4-SSE. I also tested with an i5-6600K and the result was almost the same (elapsed time):
One step lookup by SSE: 59632 (No difference from -march=native)
gf-complete-region-16: 62218 (No difference from -march=native for libgf_complete)
even with libgf_complete built with the -march=avx2 option. Well, I don't know why, but there's a reason somewhere.
Thanks.
I also tested with an i5-6600K and the result was almost the same
i7-10710U and i5-6600K are both Intel Skylake CPUs, so they should demonstrate the same behaviour.
On the above Zen1 system:
-mssse3
One step lookup by SSE : 197322
-mssse3 -mavx
One step lookup by SSE : 186569
GF-Complete (default)
162982
GF-Complete with `./configure --enable-avx`
139320
Maybe AVX doesn't show any gain on Skylake for this test, though I don't know where the difference comes from.
OK, thanks. I tried with i7-9700k + FreeBSD 13 and Ryzen 5800X + Ubuntu 20.04 and got the following results:
i7-9700k + FreeBSD 13
-mssse3
One step lookup by SSE : 42002
-mssse3 -mavx
One step lookup by SSE : 41997
Ryzen 5800X + Ubuntu 20.04
-mssse3
One step lookup by SSE : 30212
-mssse3 -mavx
One step lookup by SSE : 28891
Ryzen 5800X + Ubuntu 20.04 shows a slight speedup, but i7-9700k + FreeBSD 13 does not. This may be due to OS-specific default make options, but I'm not sure. I'm not familiar with assembly, but the binaries with/without -mavx are different. I could test on another Linux machine, but I need to get permission to log in.
Intel 6600K, 9700K and 10710U are all Skylake (almost all Intel Core 6000 - 10999 CPUs are Skylake), so you're essentially just re-testing the same config. The Ryzen 5800X is a different uArch though (Zen3).
The OS shouldn't affect the speed in theory (unless it's doing something weird with AVX). I don't have any FreeBSD setups to test though.
-mavx makes all SSE instructions use VEX encoding, which helps eliminate a few move instructions, so it's slightly more efficient.
Okay. I thought Coffee/Comet Lake were different, but I see they are all based on the Skylake architecture. There's one more thing to share. I changed the buffer size from 5120000 to 4194304 and increased the repeat count from 100 to 500, as you mentioned that 5MB starts limiting the AVX2 performance. Though I don't know if it is related to the performance (and it also affects AVX), I got different results with the 10710U. The following are typical results:
SSE : 230936
SSE + AVX: 206284
On average, SSE + AVX seems approximately 10% faster.
Thanks.
That's interesting, thanks for sharing that result.
Unfortunately it doesn't explain the difference with GF-Complete, but it's at least in line with what I've generally found when enabling AVX!
This is just a thank-you message to you, animetosho. I was able to write a new research paper using the technique you told me about, and I presented the paper last week. It is about a new communication algorithm and won the best paper award in two categories at UEMCON 2023 (https://ieee-uemcon.org/). It is available at https://www.researchgate.net/publication/374776143_Secure_Fast_and_Loss-Tolerant_Communication_with_Hill_Cipher_and_Network_Coding - please take a look!
Thank you.
PS The paper says "We developed a new technique for GF......" but 'we' means mostly you :)
Congratulations!
The core idea of the shuffle-based method came from here - it's not something I came up with (though I did do a bunch of investigation into other techniques) :)
OK, GF-Complete was a great library after all. Thanks anyway!
Sorry, I have a question. It's my fault that I haven't read the paper carefully, but as long as I use the 4-bit table lookup technique, it's in fact the same technique as GF-Complete's, so I guess it has the same potential patent infringement, doesn't it? I found the patent https://patents.google.com/patent/US8683296 and it mentions the 4-bit nibble PSHUFB approach, which seems to be the same as gf-nishida-16's. What do you think?
I can't advise you on patents, but I imagine it's likely invalid due to prior art.
Though I would point out what the claims actually cover:
The system of claim 1, wherein the processing core comprises 16 data registers, each of the data registers comprises 16 bytes; and the parallel multiplier is configured to process the data in units of at least 64 bytes spread over at least four of the data registers at a time.
The system of claim 19, wherein the parallel multiplier comprises two lookup tables for doing concurrent multiplication of 4-bit quantities across 16 byte-sized entries using the PSHUFB (Packed Shuffle Bytes) instruction.
The way I interpret this is that the patent doesn't apply if your implementation doesn't match those specifics (for example, if it doesn't process at least 64 bytes spread over at least four 16-byte registers at a time).
Also worth pointing out that Intel implements the technique in their isa-l library, and it doesn't look like they have been granted a license.
Again, I can't advise you here, so you'll need to seek someone qualified on the matter.
Thank you very much. I really appreciate your opinion. I'll contact a professional. Thank you again.