mischasan / sse2

Unusual uses of SSE2 registers

AVX #3

Open · markNZed opened this issue 7 years ago

markNZed commented 7 years ago

Hi,

I found your work while searching for a bit-matrix transpose using SIMD. It seems very close to what we need. AVX is becoming more popular; does that function need to be modified to leverage AVX instructions?

mischasan commented 7 years ago

Sure. I've been moving my code to AVX2 (not the bmx proc) proprietarily. I won't make an upgraded bmx proprietary. But who is "we"?

markNZed commented 7 years ago

If I understood you correctly, you have moved some of the code base to AVX2 but are not planning to publish that source code, though you may make an AVX version of the bmx procedure available to the public. Is that right?

"We" is me and a dev whom I've asked to help because he has some SIMD experience. We have been using boost.simd to do some benchmarking.

The GPL could be a problem, as I want to develop a commercial application (for engineering). There is no problem sharing changes we might make to the bmx proc, but the GPL would require releasing all the code it is linked with, and that is problematic.

The app would run on industry server farms, so managing different SIMD implementations/generations is an issue. We were thinking of using gcc intrinsics for this; one idea would be to map bmx to intrinsics. Was there a reason you did not want to use intrinsics?

mischasan commented 7 years ago

That's correct. The reason I post everything under the GPL at first is my curiosity about who is using it, and for what kinds of applications. If LGPL works for you, that's fine by me.

I've switched my own practice to testing cpuid on the fly and using alternate code paths for SSE2 and AVX2. If you compile with gcc, you may notice that some versions do not support SSE2 at all when you compile for 32-bit processors. The code uses the gcc intrinsics either way. I haven't seen any other vector op sets (AMD 3DNow!, ARM NEON) worth supporting. Do you (or the dev) have any perspective on that?
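
For reference, a minimal sketch of that kind of on-the-fly dispatch using gcc's cpuid builtins; the kernel names below are hypothetical stand-ins, not this repo's actual API:

    #include <stdio.h>

    /* Hypothetical stand-ins for the real SSE2/AVX2 kernels. */
    static void bmx_sse2(void) { puts("using the SSE2 path"); }
    static void bmx_avx2(void) { puts("using the AVX2 path"); }

    int main(void)
    {
        void (*bmx)(void);

        __builtin_cpu_init();                 /* gcc >= 4.8; explicit init is harmless */
        if (__builtin_cpu_supports("avx2"))
            bmx = bmx_avx2;
        else if (__builtin_cpu_supports("sse2"))
            bmx = bmx_sse2;
        else
            return 1;                         /* e.g. an old 32-bit CPU without SSE2 */

        bmx();                                /* later calls all go through the chosen path */
        return 0;
    }

In a real build, the AVX2 kernel would live in its own translation unit compiled with -mavx2 (or carry __attribute__((target("avx2")))) so the rest of the code stays SSE2-safe.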

markNZed commented 7 years ago

Only targeting x86 at this stage.

I tried compiling on Ubuntu 16.04:

cc -g -MMD -fPIC -pthread -fdiagnostics-show-option -fno-strict-aliasing -fstack-protector --param ssp-buffer-size=4 -Wall -Werror -Wextra -Wcast-align -Wcast-qual -Wformat=2 -Wformat-security -Wmissing-prototypes -Wnested-externs -Wpointer-arith -Wshadow -Wstrict-prototypes -Wunused -Wwrite-strings -Wno-attributes -Wno-cast-qual -Wno-error -Wno-unknown-pragmas -Wno-unused-parameter -O3 -I/usr/local/include -D_FORTIFY_SOURCE=2 -D_GNU_SOURCE -I. -c -o sseutil.o sseutil.c
sseutil.c:1:18: fatal error: plat.h: No such file or directory

Is that file missing from the repo?

mischasan commented 7 years ago

Oh carp. Yes. Sigh. Here: this is faster than my updating github (srsly)

mischasan commented 7 years ago

The file is in my util/ repo as well.

markNZed commented 7 years ago

Also missing msutil.h and sock.h

I ran make, then make test, which gives:

make test
cc   -pthread  -L/usr/local/lib        ssebmx_t.o libsse.a tap.o bitmat.o     -lstdc++  -lm    -o ssebmx_t
bitmat.o: In function `bitmat_trans':
/home/propacov/shared/proto/i8051-07/src/tests/primitives/sse2/bitmat.c:80: undefined reference to `ssebmx'
/home/propacov/shared/proto/i8051-07/src/tests/primitives/sse2/bitmat.c:80: undefined reference to `ssebmx_m'
collect2: error: ld returned 1 exit status
<builtin>: recipe for target 'ssebmx_t' failed
make: *** [ssebmx_t] Error 1

mischasan commented 7 years ago

My apologies for leaving it in that state. If you pull my util repo, it has all the files required. I'm currently in an odd position, having to recover my git remote state and switch interfaces. I had not really expected anyone to use that project in a while.

markNZed commented 7 years ago

No problem, it is worth the effort if we can use the code. I resolved the missing files (downloaded the 3 headers from your utils package), but ran into the compile error reported in my previous message. Can you get the bmx test running? The GNUMakefile and rules are new to me, so it's not so easy to quickly understand where the issue is. Thanks.

mischasan commented 7 years ago

Sure. I'm going to be in the air for most of today. Pardon, but what tz are you in? And does your server farm include AVX512 boxes?

mischasan commented 7 years ago

This is what I can do off my notebook. Passes ssebmx unit tests on my side.

mischasan commented 7 years ago

And here's an update with AVX2 (mm256) support. And I'm happy to convert to Apache license if you'll satisfy my curiosity --- if that can be worded in a way that doesn't impinge on any competitive secret.

markNZed commented 7 years ago

Hi,

I don't see updates to the repo; are you using attachments with these messages? I don't think I can access those.

I'm in France. I imagine the users of our software will have AVX512 boxes, but I don't have a server farm myself; I plan to do testing on cloud infrastructure, e.g. AWS.

markNZed commented 7 years ago

With 256- or 512-bit registers, does the optimal size of the bit matrix for transposition change?

mischasan commented 7 years ago

Yes, they were zip attachments. When I get back I'll update GitHub (I need my ssh key/cert).

mischasan commented 7 years ago

No change. It uses 256-bit registers for as much as fits, and falls through to 128-bit for what doesn't.

markNZed commented 7 years ago

If you like, you could upload to https://expirebox.com/ since it is very simple: no login, and it provides a link to the file (which gets deleted after 48 hours).

markNZed commented 7 years ago

For bmx, does AVX provide improved instructions, or is the only benefit the larger registers?

mischasan commented 7 years ago

Sure: https://expirebox.com/download/791aa29d46fa7dda158d8b6f52893ea3.html The cpuid check broke on one other, older PC I had access to last night; other than that, ssebmx_t.pass speaks for itself. Lucky you, in France. Paris, Menton and St Remy de Provence are some of my favourite places to be.

mischasan commented 7 years ago

No improved instructions for this particular app ... and the core op (movemask) is not implemented for AVX512.
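
For context, the movemask trick at the heart of the SSE2 transpose looks roughly like this; it is a sketch of the general technique, not necessarily the exact ssebmx code, and which bit maps to which output row is a convention choice:

    #include <emmintrin.h>   /* SSE2 */
    #include <stdint.h>

    /* Transpose one 16x8 bit block: in[] holds 16 bytes (16 rows of 8 bits),
     * out[] receives 8 rows of 16 bits. Each _mm_movemask_epi8 gathers the
     * current top bit of all 16 bytes into one 16-bit output row. */
    static void bit_transpose_16x8(const uint8_t in[16], uint16_t out[8])
    {
        __m128i x = _mm_loadu_si128((const __m128i *)in);
        for (int b = 7; b >= 0; --b) {
            out[b] = (uint16_t)_mm_movemask_epi8(x);  /* bit b of every input byte */
            x = _mm_slli_epi64(x, 1);  /* shift each byte's bits up by one; the
                                          carry-in from the neighbouring byte
                                          never reaches the MSB within these
                                          eight iterations */
        }
    }

The AVX2 path can do the same 32 bytes at a time via _mm256_movemask_epi8.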

markNZed commented 7 years ago

I have a hard time understanding why CPUs don't provide native support for a bitwise transpose; it seems such a fundamental building block. Do you see why that hasn't happened?

The zip ran fine on my machine; I only tried ssebmx_t (I'm using an Intel Core i5 laptop). Thanks!

Have you tried benchmarking clang against gcc? I was surprised to see how much better clang-3.8 was than gcc-6.2 on some auto-vectorization test cases; it seemed to make better use of the ymm/xmm registers.

Yeah, lucky to be in France; so much comes down to luck...

mischasan commented 7 years ago

Haha, I am stealing time just to type, let alone perftest.

This ssebmx doesn't use multiple registers; I expect it does no better than what gcc 4.4 does unrolling trivial loops. It could be modified to use multiple registers to make better use of cache lines. That's not through auto-vectorization, though.

AVX (opinion) is part of Intel's war with AMD --- that's why SSE3+ and AVX+ are such a messy, unorthogonal architecture. AMD lost, so now Intel has gone back and improved REP MOVSB et al., which is what most people needed.

If I were re-implementing APL :-) I'd think about AVX2 more. It might also help on table-driven charset conversion. I stuck to SSE2 because it was pretty much guaranteed everywhere.

Well, have fun. My home is Vancouver (Canada); it's good, even if it's not France (or Germany). Are you French?

markNZed commented 7 years ago

No, I'm a New Zealander; a lot of luck there, too!

markNZed commented 7 years ago

This is a bit of a diverging thread, but I hesitate to create new issues for questions. The bmx is 16x8, and I am wondering about targeting a size of 256 x W (where W is typically less than 512). Are there changes to the algorithm that could match up with the initial row count of 256 and improve performance? Or is it best to just break that up into 16x8 chunks? Thanks.

mischasan commented 7 years ago

Short answer: doesn't help SSE2, probably won't help AVX2.

I did some SSE2-only timing a couple of years ago, aiming at using the same input cache line (64 bytes) immediately in the "gather" (INP) loops. There was a factor of 1.5...2 improvement for the [8x16] becoming [8x64], but it only applied for up to [8x512] arrays (a special case; someone was interested in that). At that point, fetch from RAM (not cache) became the limiting factor. The second loop [8 x ...] is slower than the first one [16 x ...].

I have not tried perftesting anything else discussed. A quick, small test of changing INP() and OUT() to use induction variables, and so avoid IMUL, suggests it's a quick win.
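
Spelling out the induction-variable idea for anyone following along (purely illustrative; this is not the actual INP()/OUT() macro code):

    /* Index arithmetic like dst[r * stride] costs an IMUL per access.
     * An induction variable turns it into a pointer that just steps by
     * 'stride' each iteration, which the compiler keeps in a register. */
    void scatter_column(unsigned char *dst, const unsigned char *src,
                        int nrows, int stride)
    {
        unsigned char *p = dst;          /* induction variable */
        for (int r = 0; r < nrows; ++r) {
            *p = src[r];                 /* stand-in for the real per-row store */
            p += stride;                 /* replaces dst[r * stride]: an add, no IMUL */
        }
    }

A compiler will often perform this strength reduction by itself on simple loops, but not always through macro-heavy index expressions.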

I'm occupied by a large customer; I will be happy to rethink this in two weeks. You haven't mentioned what the application for this is (even in general terms), so I assume you won't.

markNZed commented 7 years ago

Nice idea with INP and OUT. I would hope that the hardware could prefetch, but in any case memory will be the bottleneck. It is premature to optimise now. It will be late next week before I can do profiling, and the current bmx may be plenty.

The application is analysing decompressed trace files from digital circuit simulation. One dimension of the matrix is time/cycles and the other dimension is inputs. The matrix can be quite big (e.g. GBs).

mischasan commented 7 years ago

Thanks; and that's all I wanted to know. Best of luck to you (folks) on that. Cache-line caching does a lot. For transpose, the access pattern is too hard for prefetch to spot; and if you widen the contiguous access on the gather (INP) side, you create sparser action on the scatter side. I'll switch to induction indexes for INP and OUT as soon as I get a chance to exhale.

markNZed commented 7 years ago

Could __builtin_prefetch be a big help with that? If the gather/scatter work on a block that fits in L1...

I should probably mention that we are looking to transpose blocks (kBs), not the entire matrix (potentially GBs), so the scatter can be limited.

mischasan commented 7 years ago

Unfortunately not. I tested prefetch heavily for a version of memcpy using SSE2; it is a minor improvement when there is a single output target cache line. bmx does scatter output. Always happy to be proven wrong.

mischasan commented 7 years ago

Okay, here's the final cut (from my side). It has no IMULs. It uses AVX2 if that is defined at compile-time. A run-time test for CPUID is cheap; I'm afraid I have to move on and won't be doing that.

To complete that previous comment about prefetch: it has a limited use for prefetching target memory, prior to updating bytes in a new cache line. Some CPUs appear to have a limited queue for prefetches; if you do it too often, performance starts to degrade to below having no prefetch at all.
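
As a concrete, hedged illustration of that limited use: issue one write-prefetch for the next output cache line and no more, since over-issuing can fill the prefetch queue.

    #include <stddef.h>

    /* Sketch only: before filling each 64-byte output line, prefetch the
     * line after it. __builtin_prefetch(addr, 1, 1) means "prefetch for
     * write, low temporal locality". Whether this helps at all is
     * CPU-dependent, per the comment above. */
    static void fill_lines(unsigned char *dst, size_t nlines)
    {
        for (size_t i = 0; i < nlines; ++i) {
            if (i + 1 < nlines)
                __builtin_prefetch(dst + (i + 1) * 64, 1, 1);
            for (int b = 0; b < 64; ++b)
                dst[i * 64 + b] = (unsigned char)b;   /* stand-in for the real scatter stores */
        }
    }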

markNZed commented 7 years ago

Hi, thanks! Can you please upload it to GitHub or to https://expirebox.com ?

mischasan commented 7 years ago

Right: https://expirebox.com/download/a943062e34c58f520bef1902227f161a.html

markNZed commented 7 years ago

Hi, we ran some benchmarking and got slightly better results with code based on http://stackoverflow.com/questions/41778362/how-to-efficiently-transpose-a-2d-bit-matrix targeting a 64x64 matrix. It was surprising: 940.423 MB/s vs 747.659 MB/s, and AVX2 was actually slower at 400.961 MB/s. Thanks for your support!
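
For readers who land here later: the 64x64 approach in that Stack Overflow thread is essentially the Hacker's Delight block-swap transpose, along the lines of the sketch below (not necessarily the exact code that was benchmarked):

    #include <stdint.h>

    /* Transpose a 64x64 bit matrix in place: a[i] is row i, one bit per column.
     * Swaps the off-diagonal 32x32 blocks, then 16x16, ... down to 1x1,
     * using shift/xor/mask on whole 64-bit rows; no SIMD required. */
    void transpose64(uint64_t a[64])
    {
        uint64_t m = 0x00000000FFFFFFFFULL;
        for (int j = 32; j != 0; j >>= 1, m ^= m << j) {
            for (int k = 0; k < 64; k = (k + j + 1) & ~j) {
                uint64_t t = (a[k] ^ (a[k + j] >> j)) & m;
                a[k]     ^= t;
                a[k + j] ^= t << j;
            }
        }
    }

A 64x64 block stays entirely in registers and L1, which may be part of why this plain-C version kept up with the wider-vector code in the numbers above.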

mischasan commented 7 years ago

Terrific! Non-hardware-specific is always preferable. Good luck with your application of it.
