google / gemmlowp

Low-precision matrix multiplication
Apache License 2.0

Avx2optimizations #122

Closed rrwinterton closed 6 years ago

rrwinterton commented 6 years ago

Added AVX2 Optimizations

bjacob commented 6 years ago

This looks great overall. Some comments:

Can you remove the Makefile.rrw at top-level? If you have an interest in having that checked-in, we can discuss what sub-directory would make sense... I guess contrib/ .

Can you edit test/test_fixedpoint.cc, at the bottom, to ensure that your new fixedpoint_avx.h code is exercised? See the bottom of this file, with the paths for GEMMLOWP_SSE, GEMMLOWP_NEON etc. You want to RunTests<__m256i>, or whatever types you specialized the fixedpoint templates for. This is new code, by the way, which explains why you haven't seen it. Another new development here is the support for 16-bit fixed-point arithmetic; it used to be 32-bit-only. No need to add 16-bit fixed-point in this pull request, that can wait, but FYI we are seriously thinking about relying on it for LSTM models.
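For illustration, here is a hedged sketch of the kind of addition this suggests, following the pattern of the existing GEMMLOWP_SSE / GEMMLOWP_NEON blocks at the bottom of test/test_fixedpoint.cc (the guard macro and the exact form of the RunTests call are assumptions, not copied from the file):

// Hypothetical addition at the bottom of test/test_fixedpoint.cc, mirroring
// the existing GEMMLOWP_SSE / GEMMLOWP_NEON sections; names are assumed.
#ifdef GEMMLOWP_AVX2
  // Exercise the fixedpoint_avx.h specializations on 8 x int32 AVX2 lanes.
  RunTests<__m256i>("AVX2 path (__m256i)");
#endif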

rrwinterton commented 6 years ago

Yes, I can remove the Makefile.rrw at the top level. I just wanted to make it easy to compare SSE and AVX2 in the builds using the old benchmarks. Yes, we will test the fixedpoint code; it isn't being called in the drop right now, but it has been tested. We were thinking about promoting the output to AVX2 as well in a new pull request, but will clean up the fixedpoint first. We will also do a 16-bit fixed-point version if you like and submit another pull request. Thanks for the comments.

bjacob commented 6 years ago

(The below is FYI, no action required here)

Side note: there is a difficult issue about ODR violations that we will need to resolve as soon as possible. I don't want it to block this pull request, and I'd rather handle it myself, but FYI I will have no choice but to resolve it before we can pull this into Google's production copies of gemmlowp.

The issue is that as we #ifdef code based on microarchitecture (AVX vs SSE, etc), since this is all inline code (in headers), if a user links together two object files containing the same gemmlowp symbols, one with AVX and the other without, then this is effectively a violation of the one-definition-rule (ODR). https://en.wikipedia.org/wiki/One_Definition_Rule
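A minimal, hypothetical illustration of the hazard (the file names and the function are made up, but the mechanism is the one described above):

// common.h -- header-only, with a micro-architecture #ifdef inside an inline function
inline int kernel() {
#ifdef __AVX2__
  return 2;  // AVX2 body
#else
  return 1;  // baseline body
#endif
}

// a.cc -- compiled with -mavx2:    calls kernel() and expects the AVX2 body.
// b.cc -- compiled without -mavx2: calls kernel() and expects the baseline body.
//
// Both object files emit a definition of the same symbol kernel; the linker
// keeps one of them arbitrarily, so one caller silently gets the other
// micro-architecture's code. That is the ODR violation, and it can manifest
// as an illegal-instruction crash on machines without AVX2.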

This is a very hard problem that we've hit in all header-only libraries with micro-architecture-dependent paths. We already hit it in gemmlowp with the scalar vs. SSE4 paths. What made it not matter in the end, so far, was that we have since entirely switched to requiring at least SSE4. Likewise on ARM, we already require NEON.

With AVX2, we won't be able to require it everywhere, so we will have AVX2 and SSE4 gemmlowp symbols coexisting --- and again, under the /same/ symbol name, whence the ODR issues.

There have been countless discussions internally about how to handle this problem (it's a lot more general than gemmlowp --- the same issue in Eigen has been a plague).

Ultimately, it's a mistake in gemmlowp's design that the same symbols may have different implementations. Ideally, different micro-architecture code paths should correspond to different symbols. That would mean it becomes the responsibility of the application (non-inline) code instantiating gemmlowp inline code to choose between micro-architecture code paths. The critical point at which this needs to happen is where a non-inline function calls gemmlowp inline functions.

It's too late to fix this for existing micro-architecture paths (SSE4, NEON) without regressing existing users, and again the issue so far has been mild because SSE4 and NEON are near-ubiquitous in their respective architectures.

But for AVX2, we need a more proper solution.

What I have in mind is allowing the user to #define GEMMLOWP_ENABLE_AVX2, which would result in a different namespace being declared by gemmlowp: instead of "namespace gemmlowp" it would be "namespace gemmlowp_AVX2". User code will need to adjust accordingly, explicitly.

In the past a variant of this technique has often been employed: in order to make code automatically benefit from the fast path, people did: #define gemmlowp gemmlowp_FASTPATH (where FASTPATH might be AVX2 here). This tended to fix the actual ODR violation in practice, but is brittle: it would allow a user to inadvertently do this in their own inline code calling gemmlowp, only moving the problem down the inline call chain. That's why we currently think it better to require explicit choice of code path by the gemmlowp caller.
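As a sketch, here is what the explicit, namespace-based opt-in could look like (the macro, the alternate namespace name, and the entry-point calls shown are illustrative of the idea, not existing gemmlowp behavior):

// Illustrative only: the namespace itself encodes the code path, so AVX2 and
// baseline instantiations can never collide under the same symbol name.
#ifdef GEMMLOWP_ENABLE_AVX2
namespace gemmlowp_AVX2 {
#else
namespace gemmlowp {
#endif
// ... all of gemmlowp's inline code ...
}  // namespace

// The application then chooses explicitly, at a non-inline boundary:
//   gemmlowp_AVX2::GemmWithOutputPipeline(...);  // explicitly opted in to AVX2
//   gemmlowp::GemmWithOutputPipeline(...);       // baseline path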

rrwinterton commented 6 years ago

It looks like the decision is to move the AVX2 optimizations behind automatic detection of CPU capability, with a fast-path dispatch, instead of a specific ISA compilation and selection. If this is the case, do you want me to make those changes in a branch of gemmlowp and make a new pull request?

bjacob commented 6 years ago

No, I don't want to make CPU capability auto-detection, and dispatching to a specific code path, part of gemmlowp's scope. That is best done in non-inline code, and gemmlowp is header-only, all inline code. So I see the scope of gemmlowp as strictly providing these code paths, not dispatching between them.

My above comments can be summarized by saying that unfortunately, at the moment, because of how gemmlowp is structured, having multiple code paths compiled may lead to ODR violations (in practice: crashes).

It is not very easy to fix this, and that would require some breaking change to gemmlowp's API.

So for now, I think the best route to taking your code is if you made it explicitly opt-in:

instead of having a #ifdef just detecting if AVX2 extensions are enabled,

#ifdef __AVX2__

have that #ifdef also check that the user explicitly opted in to compiling AVX2 code.

#if defined(__AVX2__) && defined(GEMMLOWP_ENABLE_AVX2)

That will help prevent a situation where users inadvertently mix AVX2 code into non-AVX2 code and get ODR violations. Please write a comment to that effect so that people understand the risks of GEMMLOWP_ENABLE_AVX2.
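Concretely, the opt-in guard plus a warning comment might look roughly like this (the internal GEMMLOWP_AVX2 macro here is only an assumption about how the platform-detection code could be wired up):

// WARNING: only define GEMMLOWP_ENABLE_AVX2 if every translation unit that
// instantiates gemmlowp code is compiled the same way. Mixing AVX2 and
// non-AVX2 instantiations of the same gemmlowp inline code is an ODR
// violation and can crash at runtime.
#if defined(__AVX2__) && defined(GEMMLOWP_ENABLE_AVX2)
#define GEMMLOWP_AVX2
#endif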

rrwinterton commented 6 years ago

Added user compiler options to select the IA SIMD optimizations at compile time. The results are shown below:

// compiler define for AVX2: -D GEMMLOWP_ENABLE_AVX2
// compiler define for SSE4: -D GEMMLOWP_ENABLE_SSE4
// no option for scalar

Test Run                                   Scalar    SSE4     AVX2
Graph Latency (ms)                             32      12        9
GoogLeNet Latency (ms)                         26      96       68
Multi-Thread 10x10x10 (Gflops/s)             1.64   2.961    2.334
Multi-Thread 1000x1000x1000 (Gflops/s)      29.49   75.85    107.8
Single-Thread 10x10x10 (Gflops/s)            1.63   2.975    2.337
Single-Thread 1000x1000x1000 (Gflops/s)     13.36   38.21    56.31

bjacob commented 6 years ago

Thanks! Just one nit: let's not change the behavior for current users who are enjoying SSE4 --- let's keep SSE4 auto-enabled without requiring GEMMLOWP_ENABLE_SSE4. The ODR issues are already a problem there, but it would be another problem to deal with hordes of angry users whom we took SSE4 acceleration away from.

Please just keep GEMMLOWP_ENABLE_AVX2.

rrwinterton commented 6 years ago

Makes sense. I was just trying to be consistent, but you are right. Removed the SSE4 compiler option; GEMMLOWP_ENABLE_AVX2 is now the only compiler option in the change.

rrwinterton commented 6 years ago

The Makefile should be removed now, and the -D compiler directive is checked in, so I think we are good with this. I must not have gotten the push and rm to complete last time. Let me know if you would like me to help. Thanks,

bjacob commented 6 years ago

Thanks! This looks good, just needs rebasing to the current state of the master branch --- I could 'rebase and merge' myself, but it's best if you do the rebasing yourself, so you can double check that your rebased branch looks like what you expect.

This should be roughly a matter of:

git checkout master
git pull origin master         # get the latest from the master branch
git checkout avx2optimizations
git rebase master        # rebase against master
git push....        # update this pull request.