marian-nmt / marian-dev

Fast Neural Machine Translation in C++ - development repository
https://marian-nmt.github.io

8-bit matrix product #249

Open emjotde opened 6 years ago

emjotde commented 6 years ago

Merge 8-bit support cleanly into current master. I admit I am lost in that code.

kpu commented 6 years ago

I've cleaned it up! 16-bit and 8-bit on SSSE3, AVX2, and AVX512BW.

https://github.com/kpu/intgemm

kpu commented 6 years ago

@emjotde I'm trying to merge but running into a B matrix shaped 512x735. The number of columns is not a multiple of 8.

It's coming from shortlisting: https://github.com/marian-nmt/marian-dev/blob/master/src/layers/generic.h#L153

I also think the shortlist is generating new B matrices on the fly, defeating memoization?

emjotde commented 6 years ago

Ah yes, I think a half-solution would be to make the shortlist a multiple of 8, just add items until that happens.

Also good point with the memoization. The short-listed and transposed matrix is being created only once (there is a reference to it so it survives across decoding steps), but the quantized version will indeed be recomputed.

Ignore this part for now, I want to introduce sentence-level memoization soon, currently we only have graph-life memoization.
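The "add items until the shortlist is a multiple of 8" idea can be sketched as follows. This is a minimal illustration, not Marian's shortlist code: `PadShortlist`, its parameters, and the choice of filler indices (the lowest unused vocabulary ids) are all assumptions, and the appended ids are not kept in sorted order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Marian): appends filler word indices until
// the shortlist length is a multiple of `multiple`. Filler indices are taken
// from 0 upward, skipping indices that are already in the shortlist.
std::vector<std::size_t> PadShortlist(std::vector<std::size_t> indices,
                                      std::size_t vocab_size,
                                      std::size_t multiple = 8) {
  std::vector<bool> used(vocab_size, false);
  for (std::size_t i : indices) {
    assert(i < vocab_size);
    used[i] = true;
  }
  for (std::size_t candidate = 0;
       indices.size() % multiple != 0 && candidate < vocab_size; ++candidate) {
    if (!used[candidate]) {
      indices.push_back(candidate);
      used[candidate] = true;
    }
  }
  return indices;
}
```

With this sketch a 735-entry shortlist would grow to 736 entries, so the B matrix would have a column count the 8-wide kernels accept.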

kpu commented 6 years ago

I've pushed most of an integration to 85ad45efad278e4337c4919fe1a7cf0544b678a3. Quantization has been split into PrepareA and PrepareB (which includes the "transpose" but actually turns it into something complicated). Can you make the shortlist a multiple of 8?

kpu commented 6 years ago

Just disabling shortlisting for now in my tests...

kpu commented 6 years ago

36d188e6f0dee36e35553f2f1dcf51d32dcecb5e works when shortlisting is disabled with int16.

WNMT run.cpu.sh on dagr, which has AVX2, with shortlists disabled (BLEU):

master  25.59
intgemm 25.64

Insignificant BLEU improvement is good (there are some subtle differences in how rounding 0.5 is defined).

int8 is there, just need to:

  1. Make a command line option perhaps including the quantization multiplier.
  2. Change int16:: to int8:: where it occurs in https://github.com/marian-nmt/marian-dev/blob/intgemm/src/graph/expression_operators.cpp
  3. Round shortlist to a multiple of 8.

kpu commented 6 years ago

Dynamic quantization works on an unclipped model, namely /fs/hoenir0/heafield/wnmt/cpu/wnmt/model/model.npz which @emjotde had placed on the CPU Amazon machine under ~/wnmt/model/model.npz .

8-bit, no shortlists, dynamic quantization to 127.0f / max(|value|): 25.93

16-bit (not a useful scenario here, just testing), no shortlists, dynamic quantization to 127.0f / max(|value|): 25.94

As to speed, I'm just using std::minmax_element. Haven't done anything good.
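A scalar sketch of the dynamic quantization described above, assuming the multiplier is 127.0f / max(|value|) and that std::minmax_element is used to find the extremes. The function names, the round-to-nearest step, and the clamp to [-127, 127] are illustrative assumptions, not intgemm's API.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative only: computes 127.0f / max(|value|) over the whole tensor.
float QuantMultiplier(const std::vector<float> &values) {
  assert(!values.empty());
  auto mm = std::minmax_element(values.begin(), values.end());
  float max_abs = std::max(std::fabs(*mm.first), std::fabs(*mm.second));
  // An all-zero tensor would divide by zero; fall back to a multiplier of 1.
  return max_abs == 0.0f ? 1.0f : 127.0f / max_abs;
}

// Scales, rounds to the nearest integer, and saturates to the int8 range.
std::vector<std::int8_t> QuantizeInt8(const std::vector<float> &values,
                                      float multiplier) {
  std::vector<std::int8_t> out(values.size());
  for (std::size_t i = 0; i < values.size(); ++i) {
    float scaled = std::nearbyint(values[i] * multiplier);
    scaled = std::max(-127.0f, std::min(127.0f, scaled));
    out[i] = static_cast<std::int8_t>(scaled);
  }
  return out;
}
```

The pass over the data to find the maximum is the part that costs time when it has to be repeated for every activation matrix.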

emjotde commented 6 years ago

Pushed a change to make shortlist multiple of 8. Not tested, though, as the intgemm branch does not compile for me.

emjotde commented 6 years ago

Is this currently only meant to work on CPU with avx512 support?

c++: error: unrecognized command line option ‘-mavx512f’
c++: error: unrecognized command line option ‘-mavx512bw’
c++: error: unrecognized command line option ‘-mavx512vl’
c++: error: unrecognized command line option ‘-mavx512dq’
make[2]: *** [src/CMakeFiles/marian.dir/tensors/cpu/intgemm/avx512_gemm.cc.o] Error 1
make[2]: *** Waiting for unfinished jobs....
[ 48%] Building CXX object src/CMakeFiles/marian.dir/models/encoder_decoder.cpp.o
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/intgemm.cc: In instantiation of ‘T intgemm::{anonymous}::ChooseCPU(T, T, T, T, T) [with T = void (*)(const float*, short int*, float, int)]’:
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/intgemm.cc:78:219:   required from here
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/intgemm.cc:63:39: error: Parameter to builtin not valid: avx512f
   if (__builtin_cpu_supports("avx512f")) {
                                       ^
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/intgemm.cc: In instantiation of ‘T intgemm::{anonymous}::ChooseCPU(T, T, T, T, T) [with T = void (*)(const float*, short int*, float, int, int)]’:
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/intgemm.cc:79:229:   required from here
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/intgemm.cc:63:39: error: Parameter to builtin not valid: avx512f
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/intgemm.cc: In instantiation of ‘T intgemm::{anonymous}::ChooseCPU(T, T, T, T, T) [with T = void (*)(const short int*, const short int*, float*, float, int, int, int)]’:
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/intgemm.cc:80:255:   required from here
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/intgemm.cc:63:39: error: Parameter to builtin not valid: avx512f
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/intgemm.cc: In instantiation of ‘T intgemm::{anonymous}::ChooseCPU(T, T, T, T, T) [with T = const char*]’:
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/intgemm.cc:81:146:   required from here
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/intgemm.cc:63:39: error: Parameter to builtin not valid: avx512f
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/intgemm.cc: In instantiation of ‘T intgemm::{anonymous}::ChooseCPU(T, T, T, T, T) [with T = void (*)(const float*, signed char*, float, int)]’:
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/intgemm.cc:83:220:   required from here
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/intgemm.cc:63:39: error: Parameter to builtin not valid: avx512f
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/intgemm.cc: In instantiation of ‘T intgemm::{anonymous}::ChooseCPU(T, T, T, T, T) [with T = void (*)(const float*, signed char*, float, int, int)]’:
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/intgemm.cc:84:230:   required from here
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/intgemm.cc:63:39: error: Parameter to builtin not valid: avx512f
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/intgemm.cc: In instantiation of ‘T intgemm::{anonymous}::ChooseCPU(T, T, T, T, T) [with T = void (*)(const signed char*, const signed char*, float*, float, int, int, int)]’:
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/intgemm.cc:85:255:   required from here
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/intgemm.cc:63:39: error: Parameter to builtin not valid: avx512f
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/intgemm.cc: In instantiation of ‘T intgemm::{anonymous}::ChooseCPU(T, T, T, T, T) [with T = intgemm::CPUType]’:
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/intgemm.cc:88:92:   required from here
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/intgemm.cc:63:39: error: Parameter to builtin not valid: avx512f
cc1plus: warning: unrecognized command line option "-Wno-deprecated-gpu-targets" [enabled by default]
cc1plus: warning: unrecognized command line option "-Wno-deprecated-gpu-targets" [enabled by default]
make[2]: *** [src/CMakeFiles/marian.dir/tensors/cpu/intgemm/intgemm.cc.o] Error 1
In file included from /home/marcinjd/marian-dev/src/tensors/cpu/intgemm/ssse3_gemm.cc:3:0:
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/interleave.h:30:1: error: ‘void intgemm::Interleave8(__m256i&, __m256i&)’ conflicts with a previous declaration
 } \
 ^
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/interleave.h:51:1: note: in expansion of macro ‘INTGEMM_INTERLEAVE’
 INTGEMM_INTERLEAVE(__m256i, 256)
 ^
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/interleave.h:26:14: note: previous declaration ‘void intgemm::Interleave8(__m128i&, __m128i&)’
  inline void Interleave8(type &first, type &second) { \
              ^
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/interleave.h:48:1: note: in expansion of macro ‘INTGEMM_INTERLEAVE’
 INTGEMM_INTERLEAVE(__m128i, )
 ^
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/interleave.h:26:14: note: -fabi-version=6 (or =0) avoids this error with a change in mangling
  inline void Interleave8(type &first, type &second) { \
              ^
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/interleave.h:51:1: note: in expansion of macro ‘INTGEMM_INTERLEAVE’
 INTGEMM_INTERLEAVE(__m256i, 256)
 ^
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/interleave.h:35:1: error: ‘void intgemm::Interleave16(__m256i&, __m256i&)’ conflicts with a previous declaration
 } \
 ^
/home/marcinjd/marian-dev/src/tensors/cpu/intgemm/interleave.h:51:1: note: in expansion of macro ‘INTGEMM_INTERLEAVE’
 INTGEMM_INTERLEAVE(__m256i, 256)
 ^

kpu commented 6 years ago

It compiles a fat binary with support for multiple vector lengths. At load time it initializes functions based on CPUID.

In other words, it works on CPUs all the way back to SSSE3, but presumes a compiler that knows AVX512.

Does this program compile for you? If not, it's going to be incredibly annoying that older versions consider the types the same for purposes of function overloading.

#include <immintrin.h>

void Foo(__m256i *a) {}
void Foo(__m128i *a) {}

int main() {}

What version is it anyway?
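The load-time dispatch described in this comment can be sketched roughly as below. This is an assumption-laden illustration: the kernel functions and `ChooseKernel` are hypothetical stand-ins (the build log above only shows that intgemm has a `ChooseCPU` template), and the sketch stops at AVX2 because, as the log shows, older GCC versions reject "avx512f" as an argument to `__builtin_cpu_supports`.

```cpp
#include <cassert>

// Hypothetical kernels: each stands in for a routine compiled for one
// instruction set. They only return a label here.
static const char *KernelGeneric() { return "generic"; }
static const char *KernelSSSE3() { return "ssse3"; }
static const char *KernelAVX2() { return "avx2"; }

using KernelFn = const char *(*)();

// Picks the widest kernel the running CPU supports. The GCC/Clang builtins
// are only used on x86; other targets fall through to the generic kernel.
KernelFn ChooseKernel() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
  __builtin_cpu_init();
  if (__builtin_cpu_supports("avx2")) return KernelAVX2;
  if (__builtin_cpu_supports("ssse3")) return KernelSSSE3;
#endif
  return KernelGeneric;
}

// Resolved once during static initialization, i.e. when the binary loads.
static const KernelFn kKernel = ChooseKernel();
```

Callers then go through `kKernel` and never test CPU features themselves, which is why the binary can run on older CPUs even though the compiler must understand the newest instruction set it was built with.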

emjotde commented 6 years ago

That was 4.8, too old I guess. With 5.4 and 7.2 I am getting this:

[ 38%] Building CXX object src/CMakeFiles/marian.dir/tensors/cpu/intgemm/avx512_gemm.cc.o
/tmp/ccEQeqU4.s: Assembler messages:
/tmp/ccEQeqU4.s:413: Error: no such instruction: `vinsertf32x8 $0x1,0(%r13,%r10),%zmm0,%zmm1'
/tmp/ccEQeqU4.s:414: Error: no such instruction: `vinsertf32x8 $0x1,0(%r13,%r9),%zmm4,%zmm6'
/tmp/ccEQeqU4.s:423: Error: operand type mismatch for `vpackssdw'
/tmp/ccEQeqU4.s:428: Error: no such instruction: `vinsertf32x8 $0x1,0(%r13,%r10),%zmm11,%zmm12'
/tmp/ccEQeqU4.s:429: Error: no such instruction: `vinsertf32x8 $0x1,0(%r13,%r9),%zmm15,%zmm0'
/tmp/ccEQeqU4.s:438: Error: operand type mismatch for `vpackssdw'
/tmp/ccEQeqU4.s:443: Error: no such instruction: `vinsertf32x8 $0x1,0(%r13,%r10),%zmm4,%zmm6'
/tmp/ccEQeqU4.s:444: Error: no such instruction: `vinsertf32x8 $0x1,0(%r13,%r9),%zmm11,%zmm12'
/tmp/ccEQeqU4.s:453: Error: operand type mismatch for `vpackssdw'
/tmp/ccEQeqU4.s:458: Error: no such instruction: `vinsertf32x8 $0x1,0(%r13,%r10),%zmm0,%zmm1'
/tmp/ccEQeqU4.s:459: Error: no such instruction: `vinsertf32x8 $0x1,0(%r13,%r9),%zmm4,%zmm7'
/tmp/ccEQeqU4.s:468: Error: operand type mismatch for `vpackssdw'
/tmp/ccEQeqU4.s:473: Error: no such instruction: `vinsertf32x8 $0x1,0(%r13,%r10),%zmm14,%zmm15'
/tmp/ccEQeqU4.s:474: Error: no such instruction: `vinsertf32x8 $0x1,0(%r13,%r9),%zmm2,%zmm3'
/tmp/ccEQeqU4.s:483: Error: operand type mismatch for `vpackssdw'
/tmp/ccEQeqU4.s:488: Error: no such instruction: `vinsertf32x8 $0x1,0(%r13,%r10),%zmm12,%zmm14'
/tmp/ccEQeqU4.s:489: Error: no such instruction: `vinsertf32x8 $0x1,0(%r13,%r9),%zmm1,%zmm2'
/tmp/ccEQeqU4.s:501: Error: operand type mismatch for `vpackssdw'
/tmp/ccEQeqU4.s:505: Error: no such instruction: `vinsertf32x8 $0x1,0(%r13,%r10),%zmm9,%zmm14'
/tmp/ccEQeqU4.s:507: Error: no such instruction: `vinsertf32x8 $0x1,0(%r13,%r9),%zmm1,%zmm2'
/tmp/ccEQeqU4.s:512: Error: operand type mismatch for `vpackssdw'
/tmp/ccEQeqU4.s:517: Error: no such instruction: `vinsertf32x8 $0x1,(%rax,%r10),%zmm9,%zmm15'
/tmp/ccEQeqU4.s:518: Error: no such instruction: `vinsertf32x8 $0x1,(%rax,%r9),%zmm2,%zmm3'
/tmp/ccEQeqU4.s:521: Error: operand type mismatch for `vpunpcklwd'
/tmp/ccEQeqU4.s:523: Error: operand type mismatch for `vpunpckhwd'
/tmp/ccEQeqU4.s:524: Error: operand type mismatch for `vpunpcklwd'
/tmp/ccEQeqU4.s:525: Error: operand type mismatch for `vpunpckhwd'
/tmp/ccEQeqU4.s:527: Error: operand type mismatch for `vpackssdw'
/tmp/ccEQeqU4.s:528: Error: operand type mismatch for `vpunpcklwd'
/tmp/ccEQeqU4.s:529: Error: operand type mismatch for `vpunpckhwd'
/tmp/ccEQeqU4.s:533: Error: operand type mismatch for `vpunpcklwd'
/tmp/ccEQeqU4.s:534: Error: operand type mismatch for `vpunpckhwd'
/tmp/ccEQeqU4.s:701: Error: no such instruction: `vinsertf32x8 $0x1,(%r15,%r13),%zmm6,%zmm1'
/tmp/ccEQeqU4.s:702: Error: no such instruction: `vinsertf32x8 $0x1,(%r15,%r8),%zmm3,%zmm5'
/tmp/ccEQeqU4.s:705: Error: no such instruction: `vinsertf32x8 $0x1,(%r15,%r10),%zmm9,%zmm10'
/tmp/ccEQeqU4.s:706: Error: no such instruction: `vinsertf32x8 $0x1,(%r15,%rbx),%zmm15,%zmm6'
/tmp/ccEQeqU4.s:717: Error: operand type mismatch for `vpackssdw'
/tmp/ccEQeqU4.s:720: Error: operand type mismatch for `vpackssdw'
/tmp/ccEQeqU4.s:721: Error: operand type mismatch for `vpacksswb'
/tmp/ccEQeqU4.s:722: Error: operand type mismatch for `vpmaxsb'
/tmp/ccEQeqU4.s:726: Error: no such instruction: `vinsertf32x8 $0x1,(%r15,%r13),%zmm8,%zmm9'
/tmp/ccEQeqU4.s:730: Error: no such instruction: `vinsertf32x8 $0x1,(%r15,%r8),%zmm12,%zmm6'
/tmp/ccEQeqU4.s:731: Error: no such instruction: `vinsertf32x8 $0x1,(%r15,%r10),%zmm4,%zmm3'
/tmp/ccEQeqU4.s:733: Error: no such instruction: `vinsertf32x8 $0x1,(%r15,%rbx),%zmm8,%zmm9'
/tmp/ccEQeqU4.s:743: Error: operand type mismatch for `vpackssdw'
/tmp/ccEQeqU4.s:746: Error: operand type mismatch for `vpackssdw'
/tmp/ccEQeqU4.s:747: Error: operand type mismatch for `vpacksswb'
/tmp/ccEQeqU4.s:748: Error: operand type mismatch for `vpmaxsb'
/tmp/ccEQeqU4.s:752: Error: no such instruction: `vinsertf32x8 $0x1,(%r15,%r13),%zmm4,%zmm3'
/tmp/ccEQeqU4.s:756: Error: no such instruction: `vinsertf32x8 $0x1,(%r15,%r8),%zmm9,%zmm10'
/tmp/ccEQeqU4.s:757: Error: no such instruction: `vinsertf32x8 $0x1,(%r15,%r10),%zmm6,%zmm1'
/tmp/ccEQeqU4.s:762: Error: no such instruction: `vinsertf32x8 $0x1,(%r15,%rbx),%zmm4,%zmm5'
/tmp/ccEQeqU4.s:767: Error: operand type mismatch for `vpackssdw'
/tmp/ccEQeqU4.s:771: Error: operand type mismatch for `vpackssdw'
/tmp/ccEQeqU4.s:772: Error: operand type mismatch for `vpacksswb'
/tmp/ccEQeqU4.s:773: Error: operand type mismatch for `vpmaxsb'
/tmp/ccEQeqU4.s:777: Error: no such instruction: `vinsertf32x8 $0x1,(%r15,%r13),%zmm1,%zmm2'
/tmp/ccEQeqU4.s:779: Error: no such instruction: `vinsertf32x8 $0x1,(%r15,%r8),%zmm9,%zmm10'
/tmp/ccEQeqU4.s:783: Error: no such instruction: `vinsertf32x8 $0x1,(%r15,%r10),%zmm11,%zmm6'
/tmp/ccEQeqU4.s:786: Error: no such instruction: `vinsertf32x8 $0x1,(%r15,%rbx),%zmm2,%zmm9'
/tmp/ccEQeqU4.s:795: Error: operand type mismatch for `vpackssdw'
/tmp/ccEQeqU4.s:798: Error: operand type mismatch for `vpackssdw'
/tmp/ccEQeqU4.s:799: Error: operand type mismatch for `vpacksswb'
/tmp/ccEQeqU4.s:800: Error: operand type mismatch for `vpmaxsb'
/tmp/ccEQeqU4.s:804: Error: no such instruction: `vinsertf32x8 $0x1,(%r15,%r13),%zmm1,%zmm3'
/tmp/ccEQeqU4.s:806: Error: no such instruction: `vinsertf32x8 $0x1,(%r15,%r8),%zmm10,%zmm8'
/tmp/ccEQeqU4.s:809: Error: no such instruction: `vinsertf32x8 $0x1,(%r15,%r10),%zmm6,%zmm1'
/tmp/ccEQeqU4.s:814: Error: no such instruction: `vinsertf32x8 $0x1,(%r15,%rbx),%zmm2,%zmm8'
/tmp/ccEQeqU4.s:817: Error: operand type mismatch for `vpackssdw'
...

kpu commented 6 years ago

Looks like the assembler doesn't want to support AVX512BW and AVX512DQ instructions even though the compiler has the intrinsics for them.

Valhalla is running Ubuntu 16.04 and it compiles fine with gcc 5.4.

gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I'm tempted to add a compilation test then just give up on that file if the compiler doesn't like it.

kpu commented 6 years ago

Can you send me the output of as --version? I think your binutils is ancient.

emjotde commented 6 years ago

Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.8/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.8.5-2ubuntu1~14.04.1' --with-bugurl=file:///usr/share/doc/gcc-4.8/README.Bugs --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.8 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.8 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --disable-libmudflap --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.8-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.8-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.8-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.8.5 (Ubuntu 4.8.5-2ubuntu1~14.04.1) 
Using built-in specs.
COLLECT_GCC=g++-5
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 5.4.1-2ubuntu1~14.04' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=gcc4-compatible --disable-libstdcxx-dual-abi --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 5.4.1 20160904 (Ubuntu 5.4.1-2ubuntu1~14.04) 
Using built-in specs.
COLLECT_GCC=g++-7
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 7.2.0-1ubuntu1~14.04' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=gcc4-compatible --disable-libstdcxx-dual-abi --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 7.2.0 (Ubuntu 7.2.0-1ubuntu1~14.04) 

kpu commented 6 years ago

I've added a cmake compilation test to guard against older compilers and assemblers, omitting avx512 in such cases.

gcc 5.4.1 and gcc 7.2.0 are fine, but your binutils is too old for avx512.

Regarding gcc 4, all the functions I wanted are inline and it was an ABI issue so static is nice.

kpu commented 6 years ago

Fixed gcc 4 support with static (it can't do avx512 but it will compile everything else). I was even able to compile intgemm on machines maintained by our computing history department (they prefer the title computing support).

emjotde commented 6 years ago

Compiles. How can I test this? Is it using 8bit automatically on avx512?

kpu commented 6 years ago

It's currently hardcoded to use 8-bit on everything (except autotuning kicks in but it shouldn't do much).

https://github.com/marian-nmt/marian-dev/blob/intgemm/src/graph/expression_operators.cpp#L269
https://github.com/marian-nmt/marian-dev/blob/intgemm/src/graph/expression_operators.cpp#L309

I'm writing code to optimize the minmax.

emjotde commented 6 years ago

It's very slow though, is that the min-max?

kpu commented 6 years ago

Now with vectorized max absolute value for various instruction sets. I also fixed some testing of 7-bit that I had accidentally checked in which was damaging BLEU. Should be much much faster now.

TODO: I have untested and unintegrated code for column selection in PrepareB quantized format. The idea is that we'll quantize and prepare the word embeddings first then column select from that.

emjotde commented 6 years ago

Something seems to be very wrong though. On an Amazon machine with AVX512 the 16bit version runs in 65 seconds, your 8bit version in 254 seconds. BLEU has improved, 26.0 vs 25.9

kpu commented 6 years ago

Column selection in PrepareB quantized format is tested with these functions:

static void Int16::SelectColumnsB(const int16_t *input, int16_t *output, int rows, const int *cols_begin, const int *cols_end);
static void Int8::SelectColumnsB(const int8_t *input, int8_t *output, int rows, const int *cols_begin, const int *cols_end);

It is not integrated yet. The plan is to call PrepareB on W first, do column selection, then send it into GEMM. Which means GEMM shouldn't do PrepareB on it again. So we need some way of tracking this. I can check Type but Type describes element size, not the ordering of the elements so going forward it will be ambiguous whether a value was prepared as an A or as a B.

Moreover, we currently depend on a static_pointer_cast to retrieve the scaling value from the child operator. But that won't work anymore because the child operator will be column selection. I could make a common base class and carry the scaling value. . . but it doesn't really belong in operators anyway.
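One possible way to address both problems raised here, sketched under assumptions: carry the layout and the scaling value with the quantized data instead of inferring them from the element type or from the child operator. `Layout`, `QuantizedMatrix`, and `NeedsPrepareB` are hypothetical names, not Marian or intgemm types.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical layout tag: the element type alone cannot say whether data
// was prepared as an A matrix or as a B matrix.
enum class Layout { RowMajor, PreparedA, PreparedB };

// Hypothetical container that keeps the quantization multiplier next to the
// data, so it does not have to be fetched from a child operator.
struct QuantizedMatrix {
  std::vector<std::int8_t> data;
  Layout layout;
  float quant_mult;
};

// GEMM would skip PrepareB when the input is already in prepared-B format,
// for example after column selection on a prepared matrix.
bool NeedsPrepareB(const QuantizedMatrix &m) {
  return m.layout != Layout::PreparedB;
}
```

Column selection on a prepared B would keep the `PreparedB` tag and copy `quant_mult` through unchanged, so the multiply can still find the scale.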

emjotde commented 6 years ago

Why would it be so much worse than 16-bit? Memoization would not work for 16-bit quantization either?

And the transpose should actually be memoized; it happens before the column selection, right?

kpu commented 6 years ago

Only 8-bit does max absolute value to pick a quantization multiplier, which is never going to be fast. 16-bit just does 1024. Standard practice is to do it once on B and record activations on A with floats to train a quantization multiplier for every layer: https://www.tensorflow.org/performance/quantization .
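For contrast with the 8-bit path, a scalar sketch of the fixed-multiplier 16-bit case: no scan over the data is needed because the multiplier is the constant 1024. The function name and the saturation to [-32767, 32767] are assumptions for illustration, not intgemm's vectorized code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative scalar version: multiply by a fixed constant, round to the
// nearest integer, and saturate rather than wrap on overflow.
std::vector<std::int16_t> QuantizeInt16Fixed(const std::vector<float> &values,
                                             float multiplier = 1024.0f) {
  std::vector<std::int16_t> out(values.size());
  for (std::size_t i = 0; i < values.size(); ++i) {
    float scaled = std::nearbyint(values[i] * multiplier);
    scaled = std::max(-32767.0f, std::min(32767.0f, scaled));
    out[i] = static_cast<std::int16_t>(scaled);
  }
  return out;
}
```

A per-layer multiplier trained offline would give the 8-bit path the same property: the multiplier becomes a stored constant and the max-absolute-value pass disappears from decoding.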

I've pushed a mostly-there version. The code as checked in is working, but slow. I want to comment https://github.com/marian-nmt/marian-dev/blob/intgemm/src/layers/generic.h#L145 (vanilla column selection) and uncomment https://github.com/marian-nmt/marian-dev/blob/intgemm/src/layers/generic.h#L147 (use column selection in prepared format) but it's producing poor quality at the moment for unknown reasons.

Also, https://github.com/marian-nmt/marian-dev/blob/intgemm/src/graph/expression_operators.cpp#L214 is horrible. I can figure out from a tensor what the type is but not while building a graph.

kpu commented 6 years ago

Ok, I have the post-quantization column selection working. Turns out I didn't think to define hash() and equals() for the column selection operator. It's probably cleaner to put this behind CopyCols, though then I'll really need a way to keep track of the layout of the matrix.

Next is seeing how much time is spent finding the max absolute value for A.
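Why the missing hash() and equals() mattered can be shown with a small sketch. It assumes that memoization looks nodes up by hash and then confirms with equals; if the selected columns are not part of either, two different selections can be treated as the same node. `SelectColumnsNode` is a hypothetical stand-in, not Marian's operator class.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for a column-selection graph node.
struct SelectColumnsNode {
  const void *child;              // identity of the input (prepared B) node
  std::vector<std::size_t> cols;  // selected column indices

  // The hash covers the child and every selected column.
  std::size_t hash() const {
    std::size_t seed = std::hash<const void *>()(child);
    for (std::size_t c : cols) {
      // Boost-style hash_combine mixing step.
      seed ^= std::hash<std::size_t>()(c) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    }
    return seed;
  }

  // Two nodes are interchangeable only if they select the same columns
  // from the same child.
  bool equals(const SelectColumnsNode &other) const {
    return child == other.child && cols == other.cols;
  }
};
```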

emjotde commented 5 years ago

@kpu are we keeping this open?

alvations commented 4 years ago

Just to check in on this: this PR is obsolete as of v1.9.0 (which should materialize soon), is that right?

Is the --fp16 option using something else, or @kpu's implementation of the quantization? This one: https://github.com/marian-nmt/marian-dev/tree/master/src/tensors/cpu/sharp? And when we cite, do we cite it as an implementation of https://arxiv.org/pdf/1705.01991.pdf ?

emjotde commented 4 years ago

You are confusing multiple things. --fp16 is GPU-side half-float decoding. 8-bit CPU decoding will arrive in version 1.9. I am reviewing the PR internally this week, should appear before end of week if nothing gets in the way. Thanksgiving holidays might.

What exactly are you using? If it is within Marian, the correct citation is the Marian paper or the recent WNGT paper.

emjotde commented 4 years ago

As for the stuff mentioned in this issue specifically, @XapaJIaMnu wanted to take another attempt at benchmarking with current code?

XapaJIaMnu commented 4 years ago

@emjotde , will do once the 8bit gets merged.