Closed mahudu97 closed 4 years ago
We offer "soa" and "packed" kernels. The simple CSR kernel where never needed in recent past and the intrinisic/static kernel have been enough.
There are plans to pack more values into the kernel, but the benefits so far have been minor. What is you concrete application?
The use case is the sparse matrix multiplication that occurs in PyFR.
I have a method that packs more values into the simple CSR kernel, in my fork. It's part of my thesis (Imperial College, Supervisor Prof. Paul Kelly). Link: https://github.com/mahudu97/libxsmm/commit/815d80bab476a5e884190c9f5be74011b5ba0b7d
The intention is to make a pull request after evaluating within my thesis. In the PR I'd of course provide a detailed explanation of the method / changes.
In summary, it packs up to 120 unique elements in an AVX-512 register file (both DP and SP), and broadcasts with one instruction (3 cycle latency) regardless of the lane the element is stored in.
I had a look at the "soa" kernels. I assume "soa" means Struct of Arrays? (Sorry, I'm not too familiar with some of the matrix terminology, if it's something else.)
I noticed it uses VBROADCASTSX as its move-A instruction. Does this kernel load the values of A from memory? (As opposed to broadcasting values already in registers, as I believe only lane zero can be broadcast using this instruction.)
Finally, I'm not too sure what the packed family of GEMM generators are used for; could you please give me an indication of their use? Thanks.
EDIT: After looking over the role of libxsmm_fsspmdm.c, it appears to provide a wrapper for allocating a basic CSR structure as well as calling the eventual asparse_reg routine. I guess my initial query changes to: why does this same API path not attempt to use the soa kernel if the reg kernel is not available? I have looked around, and there doesn't appear to be a similar API wrapper for the soa kernel.
I also looked closer at the libxsmm_generator_spgemm_csr_asparse kernel, and it seems to not use vectorisation, so that would make sense as to why it's not considered by libxsmm_dfsspmdm_create.
Ah, great, yes, I talked with Freddie and Paul ~8 months ago and we agreed that we should write such a kernel :-) We are happy to take the changes. When Freddie, Paul and I talked, we had 3 kernel variants in discussion:

- RegA with permutes
- RegA with shuffles
- RegA with compact A storage and loading from L1$

I think your code is the second case, correct? Yes, the SoA kernel loads everything from memory, but I believe it's not applicable to PyFR. Since I now have context, I think you are doing exactly the right thing. Some background on the SoA kernels can be found here: https://dial3343.org/, http://short.dial3343.org/isc19pap, https://dial3343.org/2018/06/14/libxsmm.html, http://short.dial3343.org/isc17pap
Wrt PR: I suggest you implement it in a way that we have both versions in the code. That means for < 31 unique NNZ we use the old code and for up to 120 we generate your new code?
Can you elaborate more on the 120 limitation? I would have expected ~224 (288) for DP and 448 (2816) for SP, while leaving 4 registers empty for temp values. 120 would mean you use only 15 registers.
Yes, all the magic of kernel selection is in libxsmm_fsspmdm.c.
Thanks for the links! My code is indeed the RegA with Shuffles case.
The reason for the 120 limitation is that the method's broadcasting only requires one instruction (3 cycle latency on Skylake-SP), namely VSHUFF64X2. This means only one temp register is required, so 30 registers are available to store packed A (the others are used to accumulate C). This leads to the following register file layout.
The reason for storing a packed value twice is that it can then be 'broadcasted' with a single VSHUFF64X2, shown by:
The same source register is used for both source operands. The encoded selector then picks the same 128-bit source section for every 128-bit destination section. (So the 120 limit also applies to SP, as a value needs to be stored 4 times to fill 128 bits.)
These were taken from the chapter I'm now writing for my thesis. I'm working on a more detailed explanation, an excerpt of which I'd be happy to attach to this issue once written.
For more than 120 unique non-zeros, there are probably a few multiple-instruction broadcasting techniques; however, I haven't looked extensively for one.
For a PR, it does make sense to split the strategy, although my evaluation has shown no performance penalty when using the new method for unique NZ <= 31. A split-strategy approach would also make it easy to add a method for over 120 unique NZ later.
A more detailed walk-through of the solution: register_packing.pdf
Awesome, this is great and I also now understand as well the 120 element limitation. How many cases can be covered by this? What is the coverage increase when going from 31->120?
With vpermt2q or vpermt2d we can shuffle arbitrarily (of course at a higher cost). However, this would allow for up to 232 unique values in DP and 464 in SP, assuming we read the 8/16 broadcast permute operands (1 full zmm) from memory, i.e. from the L1 data cache.
The other option is to hold the permute operands in registers; that would mean 21 registers for data in DP: `21*8 = 168`, and `17*16 = 272` for SP. This was the case I mentioned with vperm previously.
I wasn't aware of the vpermt2q instruction! Having looked at it, it appears a single-instruction broadcast is also achievable with it.
With the option to hold the permute operands in registers, does this not leave 22 registers for data in DP? Maybe I haven't spotted why a total of 3 temp registers are required (C accumulation, broadcasted value of A, and ??) through misunderstanding how the instruction works.
Also, I'm not entirely sure how to add new instructions to generator_x86_instructions. I'm not certain what variables such as l_second or l_fpadj do; any links on how to contribute would be appreciated.
But once added, and if vpermt2q/d does in fact allow single-instruction broadcasting, it would be the objectively better solution, so I'd be happy to implement that.
In terms of case coverage, the suite of test operator matrices are examples from PyFR provided by Freddie (170 total):

| Unique NZ | Num Matrices | Coverage (%) |
|---|---|---|
| <= 31 | 113 | 66.5 |
| <= 120 | 144 | 84.7 |
| <= 176 | 146 | 85.9 |
| <= 240 | 155 | 91.2 |
However, all sparse (density < 0.5) operator matrices (110 total) fit within the 120 limitation:

| Unique NZ | Num Matrices | Coverage (%) |
|---|---|---|
| <= 31 | 94 | 85.5 |
| <= 120 | 110 | 100 |
| <= 176 | 110 | 100 |
| <= 240 | 110 | 100 |
The above stats were gathered after 'cleaning' the matrices by coalescing values within 1e-10 of each other. I will confirm with Freddie later this week whether that is OK to do; the cleaning step was inherited from a previous project.
For the raw matrices without cleaning (170 total):

| Unique NZ | Num Matrices | Coverage (%) |
|---|---|---|
| <= 31 | 42 | 24.7 |
| <= 120 | 87 | 51.2 |
| <= 176 | 99 | 58.2 |
| <= 240 | 115 | 67.5 |
For the raw sparse matrices without cleaning (110 total):

| Unique NZ | Num Matrices | Coverage (%) |
|---|---|---|
| <= 31 | 34 | 30.9 |
| <= 120 | 69 | 62.7 |
| <= 176 | 80 | 72.7 |
| <= 240 | 87 | 79.1 |
Yes, all these instructions allow elements to be shuffled around arbitrarily, so we can do a broadcast:
```c
#include <stdlib.h>
#include <stdio.h>
#include <immintrin.h>

int main( int argc, char* argv[] ) {
  float data[16];
  float dest[16];
  int   shuffle[16];
  int   i;

  /* fill the source and set every selector entry to 4,
     i.e. broadcast element 4 into all 16 lanes */
  for ( i = 0; i < 16; ++i ) {
    data[i] = (float)i;
    shuffle[i] = 4;
  }

  /* _mm512_loadu_si512 is the portable spelling of the unaligned
     32-bit integer load */
  __m512i sel = _mm512_loadu_si512( shuffle );
  __m512  in  = _mm512_loadu_ps( data );
  __m512  out = _mm512_permutexvar_ps( sel, in ); /* vpermps */
  _mm512_storeu_ps( dest, out );

  for ( i = 0; i < 16; ++i ) {
    printf("%f ", dest[i]);  /* prints 4.000000 sixteen times */
  }
  printf("\n");
  return 0;
}
```
If it helps you, we would add the following instructions to the encoder: vpermw (already there), vpermd (already there), vpermq, vpermps, vpermpd.
Even the 64-bit data you should be able to handle with vpermd/w, but if it helps you we can add vpermq/ps/pd.
I don't think you need the t2 (overwrite table) and i2 (overwrite index) versions.
Great! I've managed to implement this with VPERMD now, for both DP and SP.
I'll be evaluating with my suite and checking that it also works for SP.
For a PR, after my evaluation I'll make a version that only uses run-time broadcasting if unique NNZ > 31. Should be done by the weekend :)
Feel free to close this issue now and I can open a fresh one for the PR (or leave it open and I'll link to this).
Great! Let's keep the issue open and close it with the PR!
So after slightly adjusting libxsmm_generator_spgemm_csr_asparse_reg and things inside libxsmm_main to test SP, I wasn't getting correct results.
I then tested with the master branch, with those small adjustments as well. After adding tracing prints to confirm that csr_asparse_reg is indeed used, the results of the MM were still not valid.
Seeing as DP is the only precision really supported by this generator at the moment, does it make sense to keep the additional SP code for the new permute broadcasting? (DP is working great.)
How do you test the float version? Can you share your float tester? This one is double only: https://github.com/hfp/libxsmm/blob/master/samples/pyfr/pyfr_driver_asp_reg.c
I have committed some changes as there was a problem with the lazy init, please merge the latest master.
So the float version not working was my fault: a typo in loading the permute operands.
To test, I just had to remove the F64 check in the case LIBXSMM_BUILD_KIND_SREG, to prevent it defaulting to the dense kernel when creating via libxsmm_sfsspmdm_create: https://github.com/mahudu97/libxsmm/commit/aa535270f6eb5b308861a4c2db3744a4ea8c7157
Raising PR now, testing out a further L1 version for the permute operands.
Thanks. Can you please also submit a PR for the change in libxsmm_main.c, as I cannot test F32 with the current tester?
When testing my benchmark with the new master (before any PRs for this issue were merged), I get a seg fault when calling libxsmm_Xfsspmdm_execute, but the JIT process libxsmm_Xfsspmdm_create succeeds, returning a non-null pointer.
This doesn't occur when running against a master from December, so it could be due to some changes I'm unaware of. I tried debugging with GDB but couldn't backtrace into the lib code.
Is a call to this function part of CI? If it's working elsewhere, then the issue would be due to not updating code on my side of the benchmark. Currently I port new additions based on the old master over to the new master for PRs.
@hfp: can you help here?
I'm able to run this code with the latest master: https://github.com/hfp/libxsmm/blob/master/samples/pyfr/pyfr_driver_asp_reg.c
Let me also try to reproduce/check this issue.
> Is a call to this function part of CI?

No. We have no coverage for PyFR reproducers or libxsmm_Xfsspmdm_*.
@alheinecke, was your ask for help about CI coverage?
Both: let's see if we can reproduce once @mahudu97 shares the code, and then let's add this kernel to CI.
We should migrate/reuse some of this content into proper documentation for the sparse domain (e.g., documentation/libxsmm_sp.md).
Was this item taken? Should I still create some test case?
When using libxsmm_dfsspmdm_create to JIT a kernel for sparse A, if the number of unique constants does not fit in a register kernel, the fallback is to use a dense routine. Is there a reason why libxsmm_generator_spgemm_csr_asparse is not considered?
A GitHub search shows that libxsmm_generator_spgemm_csr_asparse / libxsmm_generator_spgemm_csr_kernel is unused throughout the library.