Closed mahudu97 closed 4 years ago
We offer "soa" and "packed" kernels. The simple CSR kernel where never needed in recent past and the intrinisic/static kernel have been enough.
There are plans to pack more values into the kernel, but the benefits so far have been minor. What is you concrete application?
The use case is the sparse matrix multiplication that occurs in PyFR.
I have a method that packs more values into the simple CSR kernel, in my fork. It's part of my thesis (Imperial College, Supervisor Prof. Paul Kelly). Link: https://github.com/mahudu97/libxsmm/commit/815d80bab476a5e884190c9f5be74011b5ba0b7d
The intention is to make a pull request after evaluating within my thesis. In the PR I'd of course provide a detailed explanation of the method / changes.
In summary, it packs up to 120 unique elements in an AVX-512 register file (both DP and SP), and broadcasts with one instruction (3 cycle latency) regardless of the lane the element is stored in.
I had a look at the "soa" kernels. I assume "soa" means Struct of Arrays? (Sorry, I'm not too familiar with some of the matrix terminology, if it's something else.)
I noticed it uses VBROADCASTSX as its move-A instruction. Does this kernel load the values of A from memory? (As opposed to broadcasting values already in registers, as I believe only lane zero can be broadcast using this instruction.)
Finally, I'm not too sure what the packed family of GEMM generators are used for; could you please give me an indication of their use? Thanks.
EDIT: After looking over the role of libxsmm_fsspmdm.c, it appears to provide a wrapper for allocating a basic CSR structure as well as calling the eventual asparse_reg routine. I guess my initial query changes to: why does this same API path not attempt to use the soa kernel if the reg kernel is not available? I have looked around, and there doesn't appear to be a similar API wrapper for the soa kernel.
I also looked closer at the libxsmm_generator_spgemm_csr_asparse kernel, and it seems to not use vectorisation, so that would make sense as to why it's not considered by libxsmm_dfsspmdm_create.
Ah, great, yes, I talked with Freddie and Paul ~8 months ago and we agreed that we should write such a kernel :-) We are happy to take the changes. When Freddie, Paul and I talked, we had 3 kernel variants in discussion:

- RegA with permutes
- RegA with shuffles
- RegA with compact A storage and loading from L1$

I think your code is the second case, correct? Yes, the SoA kernel loads everything from memory, but I believe it's not applicable to PyFR. Since I now have context, I think you are doing exactly the right thing. Some background on the SoA kernels can be found here: https://dial3343.org/, http://short.dial3343.org/isc19pap, https://dial3343.org/2018/06/14/libxsmm.html, http://short.dial3343.org/isc17pap
Wrt PR: I suggest you implement it in a way that we have both versions in the code. That means for < 31 unique NNZ we use the old code and for up to 120 we generate your new code?
Can you elaborate more on the 120 limitation? I would have expected ~224 (288) for DP and 448 (2816) for SP, while leaving 4 registers empty for temp values. 120 would mean you use only 15 registers.
Yes, all the magic of kernel selection is in libxsmm_fsspmdm.c.
Thanks for the links! My code is indeed the RegA with Shuffles case.
The reason for the 120 limitation is that the method's broadcasting only requires one instruction (3 cycle latency on Skylake-SP), namely VSHUFF64X2. This means only one temp register is required, so 30 registers are available to store packed A (the others are used to accumulate C). This leads to the following register file layout.
The reason for storing a packed value twice is that it can then be 'broadcasted' with a single VSHUFF64X2, shown by:
The same source register is used for both source operands. The encoded selector then picks the same 128-bit source section for every 128-bit destination section. (So the 120 limit also applies to SP, as a value needs to be stored 4 times to fill 128 bits.)
These were taken from the chapter I'm now writing for my thesis. I'm working on a more detailed explanation, an excerpt of which I'd be happy to attach to this issue once written.
For more than 120 unique non-zeros, there are probably a few multiple-instruction broadcasting techniques; however, I haven't looked extensively for one.
For a PR, it does make sense to split the strategy, although my evaluation has shown no performance penalty when using the new method for unique NZ <= 31. A split-strategy approach would also make it easy to add a method for over 120 unique NZ later.
A more detailed walk-through of the solution: register_packing.pdf
Awesome, this is great and I also now understand as well the 120 element limitation. How many cases can be covered by this? What is the coverage increase when going from 31->120?
With vpermt2q or vpermt2d we can shuffle arbitrarily (of course at a higher cost). However, this would allow for up to 232 unique values in DP and 464 in SP, assuming we read the 8/16 broadcast permute operands (1 full zmm) from memory, i.e. from the L1 data cache.
The other option is to hold the permute operands in registers; that would mean 21 registers for data in DP: `21*8 = 168`, and `17*16 = 272` for SP. This was the case I mentioned with vperm previously.
I wasn't aware of the vpermt2q instruction! Having looked at it, it appears a single-instruction broadcast is also achievable with it.
With the option to hold the permute operands in registers, does this not leave 22 registers for data in DP? Maybe I haven't spotted why a total of 3 temp registers are required (C accumulation, broadcasted value of A, and ??) through misunderstanding how the instruction works.
Also, I'm not entirely sure how to add new instructions to generator_x86_instructions. I'm not certain what variables such as l_second or l_fpadj do; any links on how to contribute would be appreciated.
But once added, and if vpermt2q/d does in fact allow single-instruction broadcasting, it would be the objectively better solution, so I'd be happy to implement that.
In terms of case coverage, the suite of test operator matrices are examples from PyFR provided by Freddie (170 total):

| Unique NZ | Num Matrices | Coverage (%) |
|---|---|---|
| <= 31 | 113 | 66.5 |
| <= 120 | 144 | 84.7 |
| <= 176 | 146 | 85.9 |
| <= 240 | 155 | 91.2 |
However, all sparse (density < 0.5) operator matrices (110 total) fit within the 120 limitation:

| Unique NZ | Num Matrices | Coverage (%) |
|---|---|---|
| <= 31 | 94 | 85.5 |
| <= 120 | 110 | 100 |
| <= 176 | 110 | 100 |
| <= 240 | 110 | 100 |
The above stats were gathered after 'cleaning' the matrices by coalescing values within 1e-10 of each other. I will confirm with Freddie later this week whether that is OK to do; the cleaning step was inherited from a previous project.
For the raw matrices without cleaning (170 total):

| Unique NZ | Num Matrices | Coverage (%) |
|---|---|---|
| <= 31 | 42 | 24.7 |
| <= 120 | 87 | 51.2 |
| <= 176 | 99 | 58.2 |
| <= 240 | 115 | 67.5 |
For the raw sparse matrices without cleaning (110 total):

| Unique NZ | Num Matrices | Coverage (%) |
|---|---|---|
| <= 31 | 34 | 30.9 |
| <= 120 | 69 | 62.7 |
| <= 176 | 80 | 72.7 |
| <= 240 | 87 | 79.1 |
Yes, all these instructions allow elements to be shuffled around arbitrarily, so we can do a broadcast:
```c
#include <stdlib.h>
#include <stdio.h>
#include <immintrin.h>

int main( int argc, char* argv[] ) {
  float data[16];
  float dest[16];
  int   shuffle[16];
  int   i;

  /* fill the source and set every selector entry to 4,
     i.e. broadcast element 4 into all 16 lanes */
  for ( i = 0; i < 16; ++i ) {
    data[i] = (float)i;
    shuffle[i] = 4;
  }

  /* _mm512_loadu_si512 is the portable spelling of the unaligned
     32-bit integer load */
  __m512i sel = _mm512_loadu_si512( shuffle );
  __m512  in  = _mm512_loadu_ps( data );
  __m512  out = _mm512_permutexvar_ps( sel, in ); /* vpermps */
  _mm512_storeu_ps( dest, out );

  for ( i = 0; i < 16; ++i ) {
    printf("%f ", dest[i]);  /* prints 4.000000 sixteen times */
  }
  printf("\n");
  return 0;
}
```
If it helps you, we would add the following instructions to the encoder: vpermw (already there), vpermd (already there), vpermq, vpermps, vpermpd.
Even the 64-bit data you should be able to handle with vpermd/w, but if it helps you we can add vpermq/ps/pd.
I don't think you need the t2 (overwrite table) and i2 (overwrite index) versions.
Great! I've managed to implement this with VPERMD now, for both DP and SP.
I'll be evaluating with my suite and checking that it also works for SP.
For a PR, after my evaluation I'll make a version that only uses run-time broadcasting if unique NNZ > 31. Should be done by the weekend :)
Feel free to close this issue now and I can open a fresh one for the PR (or leave it open and I'll link to this).
Great! Let's keep the issue open and close it with the PR!
So after slightly adjusting libxsmm_generator_spgemm_csr_asparse_reg and things inside libxsmm_main to test SP, I wasn't getting correct results.
I then tested with the master branch, with those small adjustments as well. After adding tracing prints to confirm that csr_asparse_reg is indeed used, the results of the MM were still not valid.
Seeing as DP is the only precision really supported by this generator at the moment, does it make sense to keep the additional SP code for the new permute broadcasting? (DP is working great.)
How do you test the float version? Can you share your float tester? This one is double only: https://github.com/hfp/libxsmm/blob/master/samples/pyfr/pyfr_driver_asp_reg.c
I have committed some changes as there was a problem with the lazy init, please merge the latest master.
So the float version not working was my fault: a typo in loading the permute operands.
To test, I just had to remove the F64 check in the case LIBXSMM_BUILD_KIND_SREG, to prevent it defaulting to the dense kernel when creating via libxsmm_sfsspmdm_create: https://github.com/mahudu97/libxsmm/commit/aa535270f6eb5b308861a4c2db3744a4ea8c7157
Raising PR now, testing out a further L1 version for the permute operands.
Thanks. Can you please also submit a PR for the change in libxsmm_main.c, as I cannot test F32 with the current tester?
When testing my benchmark with the new master (before any PRs for this issue were merged), I get a seg fault when calling libxsmm_Xfsspmdm_execute, but the JIT process libxsmm_Xfsspmdm_create succeeds, returning a non-null pointer.
This doesn't occur when running against a master from December, so it could be due to some changes I'm unaware of. I tried debugging with GDB but couldn't backtrace into the lib code.
Is a call to this function part of CI? If it's working elsewhere, then the issue would be due to not updating code on my side of the benchmark. Currently I port new additions based on the old master over to the new master for PRs.
@hfp: can you help here?
I'm able to run this code with the latest master: https://github.com/hfp/libxsmm/blob/master/samples/pyfr/pyfr_driver_asp_reg.c
Let me also try to reproduce/check this issue.
> Is a call to this function part of CI?

No. We have no coverage for PyFR reproducers or libxsmm_Xfsspmdm_*.
@alheinecke, was your ask for help about CI coverage?
Both: let's see if we can reproduce once @mahudu97 shares the code, and then let's add this kernel to CI.
We should migrate/reuse some of this content into proper documentation for the sparse domain (e.g., documentation/libxsmm_sp.md).
Was this item taken? Should I still create some test case?
When using libxsmm_dfsspmdm_create to JIT a kernel for sparse A, if the number of unique constants does not fit in a register kernel, the fallback is to use a dense routine. Is there a reason why libxsmm_generator_spgemm_csr_asparse is not considered?
A GitHub search shows that libxsmm_generator_spgemm_csr_asparse / libxsmm_generator_spgemm_csr_kernel is unused throughout the library.