giuseros opened this issue 2 years ago
cc: @stevenvar
Thanks for your question, this is a missing feature that recent developments have pushed back onto our stack of shorter-term items. I'd recommend you come ask questions in the public IREE chat channel named "codegen" for low-latency iteration: https://discord.gg/ZNtWrXF6.
There are a few things intersecting here but a rough summary is:
I think 4. is relatively easy to get started on as far as extending the op semantics/verification/tests and tracking uses to ensure transformations fail in the presence of this `permutation_map`. Vectorization should also be reasonably easy by just inserting the `vector.transpose` between the read/write. Extensions to hoist padding are a bit more involved but we know what to do.
Do you guys want to take ownership of point 4. and start working on core MLIR patches?
Hi Nicolas, thanks for the answer. Yes, @stevenvar is having a look at this. We saw two possibilities: a) having a pass that hoists the transpose out of the micro-kernel; b) hoisting the transpose "ab initio", along the lines of point 4 in your list.
Do you think that a) is the wrong way to go (or that it is harder to do than b))?
About your point 1., do you mean x86 has got a `fmla vec, vec, scalar`? On Arm there is an indexed fmla, `fmla vec_c, vec_a, vec_b[i]`, that broadcasts the i-th lane of `vec_b` into a logical vector `broadcast` and then does `fmla vec_c, vec_a, broadcast`. The point is that `vec_b` is still a vector that needs to be loaded from memory, rather than a single scalar.
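A small sketch of the indexed-FMLA semantics described above, in plain Python (the function name and values are illustrative, not any real API):

```python
def fmla_indexed(vec_c, vec_a, vec_b, i):
    """Arm-style indexed FMLA: broadcast lane i of vec_b into a
    logical vector, then do vec_c += vec_a * broadcast element-wise.
    vec_b is still a full vector register; only lane i is used."""
    broadcast = [vec_b[i]] * len(vec_a)
    return [c + a * b for c, a, b in zip(vec_c, vec_a, broadcast)]

# Lane 1 of vec_b (the value 20) multiplies every lane of vec_a.
print(fmla_indexed([0, 0], [2, 3], [10, 20], 1))  # -> [40, 60]
```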
> Do you think that a) is the wrong way to go (or it is harder to do than b) )?
I don't think a) is wrong in itself, but it is definitely harder given the state of the world, and there are also tradeoffs + composability differences:
> About your point 1., do you mean x86 has got a `fmla vec, vec, scalar`?
There is an instruction `vfmadd231ps zmm0,zmm4,DWORD PTR [rsi+0x4]{1to16}`.
See slide 42 of this presentation: https://drive.google.com/corp/drive/folders/1lLhWopx_WCtFq3gTDGVJEzV9hFD7dwmI.
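To contrast with the Arm form above, a plain-Python sketch of what that x86 instruction does (the function name and values are illustrative): the third operand is a single scalar loaded from memory and broadcast to all lanes, rather than a lane of a vector register.

```python
def vfmadd_broadcast(acc, vec, mem, offset):
    """x86-style FMA with embedded broadcast ({1to16}):
    acc += vec * mem[offset], where mem[offset] is one scalar
    read from memory and broadcast to every lane."""
    s = mem[offset]
    return [c + v * s for c, v in zip(acc, vec)]

# The scalar 7 (mem[1]) multiplies every lane of vec.
print(vfmadd_broadcast([1, 1], [2, 3], [5, 7, 9], 1))  # -> [15, 22]
```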
Thanks a lot for all the explanation. So yes, we can take ownership of point 4.
Hi @nicolasvasilache, all. Before I venture into writing a pass, I was wondering if you had already thought about how transposition is handled in the code.
**How (I think) transposition is handled in the current code**
During packing:

- store the data in a buffer of type `memref<?x?x?xKx1>`

In the micro-kernel:

- load from the `Kx1` buffer
- `vector.transpose` to have our `1xK` vector
- `outerproduct`

Is this what we are doing, or am I missing something?
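The flow above can be simulated in plain Python (shapes and values are illustrative): the buffer is packed in `Kx1` layout, so every micro-kernel invocation has to transpose it to `1xK` before the outer product.

```python
def transpose(m):
    """Swap rows and columns of a 2-D list (stand-in for vector.transpose)."""
    return [list(row) for row in zip(*m)]

def outerproduct(col, row, acc):
    """acc[i][j] += col[i] * row[j] (stand-in for vector.outerproduct)."""
    return [[acc[i][j] + col[i] * row[j] for j in range(len(row))]
            for i in range(len(col))]

K = 3
packed_b = [[10], [20], [30]]       # Kx1 buffer produced by packing
a_col = [1, 2]                      # column of A feeding the outer product
acc = [[0] * K for _ in range(2)]

# Micro-kernel: transpose Kx1 -> 1xK on every iteration, then outer product.
b_row = transpose(packed_b)[0]
acc = outerproduct(a_col, b_row, acc)
print(acc)  # -> [[10, 20, 30], [20, 40, 60]]
```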
**What I would like to have**

Is it possible to hoist the `vector.transpose` into the packing phase? I.e.:

During packing:

- `vector.transpose` and store in a buffer of type `memref<?x?x?x1xK>`

In the micro-kernel:

- `vector.load` from the `1xK` buffer
- `outerproduct`
Is this something hard to do?
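In the same plain-Python terms as the sketch above (values illustrative), the proposed variant does the transpose once during packing, so the buffer is already `1xK` and the micro-kernel only loads and runs the outer product:

```python
def outerproduct(col, row, acc):
    """acc[i][j] += col[i] * row[j] (stand-in for vector.outerproduct)."""
    return [[acc[i][j] + col[i] * row[j] for j in range(len(row))]
            for i in range(len(col))]

K = 3
# Packing: store in 1xK layout directly (the transpose is hoisted here,
# paid once instead of on every micro-kernel iteration).
packed_b = [[10, 20, 30]]           # 1xK buffer
a_col = [1, 2]
acc = [[0] * K for _ in range(2)]

# Micro-kernel: a plain load, no per-iteration transpose.
b_row = packed_b[0]
acc = outerproduct(a_col, b_row, acc)
print(acc)  # -> [[10, 20, 30], [20, 40, 60]], same result as before
```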
Thanks, Giuseppe