This proposal is about adding ternary shuffles, similar to the PULP `pv.shuffle2.{b,h}` instructions, but with the byte/register-select bits held in each element's own lane. This makes data-dependent shuffling easier and gives symmetry with `xperm8`. The downside is that the index mask can no longer be loaded with a single `addi` on RV32 (though it is usually loop-invariant anyway).
The result should also be zeroed when an index is out of bounds, i.e. when any bit above the register-select bit is set (similar to `xperm8`).
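To make the intended semantics concrete, here is a small C reference model for XLEN=32. It is a sketch under assumptions: the name `shuffle2_rv32`, the operand order, and the choice that a clear register-select bit picks `rs1` while a set one picks `rs2` are all hypothetical and not taken from any ratified extension. Each element-sized lane of `idx` indexes into the 2×XLEN table formed by the two sources, and any bit above the register-select bit zeroes the lane, as `xperm8` does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference model of the proposed ternary shuffle, XLEN=32.
 * esize = 8 models a ".b" form, esize = 16 a ".h" form (both assumed).
 * With n = 32/esize elements per register, each lane of `idx`:
 *   - low bits      -> element within the chosen register,
 *   - next bit      -> register select (0: rs1, 1: rs2),
 *   - any higher bit set -> result lane is zero (like xperm8). */
static uint32_t shuffle2_rv32(uint32_t rs1, uint32_t rs2, uint32_t idx,
                              unsigned esize) {
    assert(esize == 8 || esize == 16);
    const unsigned n = 32u / esize;             /* elements per register */
    const uint32_t emask = (1u << esize) - 1u;  /* mask for one element */
    uint32_t rd = 0;
    for (unsigned i = 0; i < n; i++) {
        uint32_t sel = (idx >> (i * esize)) & emask; /* index in lane i */
        uint32_t elem = 0;                           /* out of bounds -> 0 */
        if (sel < 2u * n) {
            uint32_t src = (sel & n) ? rs2 : rs1;    /* register-select bit */
            unsigned k = sel & (n - 1u);             /* element within src */
            elem = (src >> (k * esize)) & emask;
        }
        rd |= elem << (i * esize);
    }
    return rd;
}
```

For example, with rs1 = 0x33221100, rs2 = 0x77665544 and idx = 0x00070204, the byte form returns 0x00772244 (lanes 0 and 2 read from rs2 because bit 2 of their index is set). Setting a lane's index to 0x08 or above zeroes that lane, just as an index of 4 or more does for `xperm8` on RV32.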
These instructions have proven useful in convolution and matrix-multiplication kernels, but in some cases their destructive form (the destination register is also a source) costs extra moves (e.g. listing 1 in [1], table 5.9 in [2]).
Because of this, I think the new shuffle should be an R4-type ternary instruction, like `cmix` (aka BPICK) or the funnel shifts.
Otherwise, two variants of the same instruction would be needed, and that still would not solve the issues in [1] and [2].
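For comparison, the byte form can already be emulated on RV32 with the Zbkx `xperm8`: XOR every index byte with 0x04 (a loop-invariant constant), apply `xperm8` to each source, and OR the results. The sketch below assumes the same hypothetical semantics as above (bit 2 of each byte index selects `rs2`, indices of 8 or more give zero); the function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Model of Zbkx xperm8 on RV32: each byte of `idx` selects a byte of `rs`;
 * indices >= 4 produce zero. */
static uint32_t xperm8_rv32(uint32_t rs, uint32_t idx) {
    uint32_t rd = 0;
    for (unsigned i = 0; i < 4; i++) {
        uint32_t sel = (idx >> (8 * i)) & 0xFFu;
        if (sel < 4u)
            rd |= ((rs >> (8 * sel)) & 0xFFu) << (8 * i);
    }
    return rd;
}

/* Hypothetical byte shuffle over {rs2:rs1}, written directly:
 * indices 0..3 read rs1, 4..7 read rs2, >= 8 give zero. */
static uint32_t shuffle2_b_rv32(uint32_t rs1, uint32_t rs2, uint32_t idx) {
    uint32_t rd = 0;
    for (unsigned i = 0; i < 4; i++) {
        uint32_t sel = (idx >> (8 * i)) & 0xFFu;
        if (sel < 8u) {
            uint32_t src = (sel & 4u) ? rs2 : rs1;
            rd |= ((src >> (8 * (sel & 3u))) & 0xFFu) << (8 * i);
        }
    }
    return rd;
}

/* Emulation with existing instructions: xor, xperm8, xperm8, or.
 * Flipping bit 2 of each index maps 4..7 to 0..3 (valid for rs2) and
 * 0..3 to 4..7 (out of range, so zero); indices >= 8 stay out of range. */
static uint32_t shuffle2_b_emulated(uint32_t rs1, uint32_t rs2, uint32_t idx) {
    const uint32_t flip = 0x04040404u;  /* loop-invariant constant */
    return xperm8_rv32(rs1, idx) | xperm8_rv32(rs2, idx ^ flip);
}

/* Checks that the emulation matches the direct model for one input. */
static void check_equivalence(uint32_t rs1, uint32_t rs2, uint32_t idx) {
    assert(shuffle2_b_emulated(rs1, rs2, idx) == shuffle2_b_rv32(rs1, rs2, idx));
}
```

Under these assumptions, a single ternary shuffle would replace the `xor`/`xperm8`/`xperm8`/`or` sequence and free the register holding the constant.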
The proposed instructions would complement the existing `xperm8` and `xperm4` from Zbkx.

[1] https://arxiv.org/pdf/2004.11690.pdf
[2] https://webthesis.biblio.polito.it/18144/1/tesi.pdf