Closed bwesterb closed 11 months ago
@llvm/issue-subscribers-backend-aarch64
That would indeed provide a significant speedup of some cryptographic primitives on AArch64.
@llvm/issue-subscribers-good-first-issue
Hi! I'd like to work on this issue.
From what I understand from the shared example:
We generate a bcax instr for st[i * 5 + j] = t[j] ^ (~t[(j + 1) % 5] & t[(j + 2) % 5]);
And for st[j * 5 + i] ^= t[(i + 4) % 5] ^ math.rotl(@Vector(l, T), t[(i + 1) % 5], 1)
We generate:
ushr.2d v25, v12, #63
shl.2d v31, v12, #1
orr.16b v31, v31, v25
eor3.16b v25, v10, v31, v5
eor3.16b v14, v10, v31, v20
eor3.16b v20, v10, v31, v21
eor3.16b v21, v10, v31, v22
eor3.16b v24, v10, v31, v23
This uses an unsigned shift right and a shift left to first perform the rotation, and then performs a 3 way xor to get the results. Instead of this, we want to see something like:
rax1.2d v31, v12, v10
eor.16b v25, v31, v5
eor.16b v14, v31, v20
eor.16b v20, v31, v21
eor.16b v21, v31, v22
eor.16b v24, v31, v23
Does this look right?
These should match the instructions in isolation: https://godbolt.org/z/WMv8csvo4 I believe the rax1 matches what you are saying.
See https://reviews.llvm.org/D120112 and https://reviews.llvm.org/D108793 for where the existing eor3 and bcax instructions were added.
Does this look right?
@BK1603 Yeah. For reference, I expect roughly something like this in the end.
Thanks @davemgreen and @bwesterb
I'll start working on this. Will raise more questions here in case I get stuck.
It sounds like you are on the right track. If you take a test file like:
define <2 x i64> @rax1(<2 x i64> %x, <2 x i64> %y) {
%a = call <2 x i64> @llvm.fshl.v2i64(<2 x i64> %y, <2 x i64> %y, <2 x i64> <i64 1, i64 1>)
%b = xor <2 x i64> %x, %a
ret <2 x i64> %b
}
declare <2 x i64> @llvm.fshl.v2i64(<2 x i64>, <2 x i64>, <2 x i64>)
And run it like this:
bin/llc -mtriple aarch64 -mattr=+sha3 test.ll -debug
You should see debug output that includes:
Optimized legalized selection DAG: %bb.0 'rax1:'
SelectionDAG has 14 nodes:
t0: ch,glue = EntryToken
t4: v2i64,ch = CopyFromReg t0, Register:v2i64 %1
t2: v2i64,ch = CopyFromReg t0, Register:v2i64 %0
t22: v2i64 = AArch64ISD::VSHL t4, Constant:i32<1>
t24: v2i64 = AArch64ISD::VLSHR t4, Constant:i32<63>
t20: v2i64 = or t22, t24
t8: v2i64 = xor t2, t20
t10: ch,glue = CopyToReg t0, Register:v2i64 $q0, t8
t11: ch = AArch64ISD::RET_GLUE t10, Register:v2i64 $q0, t10:1
Those are the nodes that will need to be matched, some of which will be AArch64 specific nodes.
Hi @davemgreen
Thanks a lot for the help! The debug output from llc was immensely helpful. I have made a commit and pushed to fabricator here:
https://reviews.llvm.org/D147887
I had a question, we chose to use the arm instructions vshl
vlshr
for pattern matching. Although in the debug output, I saw that we did create slr
and shl
nodes as well. Is there a reason as to why we did not use them as inputs for pattern matching?
Something like:
def : Pat <(xor (or ((slr (vi264 V128:$Vm), (i32 63)), (shl (v2i64 V128:$Vm), (i32 1))), (v2164 V128:$Vn)),
(RAX1 (v2i64 V128:$Vm), (v2i64 V128:$Vn)>
Hi @davemgreen
I fixed the operand order in my commit. Also I seem to be stuck again. For XAR we want to be able to rotate the element by any variable value. For which the dag nodes generated are as follows:
Optimized legalized selection DAG: %bb.0 'xar_54:'
SelectionDAG has 14 nodes:
t0: ch,glue = EntryToken
t2: v2i64,ch = CopyFromReg t0, Register:v2i64 %0
t4: v2i64,ch = CopyFromReg t0, Register:v2i64 %1
t5: v2i64 = xor t2, t4
t24: v2i64 = AArch64ISD::VSHL t5, Constant:i32<10>
t26: v2i64 = AArch64ISD::VLSHR t5, Constant:i32<54>
t22: v2i64 = or t24, t26
t10: ch,glue = CopyToReg t0, Register:v2i64 $q0, t22
t11: ch = AArch64ISD::RET_GLUE t10, Register:v2i64 $q0, t10:1
I was trying something like:
def SubtractFrom64 : SDNodeXForm<timm, [{ return 64 - N->getZextValue(); }]>;
def : Pat<(or (Aarch64vshl (xor (v2i64 V128:$Vn), (v2i64 V128:$Vm)), (timm0_63:$imm)), (AArch64vlshr (xor (v2i64 V128:$Vn), (v2i64 V128:$Vm)), (SubtractFrom64 timm0_63:$imm)))>;
But ran into an error saying I can not use SubtractFrom64 as an input pattern (why is that so?). Could you please tell me if there are any instructions with similar requirements that I could learn from? Are there no ways to look for immediates that follow a constraint in tablegen?
I asked around at discord and was told that I won't be able to get a pattern to match for this. I'd have to look at both immediates and ensure that they add upto 64. And that this would be possible by altering the switch statement in AArch64DAGToDAGISel::Select
Yeah that might be correct. It may be possible to add a complex pattern that detects the or(AArch64ISD::VSHL, AArch64ISD::VLSHR)
with matching shift amounts, but at that point you might as well do the matching in AArch64DAGToDAGISel.
Hi @davemgreen I was trying something like this:
bool AArch64DAGToDAGISel::SelectXAR(SDNode *N) {
assert(N->getOpcode() == ISD::OR && "Expected OR instruction");
SDNode *N0 = cast<SDNode>(N->getOperand(0));
SDNode *N1 = cast<SDNode>(N->getOperand(1));
if (!(N0->getOpcode() == AArch64ISD::VSHL) || !(N1->getOpcode() == AArch64ISD::VLSHR))
return false;
if (!(N0->getOperand(0)->getOpcode() == ISD::XOR) || !(N1->getOperand(0)->getOpcode() == ISD::XOR))
return false;
SDNode *XOR = cast<SDNode>(N0->getOperand(0));
SDValue R1 = XOR->getOperand(0);
SDValue R2 = XOR->getOperand(1);
SDValue imm = N1->getOperand(1);
unsigned HsAmt = cast<ConstantSDNode>(N0->getOperand(1))->getZExtValue();
unsigned ShAmt = cast<ConstantSDNode>(N1->getOperand(1))->getZExtValue();
if (!(HsAmt + ShAmt == 64))
return false;
SDValue Ops[] = { R1, R2, imm };
CurDAG->SelectNodeTo(N, AArch64::XAR, MVT::i64, Ops);
return false;
}
(I know there are a lot of rough edges to this, please let me know things that I can fix that you can spot right away.)
and then in the Select
function if we can use the following condition check:
case ISD::OR:
if (tryBitfieldInsertOp(Node))
return;
// Try matching XAR instruction here
if (SelectXAR(Node))
return;
break;
But this leads to the following assert failing in SelectCodeCommon
`!NodeToMatch->isMachineOpcode() && "Node already selected!"'
I think it is happening becuase I used AArch64::XAR opcode. I tried using AArch64ISD::XAR but there is no such value present in the enum. Can you please guide me at how I should proceed with this?
Found the issue, return should be true
, I'll make a patch and send it for review.
Hi @davemgreen Made a patch for this here: https://reviews.llvm.org/D150076
Please review and let me know, thanks!
Any plans on picking this up?
Sorry for the inactivity on the patch, @davemgreen I have pushed another patch resolving your comments. Can you please review? https://reviews.llvm.org/D150076
Hi @davemgreen Updated the patch again, notifying here as well in case phabricator notifications do not work post workflow change.
Edit: If everything looks good can you please merge the change as well? I don't have the permissions for a merge. Can use the following for sign off:
Shreyansh Chouhan <chouhan.shreyansh2702@gmail.com>
LLVM already generates BCAX and EOR3 from the ARMv8.2-SHA crypto extension set. It doesn't for RAX1 and XAR yet.