LLVM-BOLT How to prevent BOLT from using armv8.1 instructions on armv8 machine?

romanovj commented 6 months ago

I have compiled llvm-bolt in termux glibc environment (android). My kernel is using armv8-a instruction set. I don't know why bolt is using stadd instruction when adding instrumentation.

yota9 commented 6 months ago

Hello. This is instrumentation limitation, we're only supporting it for machines with LSE extension currently.

romanovj commented 6 months ago

@yota9 So what are my options for cortex-a53 device? Can i optimize binary/library with profile? If so then I need to use coresight-etm, right? Or maybe there is something else.

yota9 commented 6 months ago

@romanovj The only 2 options are:

Instrumentation with LSE.
Perf without LBR (but it wouldn't have great optimisation results)

As for hardware traces - I didn't heard for somebody to use them and there is no BOLT support for them, you would have to translate the profile to the BOLT's fdata file format by your self somehow. Or, of course, you may port instrumentation to use the exclusive memory operations for non-LSE targets... So I believe currently there is no great out-of-the box solution for you :(

romanovj commented 6 months ago

@yota9 Well, thank you! I will try to replace LSE instructions.

aaupov commented 6 months ago

@romanovj The only 2 options are:

Instrumentation with LSE.

Perf without LBR (but it wouldn't have great optimisation results)

As for hardware traces - I didn't heard for somebody to use them and there is no BOLT support for them, you would have to translate the profile to the BOLT's fdata file format by your self somehow. Or, of course, you may port instrumentation to use the exclusive memory operations for non-LSE targets... So I believe currently there is no great out-of-the box solution for you :(

Regarding converting etm trace to fdata: please check this branch https://github.com/lanza/llvm-project/commits/lanzabolt/

romanovj commented 6 months ago

my ugly code

diff --git a/bolt/lib/Target/AArch64/AArch64MCPlusBuilder.cpp b/bolt/lib/Target/AArch64/AArch64MCPlusBuilder.cpp
index 0ae9d3668..400840be1 100644
--- a/bolt/lib/Target/AArch64/AArch64MCPlusBuilder.cpp
+++ b/bolt/lib/Target/AArch64/AArch64MCPlusBuilder.cpp
@@ -109,6 +109,17 @@ static void atomicAdd(MCInst &Inst, MCPhysReg RegTo, MCPhysReg RegCnt) {
   Inst.addOperand(MCOperand::createReg(RegTo));
 }

+static void createIncrement(
+      MCInst &Inst, MCPhysReg Reg, int Size,
+      bool NoFlagsClobber = false /*unused for AArch64*/) {
+    Inst.setOpcode(AArch64::ADDXri);
+    Inst.clear();
+    Inst.addOperand(MCOperand::createReg(Reg));
+    Inst.addOperand(MCOperand::createReg(Reg));
+    Inst.addOperand(MCOperand::createImm(Size));
+    Inst.addOperand(MCOperand::createImm(0));
+ }
+
 static void createMovz(MCInst &Inst, MCPhysReg Reg, uint64_t Imm) {
   assert(Imm <= UINT16_MAX && "Invalid Imm size");
   Inst.clear();
@@ -118,6 +129,7 @@ static void createMovz(MCInst &Inst, MCPhysReg Reg, uint64_t Imm) {
   Inst.addOperand(MCOperand::createImm(0));
 }

+
 static InstructionListType createIncMemory(MCPhysReg RegTo, MCPhysReg RegTmp) {
   InstructionListType Insts;
   Insts.emplace_back();
@@ -1560,7 +1572,7 @@ public:
   createInstrIncMemory(const MCSymbol *Target, MCContext *Ctx, bool IsLeaf,
                        unsigned CodePointerSize) const override {
     unsigned int I = 0;
-    InstructionListType Instrs(IsLeaf ? 12 : 10);
+    InstructionListType Instrs(IsLeaf ? 13 : 11);

     if (IsLeaf)
       createStackPointerIncrement(Instrs[I++], 128);
@@ -1571,10 +1583,13 @@ public:
     std::copy(Addr.begin(), Addr.end(), Instrs.begin() + I);
     I += Addr.size();
     storeReg(Instrs[I++], AArch64::X2, AArch64::SP);
-    InstructionListType Insts = createIncMemory(AArch64::X0, AArch64::X2);
-    assert(Insts.size() == 2 && "Invalid Insts size");
-    std::copy(Insts.begin(), Insts.end(), Instrs.begin() + I);
-    I += Insts.size();
+    loadReg(Instrs[I++], AArch64::X2, AArch64::X0);
+//    InstructionListType Insts = createIncMemory(AArch64::X0, AArch64::X2);
+ //   assert(Insts.size() == 2 && "Invalid Insts size");
+ //   std::copy(Insts.begin(), Insts.end(), Instrs.begin() + I);
+ //   I += Insts.size();
+    createIncrement(Instrs[I++], AArch64::X2, 1);
+    storeReg(Instrs[I++], AArch64::X2, AArch64::X0);
     loadReg(Instrs[I++], AArch64::X2, AArch64::SP);
     setSystemFlag(Instrs[I++], AArch64::X1);
     createPopRegisters(Instrs[I++], AArch64::X0, AArch64::X1);

result: _init

│ 0x00a00014      020040f9       ldr x2, [x0]                          
│ 0x00a00018      42040091       add x2, x2, 1
│ 0x00a0001c      020000f9       str x2, [x0]

pseudo C (r2ghidra)

*(0 + 0x1090000) = *(0 + 0x1090000) + 1;

yota9 commented 6 months ago

@romanovj Well it is not atomic increment as you understand, so results might be not stable, but might give you good-enough profile anyway :)

romanovj commented 6 months ago

@yota9 should I use exclusive load and store?

yota9 commented 6 months ago

@romanovj if you want accurate profile. Your change uses non-atomic add if multiple threads would increment the same counter the result would be undefined

romanovj commented 6 months ago

@yota9 Looks like there is no simple solution without atomic instructions. Later I will try to optimize software for routers. Many popular routers use a cortex-a53 cores.

llvm / llvm-project

LLVM-BOLT How to prevent BOLT from using armv8.1 instructions on armv8 machine? #86485