Closed romanovj closed 6 months ago
Hello. This is instrumentation limitation, we're only supporting it for machines with LSE extension currently.
@yota9 So what are my options for cortex-a53 device? Can i optimize binary/library with profile? If so then I need to use coresight-etm, right? Or maybe there is something else.
@romanovj The only 2 options are:
As for hardware traces - I didn't heard for somebody to use them and there is no BOLT support for them, you would have to translate the profile to the BOLT's fdata file format by your self somehow. Or, of course, you may port instrumentation to use the exclusive memory operations for non-LSE targets... So I believe currently there is no great out-of-the box solution for you :(
@yota9 Well, thank you! I will try to replace LSE instructions.
@romanovj The only 2 options are:
- Instrumentation with LSE.
- Perf without LBR (but it wouldn't have great optimisation results)
As for hardware traces - I didn't heard for somebody to use them and there is no BOLT support for them, you would have to translate the profile to the BOLT's fdata file format by your self somehow. Or, of course, you may port instrumentation to use the exclusive memory operations for non-LSE targets... So I believe currently there is no great out-of-the box solution for you :(
Regarding converting etm trace to fdata: please check this branch https://github.com/lanza/llvm-project/commits/lanzabolt/
my ugly code
diff --git a/bolt/lib/Target/AArch64/AArch64MCPlusBuilder.cpp b/bolt/lib/Target/AArch64/AArch64MCPlusBuilder.cpp
index 0ae9d3668..400840be1 100644
--- a/bolt/lib/Target/AArch64/AArch64MCPlusBuilder.cpp
+++ b/bolt/lib/Target/AArch64/AArch64MCPlusBuilder.cpp
@@ -109,6 +109,17 @@ static void atomicAdd(MCInst &Inst, MCPhysReg RegTo, MCPhysReg RegCnt) {
Inst.addOperand(MCOperand::createReg(RegTo));
}
+static void createIncrement(
+ MCInst &Inst, MCPhysReg Reg, int Size,
+ bool NoFlagsClobber = false /*unused for AArch64*/) {
+ Inst.setOpcode(AArch64::ADDXri);
+ Inst.clear();
+ Inst.addOperand(MCOperand::createReg(Reg));
+ Inst.addOperand(MCOperand::createReg(Reg));
+ Inst.addOperand(MCOperand::createImm(Size));
+ Inst.addOperand(MCOperand::createImm(0));
+ }
+
static void createMovz(MCInst &Inst, MCPhysReg Reg, uint64_t Imm) {
assert(Imm <= UINT16_MAX && "Invalid Imm size");
Inst.clear();
@@ -118,6 +129,7 @@ static void createMovz(MCInst &Inst, MCPhysReg Reg, uint64_t Imm) {
Inst.addOperand(MCOperand::createImm(0));
}
+
static InstructionListType createIncMemory(MCPhysReg RegTo, MCPhysReg RegTmp) {
InstructionListType Insts;
Insts.emplace_back();
@@ -1560,7 +1572,7 @@ public:
createInstrIncMemory(const MCSymbol *Target, MCContext *Ctx, bool IsLeaf,
unsigned CodePointerSize) const override {
unsigned int I = 0;
- InstructionListType Instrs(IsLeaf ? 12 : 10);
+ InstructionListType Instrs(IsLeaf ? 13 : 11);
if (IsLeaf)
createStackPointerIncrement(Instrs[I++], 128);
@@ -1571,10 +1583,13 @@ public:
std::copy(Addr.begin(), Addr.end(), Instrs.begin() + I);
I += Addr.size();
storeReg(Instrs[I++], AArch64::X2, AArch64::SP);
- InstructionListType Insts = createIncMemory(AArch64::X0, AArch64::X2);
- assert(Insts.size() == 2 && "Invalid Insts size");
- std::copy(Insts.begin(), Insts.end(), Instrs.begin() + I);
- I += Insts.size();
+ loadReg(Instrs[I++], AArch64::X2, AArch64::X0);
+// InstructionListType Insts = createIncMemory(AArch64::X0, AArch64::X2);
+ // assert(Insts.size() == 2 && "Invalid Insts size");
+ // std::copy(Insts.begin(), Insts.end(), Instrs.begin() + I);
+ // I += Insts.size();
+ createIncrement(Instrs[I++], AArch64::X2, 1);
+ storeReg(Instrs[I++], AArch64::X2, AArch64::X0);
loadReg(Instrs[I++], AArch64::X2, AArch64::SP);
setSystemFlag(Instrs[I++], AArch64::X1);
createPopRegisters(Instrs[I++], AArch64::X0, AArch64::X1);
result: _init
│ 0x00a00014 020040f9 ldr x2, [x0]
│ 0x00a00018 42040091 add x2, x2, 1
│ 0x00a0001c 020000f9 str x2, [x0]
pseudo C (r2ghidra)
*(0 + 0x1090000) = *(0 + 0x1090000) + 1;
@romanovj Well it is not atomic increment as you understand, so results might be not stable, but might give you good-enough profile anyway :)
@yota9 should I use exclusive load and store?
@romanovj if you want accurate profile. Your change uses non-atomic add if multiple threads would increment the same counter the result would be undefined
@yota9 Looks like there is no simple solution without atomic instructions. Later I will try to optimize software for routers. Many popular routers use a cortex-a53 cores.
I have compiled llvm-bolt in termux glibc environment (android). My kernel is using armv8-a instruction set. I don't know why bolt is using
stadd
instruction when adding instrumentation.