static-analysis-engineering / CodeHawk-Binary

CodeHawk Binary Analyzer for malware analysis and general reverse engineering
MIT License
24 stars 10 forks source link

Explicit CFG for ARM predicated instructions #51

Open brk opened 2 years ago

brk commented 2 years ago

Should predicated instructions be explicitly represented with a basic block in the control flow graph?

This source for Euclid's GCD:

int target(int a, int b) {
  while (a != b) {
    if (a > b) {
      a = a - b;
    } else {
      b = b - a;
    }
  }
  return a;
}

becomes this -O2 assembly:

000104f0 <target>:
   104f0:       e1500001    cmp r0, r1
   104f4:       012fff1e    bxeq    lr
   104f8:   /-> e0512000    subs    r2, r1, r0
   104fc:   |   b1a02001    movlt   r2, r1
   10500:   |   b0400001    sublt   r0, r0, r1
   10504:   |   e1500002    cmp r0, r2
   10508:   |   e1a01002    mov r1, r2
   1050c:   \-- 1afffff9    bne 104f8 <target+0x8>
   10510:       e12fff1e    bx  lr

which gets a base CFG with only three nodes: 104f0, 104f8 (which is a self-loop), and 10510. I would naively have expected the bxeq and the mov-sub pair to each get a basic block.

The lifted source I'd like to aim for, eventually, would be (modulo variable names):

int c;
if (a == b) return a;
do {
  c = b - a;
  if (c < 0) {
    c = b;
    a = a - b;
  }
  b = c;
} while (a == c);
return a;

There are, of course, several open questions about how we might represent the provenance for such phantom CFG nodes, and also how to reliably drive patching from edits to the lifted source corresponding to predicated instructions.