panda-re / panda

Platform for Architecture-Neutral Dynamic Analysis
https://panda.re
Other
2.47k stars 474 forks source link

CallstackIsntr: Fix callback types #1447

Open AndrewFasano opened 6 months ago

AndrewFasano commented 6 months ago

1445 incorrectly change the callback types without changing how they were registered. Thanks to @LauraLMann and @MarkMankins for pointing this out! I'm slightly terrified that this sort of mistake doesn't cause a compile error (and that all the CI tests still passed)!

I reverted that PR on dev. This PR now includes both those changes + the fix so we can discuss the issue @MarkMankins has found when the intended change is applied.

MarkMankins commented 6 months ago

This change combined with #1445 introduces an inconsistency in the output of callstack_instr. Running the same recording multiple times can result in different output from callstack_instr...here's an example of what I'm seeing:

$ diff win2k_ssbranch_string_matches.txt win2k_ssbranch_string_matches_expected.txt                                                       1,5c1,5
< 77fa15db 80465691 7c4e87f5 004013e2 004016be 78023734 7800f47c 7800f47c 7800f086 78012e72 78012cfa (asid=0x0e32e000)  1
< 77fa15db 80465691 7c4e87f5 004013e2 004016be 78023734 7800f47c 7800f47c 7800f086 78012e72 78012d0b (asid=0x0e32e000)  1
< 77fa15db 80465691 7c4e87f5 004013e2 004016be 78023734 7800f47c 7800f47c 7800f086 78012e72 78012d3d 7c4fc311 7c4fc415 7c4fc394 (asid=0x0e32e000)  2
< 77fa15db 80465691 7c4e87f5 004013e2 004016be 78023734 7800f47c 7800f47c 7800f086 78012e72 78012d3d 7c4fc311 7c4fc415 7c4fc3bc 77f891d2 80465691 8049d72e 80402408 (asid=0x00000000)  2
< 77fa15db 80465691 7c4e87f5 004013e2 004016be 78023734 7800f47c 7800f47c 7800f086 78012e72 78012d3d 7c4fc311 7c4fc415 7c4fc3bc 77f891d2 80465691 8049d947 80402408 (asid=0x00000000)  2
---
> 77fa15db 80465691 7c4e87f5 004013e2 004016be 78023734 7800f47c 7800f086 78012e72 78012cfa (asid=0x0e32e000)  1
> 77fa15db 80465691 7c4e87f5 004013e2 004016be 78023734 7800f47c 7800f086 78012e72 78012d0b (asid=0x0e32e000)  1
> 77fa15db 80465691 7c4e87f5 004013e2 004016be 78023734 7800f47c 7800f086 78012e72 78012d3d 7c4fc311 7c4fc415 7c4fc394 (asid=0x0e32e000)  2
> 77fa15db 80465691 7c4e87f5 004013e2 004016be 78023734 7800f47c 7800f086 78012e72 78012d3d 7c4fc311 7c4fc415 7c4fc3bc 77f891d2 80465691 8049d72e 80402408 (asid=0x00000000)  2
> 77fa15db 80465691 7c4e87f5 004013e2 004016be 78023734 7800f47c 7800f086 78012e72 78012d3d 7c4fc311 7c4fc415 7c4fc3bc 77f891d2 80465691 8049d947 80402408 (asid=0x00000000)  2
AndrewFasano commented 6 months ago

Oh no, I was expecting a difference with output from callstack_instr before/after these changes, but you're saying it's just producing varying output just running repeatedly with these changes?

MarkMankins commented 6 months ago

Oh no, I was expecting a difference with output from callstack_instr before/after these changes, but you're saying it's just producing varying output just running repeatedly with these changes?

Correct - different consistent output would not have surprised me - but I'm seeing output that varies between different runs of panda replaying the same recording file.

AndrewFasano commented 6 months ago

Yikes, that's no good. I'm going to revert #1445 on dev, and rebase all the commits into this branch/PR to keep this self contained.

@lacraig2 you're the SBE/EBE callback expert, do you have any idea why this might occur? Do EBE callbacks get triggered if a block's execution is interrupted?

lacraig2 commented 6 months ago

@AndrewFasano EBE callbacks do not occur when an exit is requested. SBE/EBE are injected into the TCG stream and only occur when execution reaches that point.

I am skeptical that this is the reason that we're experiencing an issue here. The recordings I've taken have tens of millions of instructions and tend to have tens of exits.

I've done some analysis:

#!/usr/bin/env python3
from sys import argv
from pandare import Panda

# Single arg of arch, defaults to i386
arch = "arm" if len(argv) <= 1 else argv[1]
panda = Panda(generic=arch)
from os.path import exists

recording_name = "rec"+arch
if not exists(f"{recording_name}-rr-snp"):
    print("taking recording")
    @panda.queue_blocking
    def do_stuff():
        panda.revert_sync("root")
        panda.run_monitor_cmd(f"begin_record {recording_name}")
        print(panda.run_serial_cmd("cat /etc/passwd && sleep 2 && cat /etc/passwd && ls -la"))
        panda.run_monitor_cmd("end_record")
        panda.stop_run()
    panda.run()
else:
    print("recording exists. not remaking recording")

bbe = 0
@panda.cb_before_block_exec
def bbed(cpu, tb):
    global bbe
    bbe += 1

abe = 0
num_exit = 0

@panda.cb_after_block_exec
def abed(cpu, tb, ec):
    global num_exit,num_exit_tb_exit
    if ec > 1:
        num_exit += 1

    global abe
    abe += 1

sbe = 0
@panda.cb_start_block_exec
def sbed(cpu, tb):
    global sbe
    sbe += 1

ebe = 0
@panda.cb_end_block_exec
def ebed(cpu, tb):
    global ebe
    ebe += 1

panda.enable_tb_chaining()
panda.run_replay(recording_name)

print(f"{bbe=}, {sbe=}")
print(f"Number missing: {bbe-sbe=}")

print(f"{abe=}, {ebe=}")
print(f"Number missing: {abe-ebe=}")

print(f"Starts vs ends: {sbe-ebe=}, {bbe-abe=}")

print(f"{num_exit=} ")

Which produces results like this:

bbe=3128690, sbe=3128678
Number missing: bbe-sbe=12
abe=3126280, ebe=3126280
Number missing: abe-ebe=0
Starts vs ends: sbe-ebe=2398, bbe-abe=2410
num_exit=12

To me this indicates that the block is checking for the exit status ahead of start_block_exec. The fact that this only occurs 12 times in the entire recording makes me skeptical that this is indeed the issue.

This does appear to be something you can check because even if the block is exited early our after_block_exec callback occurs. This means you could simulate this particular condition by enabling after_block_exec as well and only examining the output if the exit condition is greater than 1.

AndrewFasano commented 6 months ago

Thanks Luke, that's helpful info. I'm glad the EBE callback isn't called when a block is interrupted, that's a nice improvement over ABE!

I can try building some more tests for this PR later, but I'm not sure how this PR could (in itself) cause the non-deterministic behavior unless something it's built on is more fundamentally broken 😨

MarkMankins commented 6 months ago

I tested with the latest code and I'm still seeing issues where callstack_instr outputs different results for the same replay.

AndrewFasano commented 6 months ago

I haven't forgotten about this, just haven't had time to track down the issue. Things are working correctly on the main dev branch, right?

MarkMankins commented 6 months ago

Yes, no issues on dev.

MarkMankins commented 6 months ago

Oh - sorry - didn't realize the recent push was just a rebase - I thought you made some changes - ignore me. :)

lacraig2 commented 6 months ago

I think I know what is going wrong here.

It looks like our implementation of start_block_exec runs prior to the first instruction, but not prior to the initial TCG block elements.

In this case those initial TCG checks mean that if an exit event has occurred we would generate a before_block_exec, but not a start_block_exec.

These are fairly rare, but over the course of millions of blocks it will occur.

I had started testing out a change to start_block_exec that adds functionality earlier in the block, but I ran into some issues with LLVM generation failing. I'll likely follow up with a PR.

AndrewFasano commented 6 months ago

@lacraig2 so if that's the issue it would mean the SBE callback is sometimes (non-deterministically) triggered once when it should be triggered twice? Are we going to have to add an exitCode to the EBE callback to indicate if a block was ended early and then just keep the existing callstack instr logic for filtering out double calls.

lacraig2 commented 6 months ago

I think I've found the issue. It seems like SBE should check panda_exit_loop.

I've implemented this as a separate PR #1471, but I have cherry picked it onto this branch so that @MarkMankins can check it with his tests. I have already confirmed that this fixes the issues for the file you sent me.

Once we confirm that this works I'd like to revert this commit and merge #1471 and then this PR.