panda-re / panda

Platform for Architecture-Neutral Dynamic Analysis
https://panda.re

Callstack_instr: Is popping up to 10 entries sensible? #1513

Open Coffeeri opened 3 months ago

Coffeeri commented 3 months ago

Hello there,

I am currently using an approach similar to PANDA's, embedding callbacks into the QEMU v8 source code for malware research in my master's thesis. One of my core needs is keeping track of procedures that have been called but not yet returned, which is where a shadow callstack becomes very useful.

My shadow callstack implementation (in C) ran into issues with deciding when to pop and with verifying that a return is correct. My approach was to add boolean variables is_call and is_ret (based on the opcodes parsed in that TB) to the TranslationBlock data structure at translation time.

On a call, I push the pc onto the callstack and hook the return address. When the hook executes, it pops the callstack back to a stored stack index that refers to the earlier pushed pc.
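For illustration, a minimal sketch of that push-and-hook scheme could look something like this (the names, types, and hook mechanism are assumptions for the example, not the actual thesis code):

#include <cstdint>
#include <vector>

struct ShadowFrame {
    uint64_t call_pc;    // pc pushed when the call executes
    uint64_t return_pc;  // expected return address that gets hooked
};

static std::vector<ShadowFrame> shadow_stack;

// Runs when a TB whose is_call flag was set at translation time executes.
void on_call(uint64_t call_pc, uint64_t return_pc) {
    shadow_stack.push_back({call_pc, return_pc});
    // ... install a hook at return_pc and remember shadow_stack.size() - 1 ...
}

// Runs when the hook installed at return_pc fires; idx is the stack index
// remembered when the corresponding call was pushed.
void on_return_hook(size_t idx) {
    // Pop everything from the remembered frame upward.
    if (idx < shadow_stack.size())
        shadow_stack.erase(shadow_stack.begin() + idx, shadow_stack.end());
}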

Coming across your implementation of the callstack_instr plugin, I learned about interrupts! I had completely missed them before, which probably broke my shadow callstack.

What confuses me about your implementation is that, before TB execution (if it was not interrupted), you pop up to 10 entries:

// Search up to 10 down
for (int i = v.size() - 1; i > ((int)(v.size() - 10)) && i >= 0; i--) {
  if (tb->pc == v[i].pc) {
    PPP_RUN_CB(on_ret, cpu, w[i]);
    v.erase(v.begin() + i, v.end());
    w.erase(w.begin() + i, w.end());
    break;
  }
}

My new approach uses an additional, stack-segregated map with the return address (tb->pc + tb->size) as the key and a simple uint as a counter. On every call, I insert the return address with a count of 1, or increment it if it already exists.

Before every TB execution, I check whether the current pc is in that map. If it is, I pop the callstack until the return address is found, then decrement the related count in the map, removing the entry when it reaches zero.
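A rough sketch of that combination (shadow stack plus return-address counter map) could look like the following; the names are placeholders, and the real version would additionally key everything on the stack pointer / ASID to keep the callstacks segregated:

#include <cstdint>
#include <unordered_map>
#include <vector>

struct Frame { uint64_t call_pc; uint64_t return_pc; };

static std::vector<Frame> shadow_stack;
static std::unordered_map<uint64_t, unsigned> pending_rets;  // return addr -> count

void on_call(uint64_t call_pc, uint64_t return_pc) {
    shadow_stack.push_back({call_pc, return_pc});
    pending_rets[return_pc]++;            // insert with count 1 or increment
}

void before_tb_exec(uint64_t tb_pc) {
    auto it = pending_rets.find(tb_pc);
    if (it == pending_rets.end())
        return;                           // pc is not a pending return address

    // Pop until the frame whose recorded return address matches this TB's pc.
    while (!shadow_stack.empty()) {
        Frame f = shadow_stack.back();
        shadow_stack.pop_back();
        // ... run the on_ret callback for f.call_pc here ...
        if (f.return_pc == tb_pc)
            break;
    }

    if (--it->second == 0)                // decrement, remove when it hits zero
        pending_rets.erase(it);
}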

I'm confused and wondering if I'm misunderstanding something. I'd love some feedback on whether my approach is totally off and the current callstack_instr behavior is the sensible one.

AndrewFasano commented 3 months ago

Hi Coffeeri,

It's exciting to hear about efforts to build dynamic analyses on top of QEMU's modern plugin architecture - we've been exploring that direction and think it's a really promising area for future development! We have an issue about moving PANDA functionality to QEMU plugins (#1383), a QEMU fork where we've done some of this work, and a (never merged) patch we sent upstream trying to add support for plugins interacting with one another.

As for your question about callstack_instr, I'm not too sure what's going on in our implementation. The search through the last 10 entries has been there since @moyix wrote it 12 years ago in PANDA 1.0 here. I suspect it's trying to account for false positives: if a value is incorrectly added to the shadow stack, we'd skip past it when we later return to a PC that wasn't at the top of our shadow stack, clean up the shadow stack, and trigger the on_ret callback. But I'm not too confident about that - if you try your implementation with and without support for skipping past the head of the shadow stack on return, maybe you'll see some different behavior?

Also, it's not related to your question, but one hard-learned lesson about callstack_instr is that libcapstone's support for detecting calls and rets, which underpins callstack_instr's detection logic, has improved significantly in more recent versions of the project. I believe libcapstone v3 worked reasonably well for x86 but was quite bad for other architectures; v4 was a bit better for arm/mips, and v5 now seems to handle aarch64 reasonably well too. Depending on which architecture you're testing with, it might be worth making sure that dependency is up to date.
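For anyone following along, one common way to do this kind of detection with capstone is to ask whether an instruction belongs to the generic CS_GRP_CALL or CS_GRP_RET groups. A rough, self-contained sketch (not the plugin's exact code, and hard-coded to x86-64 for brevity):

#include <capstone/capstone.h>
#include <cstdint>

enum class BlockKind { Other, Call, Ret };

// Disassemble a block and classify its last instruction as a call or ret.
BlockKind classify_last_insn(const uint8_t *code, size_t size, uint64_t addr) {
    csh handle;
    if (cs_open(CS_ARCH_X86, CS_MODE_64, &handle) != CS_ERR_OK)
        return BlockKind::Other;
    cs_option(handle, CS_OPT_DETAIL, CS_OPT_ON);   // group info needs detail mode

    cs_insn *insn;
    size_t count = cs_disasm(handle, code, size, addr, 0, &insn);
    BlockKind kind = BlockKind::Other;
    if (count > 0) {
        const cs_insn &last = insn[count - 1];
        if (cs_insn_group(handle, &last, CS_GRP_CALL))
            kind = BlockKind::Call;
        else if (cs_insn_group(handle, &last, CS_GRP_RET))
            kind = BlockKind::Ret;
        cs_free(insn, count);
    }
    cs_close(&handle);
    return kind;
}

How complete those group annotations are for each architecture is roughly what the version differences above come down to.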

Coffeeri commented 3 months ago

Thank you for your answer!

I am actively following your effort to Allow TCG plugins to read memory. Last semester, I implemented memory propagation but didn't dare suggest my hacky solution to the mailing list. In hindsight, my solution was analogous to the proposal, but it couldn't be trivially migrated to the newest major version of QEMU without a time investment I couldn't spare. I hope your proposals and changes get merged.

Regarding your explanation, I assumed it was a way to counter false positives. The suggested additional map would not require this step. If you are going to re-implement the plugin in your new architecture, I'd be happy to take a look and may contribute if help is needed.

My suggestion does not require Capstone as a dependency. Since we have total control of the translation step, translating instructions only to later disassemble them again with Capstone struck me as unnecessary overhead. However, I definitely understand your approach, which aims to be mostly architecture-independent (I am focusing on x86(_64) only) without modifying too many moving parts of the TCG core. I don't have a better generic solution that could be compatible with QEMU's plugin system while leaving the TCG core mostly untouched.
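As a concrete (x86-only, heavily simplified) example of what tagging at translation time can look like without a disassembler, a couple of opcode checks on the instruction bytes the translator has already fetched are enough to set the TB flags. Real x86 decoding (prefixes, far calls, and so on) is of course more involved, and the names below are just placeholders:

#include <cstdint>

struct TBFlags { bool is_call = false; bool is_ret = false; };

// Mark a TB based on the first opcode byte(s) of its final instruction.
void mark_tb(TBFlags &tb, const uint8_t *insn_bytes) {
    switch (insn_bytes[0]) {
    case 0xE8:                     // call rel32
        tb.is_call = true;
        break;
    case 0xC3:                     // ret
    case 0xC2:                     // ret imm16
        tb.is_ret = true;
        break;
    case 0xFF: {                   // call r/m when the ModRM reg field is /2 or /3
        unsigned reg = (insn_bytes[1] >> 3) & 7;
        if (reg == 2 || reg == 3)
            tb.is_call = true;
        break;
    }
    default:
        break;
    }
}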

You may close this issue if no change to the current state of callstack_instr is wanted/required.

lacraig2 commented 3 months ago

@Coffeeri I'd be quite interested in a callstack_instr (or callstack_instr2 following the usual pattern) that uses the existing translation from TCG. I've thought about this at various points, but never quite got a chance to update it.

> I am actively following your effort to Allow TCG plugins to read memory.
>
> Last semester, I implemented memory propagation but didn't dare suggest my hacky solution to the mailing list. In hindsight, my solution was analogous to the proposal, but it couldn't be trivially migrated to the newest major version of QEMU without a time investment I couldn't spare. I hope your proposals and changes get merged.

I appreciate that you're following this issue. I'm pushing on this issue in part because I suspect that there have been many separate implementations of a lot of the core infrastructure that QEMU is lacking in the plugin sphere. I know of at least a handful.

I've found that even if you don't have time to provide code to communities like QEMU, it's still worth your time to provide perspective and opinions based on your experiences.