panda-re / panda

Platform for Architecture-Neutral Dynamic Analysis
https://panda.re

Proposal: PANDA and Non-Deterministic Log Versioning #339

Closed: nathanjackson closed this issue 3 years ago

nathanjackson commented 6 years ago

This issue is to continue discussion from #322 to address the non-deterministic log format and versioning issues.

I only know MsgPack from using applications built on it; I don't have development experience with either BSON or MsgPack. From some brief searching, MsgPack looks like the way to go. See: https://stackoverflow.com/questions/6355497/performant-entity-serialization-bson-vs-messagepack-vs-json. Another plus is that MsgPack has libraries for a ton of different languages, so it would be pretty straightforward to write scripts that process the non-deterministic log if need be. The downside is that MsgPack or BSON will make the non-deterministic logs larger, because a schema-less format has to describe each value in the file itself (type tags, plus field names if entries are stored as maps). So basically we'd trade compactness for flexibility. I suspect that in practice it won't be that big of a deal, especially if the log is compressed.
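To get a rough feel for the size trade-off, here is a minimal sketch using only the Python standard library. The entry layout (`instr_count`, `callsite`, `value`) is hypothetical and is not PANDA's actual log format, and JSON is used purely as a stand-in for a self-describing encoding such as MsgPack or BSON, so the absolute numbers will differ from what MsgPack would produce.

```python
import json
import struct
import zlib

# Hypothetical log entries: (instruction count, callsite id, value).
# This is NOT PANDA's real layout, just an illustration.
ENTRIES = [(i * 1000, i % 4, i * 7) for i in range(1000)]


def encode_fixed(entries):
    # Fixed schema: field order and types live in the reader's code,
    # so each entry costs exactly 8 + 4 + 8 = 20 bytes.
    return b"".join(struct.pack("<QIQ", ic, cs, v) for ic, cs, v in entries)


def encode_self_describing(entries):
    # Every entry carries its own field names (JSON as a stand-in
    # for MsgPack/BSON maps).
    return b"".join(
        json.dumps({"instr_count": ic, "callsite": cs, "value": v}).encode() + b"\n"
        for ic, cs, v in entries
    )


fixed = encode_fixed(ENTRIES)
described = encode_self_describing(ENTRIES)

sizes = {
    "fixed": len(fixed),
    "self_describing": len(described),
    "fixed_zlib": len(zlib.compress(fixed)),
    "self_describing_zlib": len(zlib.compress(described)),
}

for name, size in sizes.items():
    print(f"{name:>22}: {size} bytes")
```

The repeated field names are exactly the kind of redundancy a general-purpose compressor removes, which is why compression should narrow the gap; how much it helps on real recordings would need to be measured.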

Right now, a colleague and I each have a distinct feature that adds callsites to the log format. If we both merged today, we'd invalidate each other's recordings: one of us would have to regenerate our test cases or go through the trouble of writing some sort of converter. A schema-less format would avoid this issue. Versioning would help, but then it becomes a "race" to see who gets theirs merged first.
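A minimal sketch of why tagged, self-describing entries sidestep the merge conflict: each entry names its own kind, so a build that only knows about one feature can still read a log that also contains the other feature's entries. The entry kinds and field names are hypothetical, and JSON lines stand in for MsgPack. Whether an unknown entry can safely be skipped during a real replay depends on whether it affects guest state, so this only illustrates the parsing side.

```python
import json


def write_log(entries):
    # One self-describing entry per line; each carries a "kind" tag.
    return "\n".join(json.dumps(e) for e in entries)


def read_log(log_text, handlers):
    """Dispatch entries by kind; report kinds this build does not know."""
    handled, skipped = [], []
    for line in log_text.splitlines():
        entry = json.loads(line)
        handler = handlers.get(entry["kind"])
        if handler is None:
            skipped.append(entry["kind"])
        else:
            handled.append(handler(entry))
    return handled, skipped


# A recording made by a build that has both (hypothetical) features merged.
log = write_log([
    {"kind": "input", "port": 0x60, "value": 0x1C},
    {"kind": "feature_a_callsite", "addr": 4096},
    {"kind": "feature_b_callsite", "len": 8},
])

# A build that only contains feature A.
handlers_a = {
    "input": lambda e: ("input", e["value"]),
    "feature_a_callsite": lambda e: ("a", e["addr"]),
}

handled, skipped = read_log(log, handlers_a)
print(handled)
print(skipped)
```

With a fixed binary layout, the feature B entry would shift every following offset and the feature A build could not parse anything after it.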

Another nice thing to add as @moyix pointed out would be the ability to add metadata to the log. One use case that I just ran into would be the network plugin. The PCAP format has a timestamp in it and right now every time you run the replay to generate the PCAP file, you get timestamps at the time of replay. By adding timestamp metadata, we could recreate the exact same PCAP file every time with timestamps taken at the time of recording.
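As a sketch of the PCAP case, assuming each packet entry in the log carried a timestamp captured at recording time (the `ts_sec`, `ts_usec`, and `data` field names are hypothetical, not part of the current log format), the replay side could write the classic PCAP file format like this. The header layouts follow the standard libpcap file format; the code is not taken from PANDA's network plugin.

```python
import struct


def pcap_bytes(packets, snaplen=65535, linktype=1):
    """Serialize (ts_sec, ts_usec, payload) tuples as a little-endian PCAP file."""
    # Global header: magic, version 2.4, thiszone, sigfigs, snaplen, link type
    # (1 = Ethernet).
    out = [struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, snaplen, linktype)]
    for ts_sec, ts_usec, payload in packets:
        # Per-packet header: timestamp, captured length, original length.
        out.append(struct.pack("<IIII", ts_sec, ts_usec, len(payload), len(payload)))
        out.append(payload)
    return b"".join(out)


# Hypothetical log entries with recording-time timestamps as metadata.
recorded = [
    {"ts_sec": 1500000000, "ts_usec": 250000, "data": b"\x00" * 14},
    {"ts_sec": 1500000001, "ts_usec": 0, "data": b"\xff" * 20},
]


def replay_to_pcap(entries):
    # Use the recorded timestamps instead of the wall clock at replay time,
    # so every replay yields a byte-identical file.
    return pcap_bytes([(e["ts_sec"], e["ts_usec"], e["data"]) for e in entries])


first = replay_to_pcap(recorded)
second = replay_to_pcap(recorded)
print(first == second, len(first))
```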

I think you have to version PANDA, especially if it's desirable to bring in commits from upstream QEMU on a semi-regular basis. Once PANDA 2.0.0 is "released", we could say it's based on QEMU 2.9.1. With semantic versioning (https://semver.org/), we'd also impose versioning on the PANDA core APIs. One tricky part is that plugins can have their own APIs, but I would say "core" plugins like OSI, taint2, etc. should be versioned under the PANDA core. The problem then lies in deciding what counts as a "core" plugin; some are more obviously "core" than others. Alternatively, the plugins could be split out of mainline PANDA, so that the main PANDA repo contains a solid, stable core that provides record/replay, LLVM translation, and plugin capability.
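To make the semver idea concrete, here is a rough sketch of the check a plugin loader could perform if the core exposed a `MAJOR.MINOR.PATCH` API version. The function names are hypothetical and nothing like this exists in PANDA today. It follows the semver convention that a major bump signals a breaking change, a minor bump adds functionality compatibly, and a patch bump does not change the API.

```python
def parse_version(text):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of ints."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def is_compatible(core, required):
    """True if a plugin built against `required` can load into `core`.

    Assumes semver: same major version, and the core's minor version is
    at least the one the plugin was built against. Patch is ignored
    because patch releases should not change the API.
    """
    core_major, core_minor, _ = parse_version(core)
    req_major, req_minor, _ = parse_version(required)
    return core_major == req_major and core_minor >= req_minor


print(is_compatible("2.1.0", "2.0.0"))  # newer minor, same major
print(is_compatible("3.0.0", "2.9.1"))  # major bump
print(is_compatible("2.0.0", "2.1.0"))  # plugin needs a newer core
```

The same check could apply to the non-deterministic log format if recordings stored the version they were written with.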

moyix commented 6 years ago

I'll add on to this briefly (more comments forthcoming!) that I definitely think timestamps would be helpful – I recently talked to someone who was trying to do experiments that tracked the rate at which various events in a replay happened, measured in time as it passed during recording. Right now the only way to do that is to introspect into the VM and read the guest time as stored by the OS (which may not be accurate).
