panda-re / panda

Platform for Architecture-Neutral Dynamic Analysis
https://panda.re
Other
2.5k stars 480 forks source link

PyPANDA usability #1355

Open AndrewFasano opened 1 year ago

AndrewFasano commented 1 year ago

Looking to start a conversation about making some changes to the PyPANDA APIs and docs to improve usability. These are all subjective design choices so I'm opening an issue for us to discuss instead of starting with a PR.

New function names

Rename some functions to shorter/easier-to-remember names. We could keep the original names for a bit if we made these changes.

Anyone feel strongly about these or have other functions you'd like to see renamed?

Auto-instantiated subclasses

I've also been thinking if we could split up some of our logic into classes the way we've done with panda.arch (with docs here) and then use that as a way to group components of our documentation together. I think this approach makes it easier to search through the documentation because you can first select the thing you know you're interested in (e.g., OSI) and then look through a smaller list of functions. Right now the ctrl-f experience on the main docs page is pretty rough since it also searches all the source code and almost every pypanda function is listed there.

First question - is this worth doing? I think it would be a pretty quick refactor. I'm in favor of it, but curious what others think. If so, here are some ideas of classes we could create. After starting this list I recall we left some big comments splitting panda.py up by groups of functions so we could probably draw some inspiration from those if we wanted to go this route.

Notably I would want us to keep some of the most commonly used functions in the main pandare namespace, though perhaps these could be references to subclass-specific implementations. For example: run_serial_cmd, revert_sync, run, and queue_blocking.

Other things

@MarkMankins - you're a new PyPANDA user so I'd love to hear your thoughts on if these changes would or wouldn't have helped you get up to speed using it. I'll also tag @lacraig2 and @tleek since they're co-designers of the original system and might have more ideas.

tleek commented 1 year ago

Naming functions is difficult. People rarely agree. Maybe we need a style guide and then make a pass over everything to rename.

I prefer naming methods such that you produce, essentially, code that can be read aloud. I also don't love abbreviations since they are often employed haphazardly.

panda.run_serial_cmd() -- that's kinda works: "Panda, run this serial command" -- but do we need to know its via serial port? We are revealing an internal detail for no particular reason.

panda.run_cmd() -- better. But I'd like to add "guest" in there to indicate that this is happening in the guest. I also don't love abbreviations like "cmd"

parnda.run_guest_command() -- that would be my druthers for this. "Panda, run this guest command, please"

tleek commented 1 year ago

As far as "Auto-instantiated subclasses" I think maybe you don't mean auto anything? Just refactoring. Unless I missed the auto part.

I like this idea. I like arch. We could have guest which would contain things that happen in the guest like ... run_command. Ok so that means the above would be panda.guest.run_command() which appeals to me.

MarkMankins commented 1 year ago

The existing documentation was very helpful in getting me up to speed with PyPANDA. It's super easy to look at the docs and see how panda callbacks such as before_block_exec work. All of the examples and tests are also great.

I ran into trouble in a couple of areas.

1) For my first experiment I wanted to see if I could call a function in the stringsearch plugin and register a callback in stringsearch. I struggled with this for a bit, until I ventured outside the generated documentation and looked at some of the examples. Once I realized how to work with the panda.plugins array and panda.ppp decorator I was able to make some progress. I'm not sure how someone could figure out what plugin functions or callbacks were available without looking at the C code. And without the examples I'm not sure I ever would have figured out how to work with a C plugin. If documentation on C plugin methods and callbacks could be generated with stubbed out examples I think that would be really helpful for someone starting with PyPanda - especially if they weren't familiar with the C plugins.

2) Maybe I missed it, but somewhere protobuf==3.13.0 should be listed as a required dependency. Without this all sorts of problems crop up and it isn't super obvious what the problem is. Maybe the version number is something unique to my environment and would be different elsewhere?

lacraig2 commented 1 year ago

@AndrewFasano nice issue.

Addressing the API issue at hand

I think this is worth doing. I like a lot of the subsystems you have identified. I think we could even do something clever like load them on demand.

I totally agree that either dropping or adding warnings to the qemu section would be good.

If we are going to be making major changes I think:

  1. We should try to maintain aliases to these functions for a bit. Even if they do something annoying like say "this function will be deprecated in the next version" I'd like panda.virtual_memory_read to print that than then call panda.mem.virt_read or something.
  2. If we're making major changes we should think big about how we want this API to look and function. I like @tleek's idea of a style guide. I also think some more major API changes could be useful.

The big API issue (IMO): CPUState

In particular, I think we should do something different about CPUState. Currently it's the first argument to every callback and an essential value to pass for many/most functions. I think one of two paths could be good:

Get rid of CPUstate everywhere. This means filtering it from the new callback type and any function that needs it can call panda.get_cpu.

The API here would move from panda.arch.get_arg(cpu, 0) to panda.arch.get_arg(0).

Alternatively, we wrap the cpustate into a Python object that can be interacted with. Rust does this quite well:

let data = cpu.mem_read(buf, len as usize);

but I'd consider also having both mem and arch fall into the CPUState while also maintaining access to the internal CPUState.

Unifying C/C++/Rust documentation with Python

@MarkMankins this is a helpful note. We have had some discussion re:making C internals clearer to Python documentation, but we haven't yet settled on a way to unify that documentation. Ideally, we'd be able to automatically generate it from C/C++/Rust code like we do with Python documentation, but we haven't yet identified a way forward with that.

AndrewFasano commented 1 year ago

I'd definitely want to maintain backwards compatibility for a little bit (as an aside, I have no idea how to estimate how many people would be impacted by pypanda api-breaking changes pushed to dev but I suspect it's just us and now Mark). I think the process for this might look like: 1) Pick a standard / style guide for the API function names 2) Identify the categories of functions we'd want to group together and come up with their names 3) Identify functions we want to drop from the API (if any) 4) Actually make the code changes

As for CPU state - I definitely agree that it's kind of bad as is. I see two things worth thinking about. 1) If we stop passing CPUState around and switch to first_cpu (probably implicitly), we're limiting ourselves to single-core guests - if we move to TCG plugins and qemu 7 we'll need to add the CPU pointers back. If we're planning to eventually get there, maybe it's worth leaving this as is for now. 2) If we're going to make this change, I'd suggest we do it in the C/C++ APIs (and then in the python/rust interfaces to the APIs) instead of just doing it in python.

For unifying docs, did @tleek's branch with kerneldoc-style comments and the infrastructure to generate docs from that get merged? Also I think it would be worth explicitly explaining how pyplugins can interact with C/C++ plugins in the pypanda docs.

AndrewFasano commented 1 year ago

Also, regarding Mark's comment

Maybe I missed it, but somewhere protobuf==3.13.0 should be listed as a required dependency

I'm not sure if that's actually the case and/or how to improve things. I've noticed that the python protobuf package needs to match the system version and on my Ubuntu 22.04 machine, the latest version I can get from pip is way ahead of the latest version from apt.

After 0d98fa49b723a1f623b18de1bbd27d025a208200 (from #1327) protobuf hopefully shouldn't be required to use PyPANDA, unless you're actually trying to do things with protobuf files. You can also set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python in your environment to fix issues with apt/python versions of the package being different.

Overall I'm still fairly confused by what's going on with protobuf versions - not sure if we could/should mandate a specific version as a dependency or if there's something we can do that's better than the current state of things.

lacraig2 commented 1 year ago

If we stop passing CPUState around and switch to first_cpu (probably implicitly), we're limiting ourselves to single-core guests - if we move to TCG plugins and qemu 7 we'll need to add the CPU pointers back. If we're planning to eventually get there, maybe it's worth leaving this as is for now.

I don't think this would necessarily be incompatible with a multi-cpu system. panda_get_cpu doesn't have to return first_cpu. It could return the currently active CPU whichever that might be for the current thread.

Having said that I am starting to come around to option #2 that I talked about above. Where the CPU is a Python object that has methods you can act upon. As opposed to panda.virtual_memory_read you'd get a CPUState and do cpu.virtual_memory_read.

AndrewFasano commented 1 year ago

Oh I think I like that model too!

MarkMankins commented 1 year ago

Maybe it's just the tests that require protobuf. Without protobuf some of the tests fail...

Traceback (most recent call last): File "file_hook.py", line 5, in from pandare import * AttributeError: module 'pandare' has no attribute 'PLogReader'

MarkMankins commented 1 year ago

It would be nice if taint2 would allow one to label a virtual memory address instead of a "RamOffset".

More of a panda problem than a PyPanda problem, but this detail makes things harder in PyPanda than they need to be.

Edit: For reasons I don't understand PyPanda passes physical addresses to taint2 apis and the C plugins pass "ram offsets". PyPanda can't duplicate what the C plugins do because the function to convert physical addresses to ram offsets isn't exposed to PyPanda. But it seems like this works?

lacraig2 commented 1 year ago

@MarkMankins which function does this conversion? It would be fairly easy to add.

MarkMankins commented 1 year ago

PandaPhysicalAddressToRamOffset().

It's defined in common.h: https://github.com/panda-re/panda/blob/dev/panda/include/panda/common.h.

See tstringsearch.cpp for an example of how it is used.