Open expend20 opened 3 years ago
Hey, thanks for taking the time to look into all of these mechanisms and sharing your ideas. The "put int3s on whole code section" idea has come up several times before (also recently due to problems with ARM64 macOS, specifically libraries in the dyld cache no longer starting/ending on a page boundary). However, as you noted, without a deterministic and performant way to distinguish code from data, it results in a much less clean model than the current approach of using memory protection flags. I am not a fan of using external tooling to get code/data information because (a) I think it would make TinyInst more complicated to use and (b) external tools can (and occasionally do) fail to distinguish code from data correctly. So this is still an open problem.
However, I'd like to point out that using breakpoints is not the only way of solving slowdowns due to entries. Entries basically occur in two ways:
1) calls to exported functions
2) calls to function pointers or virtual methods when objects get passed around between libraries
The first case could be solved by going through the imports of every loaded library and replacing the instrumented library's imports with their translated equivalents (via the TinyInst::GetTranslatedAddress function).
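As a rough illustration of the import-patching idea, here is a toy Python model, not TinyInst code. The import table is modeled as a plain dictionary, `patch_imports` is a hypothetical helper, and the lookup callable merely stands in for TinyInst::GetTranslatedAddress (whose real signature is not shown in this thread). All addresses are made up.

```python
from typing import Callable, Dict, Optional


def patch_imports(
    import_table: Dict[str, int],
    get_translated_address: Callable[[int], Optional[int]],
) -> Dict[str, int]:
    """Return a copy of import_table with translated targets where available."""
    patched = {}
    for symbol, original in import_table.items():
        translated = get_translated_address(original)
        # Keep the original target when no translation is known for it.
        patched[symbol] = translated if translated is not None else original
    return patched


# Toy data: one import points into the instrumented module, one does not.
translations = {0x180001000: 0x1A0000040}
iat = {
    "instrumented.dll!Parse": 0x180001000,
    "kernel32.dll!Sleep": 0x7FFA00002000,
}
patched_iat = patch_imports(iat, translations.get)
```

Only the entry with a known translation changes. Everything else keeps its original target, so calls into modules that are not instrumented are unaffected.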
The second case can be solved e.g. by collecting information about entries during the first run, and then going over the instrumented library's data and replacing all instances of pointers to these with their translated equivalents. In theory this is prone to fail if some data has the same value as the pointer we want to replace, but, especially on 64-bit targets, the chances of that are pretty low.
Sup! Thanks for the answer,
> thanks for taking the time to look into all of these mechanisms and sharing your ideas.
Not at all, I'm currently implementing my own instrumentor in parallel, inspired by your ideas :)
> The first case could be solved by going through the imports of every loaded library and replacing the instrumented library's imports with their translated equivalents (via TinyInst::GetTranslatedAddress function).
Yep, this one is trivial, but then there is the next one:
> The second case can be solved e.g. by collecting information about entries during the first run, and then going over the instrumented library's data and replacing all instances of pointers to these with their translated equivalents. In theory this is prone to fail if some data has the same value as the pointer we want to replace, but, especially on 64-bit targets, the chances of that are pretty low.
This one isn't clear to me.
Let's say we have something like GetProcAddress(<something>). The returned pointer to code could be stored in a register (for an immediate call after the interface query), on the stack, or on the heap. The interface could also be queried multiple times. So searching the whole memory of the process for pointers into the instrumented module would be slow, because you need to enumerate all stack and heap memory, and the search would have to be repeated, not only performed during the first instrumentation.
Let's say we have something like QueryInterface(), which returns a pointer to the vtable. A vtable is actually an array of pointers that lives in the data section and points into the code section. We don't know where it ends, and we can't blindly patch all references from data to code, because in theory they could be something else, even with 64-bit addressing (exception information, perhaps?).
I know this is only speculation right now; in reality it could be easier, considering the assumptions we could make about binaries emitted by standard compilers. But it sounds a bit vague to me.
> This one isn't clear to me. Let's say we have something like GetProcAddress(<something>). The returned pointer to code could be stored in a register (for an immediate call after the interface query), on the stack, or on the heap. The interface could also be queried multiple times. So searching the whole memory of the process for pointers into the instrumented module would be slow, because you need to enumerate all stack and heap memory, and the search would have to be repeated, not only performed during the first instrumentation.
I think this is quite rare; libraries mostly depend on imports instead of manually calling GetProcAddress. But note that GetProcAddress can only find addresses of exported functions (from the export table), so it might be possible to resolve this case by patching the export table. While the export table only contains offsets relative to the image base and not absolute addresses, this could still work if the instrumented code can be allocated within 2GB after the original code.
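To make the offset constraint concrete, here is a small Python sketch of the arithmetic only. It does not parse or modify a real PE file; `translated_export_rva` is a hypothetical helper and the addresses are invented. The 2GB limit is taken directly from the comment above.

```python
def translated_export_rva(image_base: int, translated_address: int) -> int:
    """Return the image-relative offset that resolves to translated_address."""
    rva = translated_address - image_base
    # Limit taken from the discussion above: the instrumented code has to
    # sit within 2 GB after the original image for the offset to be usable.
    if not 0 <= rva < 2**31:
        raise ValueError("translated code is not within 2 GB after the image base")
    return rva


# Toy example: instrumented code allocated 1 GB after the image base.
image_base = 0x180000000
rva = translated_export_rva(image_base, image_base + 0x40000000)
```

If the instrumented code ends up before the image base or too far after it, no valid offset exists and the export entry would have to be left alone.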
> Let's say we have something like QueryInterface(), which returns a pointer to the vtable. A vtable is actually an array of pointers that lives in the data section and points into the code section. We don't know where it ends, and we can't blindly patch all references from data to code, because in theory they could be something else, even with 64-bit addressing (exception information, perhaps?).
We wouldn't blindly patch everything that looks like a pointer. Let's say on one run you observe an entry at 0x7fff12345678. You would then search the instrumented library's data specifically for the value 0x7fff12345678 and replace only that one. It is still possible to make a mistake, though, and if this gets implemented in TinyInst it should be guarded by a flag and never enabled by default, but the chance of error would be much smaller than if we replaced everything that looks like a data->code pointer.
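A minimal Python sketch of this targeted search, operating on an in-memory copy of a data section. It is not how TinyInst implements it: `patch_entry_pointers` is a hypothetical helper, the translated address is made up, and scanning only 8-byte-aligned little-endian slots is an assumption of this sketch.

```python
import struct


def patch_entry_pointers(data: bytearray, entry: int, translated: int) -> int:
    """Replace aligned 8-byte slots equal to entry; return how many were patched."""
    needle = struct.pack("<Q", entry)
    replacement = struct.pack("<Q", translated)
    patched = 0
    # Assumption: pointers are stored little-endian at 8-byte-aligned offsets.
    for offset in range(0, len(data) - 7, 8):
        if data[offset:offset + 8] == needle:
            data[offset:offset + 8] = replacement
            patched += 1
    return patched


entry = 0x7FFF12345678       # entry observed on a previous run
translated = 0x7FFF52345000  # made-up address of the instrumented equivalent
section = bytearray(struct.pack("<QQQ", 0x7FFF12345670, entry, 0x1122334455667788))
count = patch_entry_pointers(section, entry, translated)
```

The neighboring value 0x7FFF12345670 also looks like a pointer into the module but is left untouched, because only exact matches of the observed entry are replaced.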
FYI there is now an implementation of the "search and replace pointers to the previously observed entrypoints with their instrumented equivalents" idea, enabled with the -patch_module_entries flag. It's not perfect but seems to work (at least for my current target, which uses a JIT and where module->JIT->module callbacks caused a large number of entries).
Interesting, let me look :)
Hello @ifratric! I really enjoyed the elegant instrumentation idea behind TinyInst.
However, I was thinking about reducing the slowdown caused by "entries" into the instrumented module, and the first idea that came to my mind was the following.
Why not put int3s over the whole code section, instrument the code as usual, and after that replace particular int3s with jumps?
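To make the proposal concrete, here is a toy Python model on a byte buffer. It is a sketch only: `fill_with_int3` and `write_jump` are hypothetical helpers, the addresses are invented, and it assumes x86 encodings (0xCC for int3, 0xE9 followed by a signed 32-bit displacement for jmp rel32).

```python
import struct

INT3 = 0xCC       # x86 breakpoint opcode
JMP_REL32 = 0xE9  # x86 near jump with a 32-bit relative displacement


def fill_with_int3(size: int) -> bytearray:
    """Model a code section whose every byte was replaced by int3."""
    return bytearray([INT3]) * size


def write_jump(code: bytearray, code_base: int, source: int, target: int) -> None:
    """Overwrite the 5 bytes at address source with a jmp rel32 to target."""
    # The displacement is relative to the end of the 5-byte jump instruction.
    rel = target - (source + 5)
    if not -2**31 <= rel < 2**31:
        raise ValueError("target is out of rel32 range")
    offset = source - code_base
    if offset < 0 or offset + 5 > len(code):
        raise ValueError("jump does not fit inside the code buffer")
    code[offset:offset + 5] = bytes([JMP_REL32]) + struct.pack("<i", rel)


# Toy addresses: code section at 0x1000, translated block at 0x5000.
code = fill_with_int3(16)
write_jump(code, 0x1000, 0x1004, 0x5000)
```

Each patched entry needs five bytes, so the jump overwrites the four bytes after the entry address as well.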
Several issues with this approach immediately arise:
jmp <rel32>. This could be tackled in several ways and seems realistically solvable.
This is the major problem for me, and it is actually my question. Several solutions came to mind:
2.1) It could be solved by taking basic-block information from heavyweight disassemblers like IDA or Ghidra (this is what Mesos does) and placing int3s only at the start of each basic block. This solution works (at least in my tests on regular Microsoft DLLs), but requires an additional dependency.
2.2) Instrument each indirect mov instruction, check whether the data is taken from the code section, and redirect it to the proper data (similar to the current instrumentation of indirect branches). This is actually slow and would be a somewhat complex task to implement.

Am I overlooking anything? Maybe there is some fast code-flow analysis technique to distinguish data from code?
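For option 2.1, the breakpoint placement itself can be sketched as follows. This is a toy Python model, not Mesos or TinyInst code: `place_block_breakpoints` is a hypothetical helper, and the block offsets are assumed to come from an external disassembler export whose format is not specified here.

```python
from typing import Dict, Iterable


def place_block_breakpoints(code: bytearray, block_offsets: Iterable[int]) -> Dict[int, int]:
    """Write int3 at each basic-block start and return the saved original bytes."""
    saved: Dict[int, int] = {}
    for offset in block_offsets:
        if not 0 <= offset < len(code):
            raise ValueError(f"block offset {offset:#x} is outside the section")
        if offset in saved:
            continue  # Already patched; keep the first saved byte.
        saved[offset] = code[offset]
        code[offset] = 0xCC  # x86 int3
    return saved


# Toy section: code at offsets 0-4 and 8-9, with three bytes of data in between.
text = bytearray(b"\x55\x48\x89\xe5\xc3\x00\x00\x00\x90\xc3")
saved = place_block_breakpoints(text, [0, 8])
```

Because only listed block starts are touched, data embedded in the code section stays readable. The correctness of the result still depends entirely on the disassembler's block list.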