Open clementguidi opened 3 years ago
Looks like a good plan. Thank you for writing this up.
To make it easier to proceed, I'd like to suggest removing the dependency on Meson.
Sure, I'm doing this first.
Edit: Actually we don't use meson here. My bad, it was for another PR.
Introduction
We would like to bring safe and concurrent runtime patching to the x86_64 architecture, so that binaries that were not compiled with special options can be instrumented at runtime.
This issue is a place to informally discuss pull request #1274.
We propose a progressive strategy, in which we want to gradually improve the efficiency of our methods.
Previous work
PR #1274 relies on other work. Its most recent commits are based on other pull requests, listed below, which should be accepted first. Once they are merged, the commits that belong to those earlier PRs can be dropped from PR #1274, which will make it smaller.
Client server architecture
The client command discussed in #1269 serves as an entry point for dynamic patching. It provides a way to forward the names of the functions to (un)patch at runtime to a running uftrace instance.
Meson build system
The Meson build system was introduced in #1214. It is independent of this work: this PR uses the Makefile, not Meson.
Patching strategy
Progressive approach
We believe that we should start by implementing simpler methods, whose performance may not be production-ready. Users can thus get familiar with the new features and use them in basic cases.
Once these changes are accepted, we can work on more sophisticated and fine-grained methods that improve performance (coverage, overhead) but require more effort.
See for example NOProbe[1] and Instruction punning[2].
Current implementation
The proposed PR #1274 works as follows. Various hashmaps are used to store the correspondence between original instructions and the location of trampolines.
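To make the bookkeeping concrete, here is a minimal sketch of one such map, from the address of an original instruction to the address of its trampoline. It is not the data structure used in the PR: the names `patch_map`, `patch_map_insert` and `patch_map_lookup`, the fixed table size and the hash function are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAP_SLOTS 64 /* power of two; arbitrary size for this sketch */

/* Hypothetical entry: one original instruction and its trampoline. */
struct patch_entry {
    uintptr_t orig_addr;  /* address of the original instruction, 0 = empty slot */
    uintptr_t trampoline; /* address of the matching trampoline */
};

struct patch_map {
    struct patch_entry slots[MAP_SLOTS];
};

/* Multiplicative hash folded to the table size. */
static size_t slot_of(uintptr_t addr)
{
    return (size_t)((addr * 0x9E3779B97F4A7C15ull) >> 32) & (MAP_SLOTS - 1);
}

/* Insert or update a mapping; returns 0 on success, -1 if the table is full. */
static int patch_map_insert(struct patch_map *map, uintptr_t orig, uintptr_t tramp)
{
    size_t i, idx = slot_of(orig);

    assert(orig != 0); /* 0 is reserved to mark empty slots */
    for (i = 0; i < MAP_SLOTS; i++) {
        struct patch_entry *e = &map->slots[(idx + i) & (MAP_SLOTS - 1)];

        if (e->orig_addr == 0 || e->orig_addr == orig) {
            e->orig_addr = orig;
            e->trampoline = tramp;
            return 0;
        }
    }
    return -1;
}

/* Return the trampoline for 'orig', or 0 if the address is not patched. */
static uintptr_t patch_map_lookup(const struct patch_map *map, uintptr_t orig)
{
    size_t i, idx = slot_of(orig);

    for (i = 0; i < MAP_SLOTS; i++) {
        const struct patch_entry *e = &map->slots[(idx + i) & (MAP_SLOTS - 1)];

        if (e->orig_addr == 0)
            return 0;
        if (e->orig_addr == orig)
            return e->trampoline;
    }
    return 0;
}
```

Linear probing keeps the sketch short. A real implementation would also need removal for unpatching and some form of synchronization if signal handlers read the map while it is being updated.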
Step 1 - Insert temporary int3 trap
First, we insert a 1-byte int3 trap, so that incoming threads are interrupted before they reach the critical section located right after it, which is about to be modified (the patching region).
The signal handler redirects the threads to an out of line execution (OLX) buffer.
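As an illustration of step 1, the sketch below arms a patch site by saving its first byte and overwriting it with the int3 opcode (`0xCC`). The helpers `arm_trap` and `disarm_trap` are hypothetical names, not functions from the PR. The sketch assumes the page holding the site has already been made writable, and it uses the GCC/Clang `__atomic_store_n` builtin for the single-byte store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INT3_OPCODE 0xCC /* x86 one-byte breakpoint instruction */

/*
 * Hypothetical helper: replace the first byte of a patch site with int3.
 * Returns the original byte so it can be restored or relocated later.
 * Assumes 'site' points to writable memory.
 */
static uint8_t arm_trap(uint8_t *site)
{
    uint8_t saved;

    assert(site != NULL);
    saved = site[0];
    /* A single-byte store: other threads see either the old byte or int3. */
    __atomic_store_n(&site[0], INT3_OPCODE, __ATOMIC_SEQ_CST);
    return saved;
}

/* Hypothetical helper: put the saved byte back (abort or unpatch path). */
static void disarm_trap(uint8_t *site, uint8_t saved)
{
    assert(site != NULL);
    __atomic_store_n(&site[0], saved, __ATOMIC_SEQ_CST);
}
```

Only the first byte changes at this point; the bytes after it still hold the original instructions, which is why threads already past the trap must be handled separately.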
Step 2 - Move threads out of the critical section
Then, we make sure that no thread is currently executing code in the patching region. We send a SIGRTMIN+n signal to each thread. The signal handler checks whether the instruction pointer is in any patching region. If it is, the handler redirects the thread to an OLX.
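The check performed by the handler in step 2 can be sketched as a pure function over the interrupted instruction pointer. The `patch_region` layout and the `redirect_ip` name are assumptions for illustration, and the sketch makes one strong simplification: the OLX buffer is assumed to hold the original instructions at the same offsets as the patch site. In a real handler installed with `SA_SIGINFO`, the instruction pointer would be read from and written back to the saved thread context rather than passed as an argument.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one site being patched. */
struct patch_region {
    uintptr_t start; /* address of the int3 trap (first patched byte) */
    size_t len;      /* number of bytes that will be overwritten */
    uintptr_t olx;   /* out of line execution buffer for this site */
};

/*
 * Return the address at which the thread should resume.
 * A thread exactly at 'start' will hit the trap, so only addresses
 * strictly inside the region are moved to the OLX buffer.
 */
static uintptr_t redirect_ip(uintptr_t ip, const struct patch_region *regions,
                             size_t n)
{
    size_t i;

    assert(regions != NULL || n == 0);
    for (i = 0; i < n; i++) {
        const struct patch_region *r = &regions[i];

        if (ip > r->start && ip < r->start + r->len)
            return r->olx + (ip - r->start); /* same offset in the OLX */
    }
    return ip; /* not in any patching region: resume where interrupted */
}
```

Because the function scans every region in one call, it fits a scheme where a single signal per thread covers a whole batch of patch sites.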
Step 3 - Patch the functions
The third step is to patch the now-safe regions with the address for the jump instruction. The int3 trap remains untouched, and will be replaced at the end of the process.
Before doing so, the processors are synchronized (membarrier system call with MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE).
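For step 3, the sketch below writes the operand of a 5-byte `jmp rel32` instruction into bytes 1 to 4 of the site while leaving the int3 at byte 0 in place. It assumes the jump is a near relative jump (opcode `0xE9`), which the text does not state explicitly, and `write_jump_operand` is a hypothetical name. The membarrier synchronization mentioned above is deliberately left out so the example stays self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define JMP_REL32_LEN 5 /* 1 opcode byte + 4 displacement bytes */

/*
 * Hypothetical helper: store the rel32 displacement to 'target' in
 * site[1..4]. site[0] (the int3 trap) is not touched.
 * Returns 0 on success, -1 if the target is out of rel32 range.
 */
static int write_jump_operand(uint8_t *site, uintptr_t target)
{
    int64_t disp;
    uint32_t u;

    assert(site != NULL);
    /* The displacement is relative to the end of the jump instruction. */
    disp = (int64_t)target - (int64_t)((uintptr_t)site + JMP_REL32_LEN);
    if (disp < INT32_MIN || disp > INT32_MAX)
        return -1;

    /* Emit the displacement in little-endian byte order, as x86 expects. */
    u = (uint32_t)(int32_t)disp;
    site[1] = (uint8_t)(u & 0xFF);
    site[2] = (uint8_t)((u >> 8) & 0xFF);
    site[3] = (uint8_t)((u >> 16) & 0xFF);
    site[4] = (uint8_t)((u >> 24) & 0xFF);
    return 0;
}
```

The range check matters in practice: a trampoline allocated more than 2 GiB away from the patched function cannot be reached with this encoding.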
Step 4 - Remove the int3 trap
Finally, the trap is atomically replaced with the first byte of the jump instruction, which activates the redirection to the trampoline.
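Step 4 then comes down to one single-byte store. The sketch below assumes the same `jmp rel32` encoding (opcode `0xE9`) as an illustration; `commit_jump` and `jump_target` are hypothetical helpers, and the store uses the GCC/Clang `__atomic_store_n` builtin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INT3_OPCODE      0xCC
#define JMP_REL32_OPCODE 0xE9
#define JMP_REL32_LEN    5

/* Hypothetical helper: swap the trap for the jump opcode in one store. */
static void commit_jump(uint8_t *site)
{
    assert(site != NULL);
    assert(site[0] == INT3_OPCODE); /* the operand must already be in place */
    __atomic_store_n(&site[0], JMP_REL32_OPCODE, __ATOMIC_SEQ_CST);
}

/* Decode where the committed jump lands, to check the patch. */
static uintptr_t jump_target(const uint8_t *site)
{
    uint32_t u = (uint32_t)site[1] | ((uint32_t)site[2] << 8) |
                 ((uint32_t)site[3] << 16) | ((uint32_t)site[4] << 24);

    /* Sign-extend the 32-bit displacement before adding it. */
    return (uintptr_t)site + JMP_REL32_LEN + (uintptr_t)(intptr_t)(int32_t)u;
}
```

A thread fetching the site therefore sees either the trap or the complete jump, never a jump opcode followed by stale operand bytes.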
Signals
As discussed in https://github.com/namhyung/uftrace/pull/1274#discussion_r652003950, we now only send one SIGRTMIN+n signal to each thread for a batch of (un)patching.
Testing
The current work has yet to be tested in a production environment.
However, the (un)patching mechanism has been stress-tested, by continuously patching and unpatching functions of a multi-threaded program running a loop. No crashes were reported.
Issues
Currently, only x86_64 is covered. We need to disable the new code on other architectures, so users of these platforms won't be misled.
Literature
[1] NOProbe: A Fast Multi-Strategy Probing Technique for x86 Dynamic Binary Instrumentation https://amdls.dorsal.polymtl.ca/files/progressMeetingMay2020_abalboul.pdf
[2] Instruction punning: lightweight instrumentation for x86-64 https://dl.acm.org/doi/10.1145/3062341.3062344