clementguidi commented 3 years ago

Introduction

We would like to add safe, concurrent runtime patching for the x86_64 architecture, so that binaries that were not compiled with special options can be instrumented at runtime.

This issue is a place for informal discussion of pull request #1274.

We propose a progressive strategy in which we gradually improve the efficiency of our methods.

Previous work

PR #1274 depends on other work: its most recent commits build on other pull requests, listed below, which should be accepted first. Once those are merged, their commits can be stripped from PR #1274, making it considerably smaller.

Client server architecture

The client command discussed in #1269 serves as the entry point for dynamic patching. It forwards the names of the functions to (un)patch at runtime to a running uftrace instance.

[x] Issue #1330 needs to be solved first

Meson build system

~~The Meson build system was introduced in #1214. It is actually independent from this work, but is currently used to compile the new code. Makefile can be supported too.~~ This PR uses the Makefile, not Meson.

Patching strategy

Progressive approach

We believe we should start by implementing simpler methods, whose performance may not yet be production-ready. Users can then get familiar with the new features and use them in basic cases.

Once these changes are accepted, we can work on more sophisticated, fine-grained methods that improve performance (coverage, overhead) but require more effort.

See, for example, NOProbe [1] and instruction punning [2].

Current implementation

The proposed PR #1274 works as follows. Several hash maps store the correspondence between the original instructions and the locations of their trampolines.
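To make the bookkeeping concrete, here is a minimal sketch of one such map: a fixed-size, open-addressing table from a patch-site address to its trampoline address. The names `patch_entry`, `patch_map_put`, `patch_map_get` and the capacity are hypothetical and do not reflect the actual data structures in PR #1274.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PATCH_MAP_SIZE 64 /* hypothetical capacity; must be a power of two */

/* One mapping: patched instruction address -> trampoline address. */
struct patch_entry {
	uintptr_t site;       /* address of the original instruction (0 = empty) */
	uintptr_t trampoline; /* where execution is redirected */
};

static struct patch_entry patch_map[PATCH_MAP_SIZE];

/* Multiplicative hash, so aligned addresses still spread across slots. */
static size_t patch_hash(uintptr_t addr)
{
	return (size_t)(((unsigned long long)addr * 0x9E3779B97F4A7C15ULL) >> 32) &
	       (PATCH_MAP_SIZE - 1);
}

/* Inserts or updates a mapping; returns -1 if the table is full. */
static int patch_map_put(uintptr_t site, uintptr_t trampoline)
{
	size_t i = patch_hash(site);

	for (size_t n = 0; n < PATCH_MAP_SIZE; n++, i = (i + 1) & (PATCH_MAP_SIZE - 1)) {
		if (patch_map[i].site == 0 || patch_map[i].site == site) {
			patch_map[i].site = site;
			patch_map[i].trampoline = trampoline;
			return 0;
		}
	}
	return -1;
}

/* Returns the trampoline for site, or 0 if site is not patched. */
static uintptr_t patch_map_get(uintptr_t site)
{
	size_t i = patch_hash(site);

	for (size_t n = 0; n < PATCH_MAP_SIZE; n++, i = (i + 1) & (PATCH_MAP_SIZE - 1)) {
		if (patch_map[i].site == site)
			return patch_map[i].trampoline;
		if (patch_map[i].site == 0)
			return 0; /* an empty slot ends the probe sequence */
	}
	return 0;
}
```

The sketch has no deletion; unpatching would need tombstones (or a different structure) so that later entries in a probe sequence stay reachable.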

Step 1 - Insert temporary int3 trap

First, we insert a 1-byte int3 trap so that incoming threads are interrupted before reaching the critical section that follows it and is about to be modified (the patching region).

The SIGTRAP handler redirects these threads to an out-of-line execution (OLX) buffer.
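The following self-contained sketch shows the trap-and-redirect mechanism on Linux/x86_64. It builds a tiny function in an anonymous page, overwrites its first byte with int3 (0xCC), and installs a SIGTRAP handler that moves RIP to a stand-in OLX buffer. Everything here (`patch_site`, `olx_buf`, `demo_int3_redirect`) is hypothetical. In uftrace the OLX buffer would hold a copy of the displaced original instructions followed by a jump back; here it simply returns 42 so the redirection is observable.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <ucontext.h>
#include <unistd.h>

static unsigned char *patch_site; /* hypothetical function being patched */
static unsigned char *olx_buf;    /* hypothetical OLX buffer */

/* After int3 executes, the saved RIP points one byte past the trap. */
static void trap_handler(int sig, siginfo_t *info, void *ctx)
{
	ucontext_t *uc = ctx;
	unsigned char *ip = (unsigned char *)(uintptr_t)uc->uc_mcontext.gregs[REG_RIP];

	(void)sig;
	(void)info;
	if (ip - 1 == patch_site)
		uc->uc_mcontext.gregs[REG_RIP] = (greg_t)(uintptr_t)olx_buf;
}

/* Traps a tiny function at its entry, calls it, and returns its result. */
static int demo_int3_redirect(void)
{
	static const unsigned char orig[] = { 0xB8, 0x01, 0, 0, 0, 0xC3 }; /* mov eax,1; ret */
	static const unsigned char olx[] = { 0xB8, 0x2A, 0, 0, 0, 0xC3 };  /* mov eax,42; ret */
	size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
	unsigned char *page = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
				   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct sigaction sa;
	int ret;

	if (page == MAP_FAILED)
		return -1;
	patch_site = page;
	olx_buf = page + 64;
	memcpy(patch_site, orig, sizeof(orig));
	memcpy(olx_buf, olx, sizeof(olx));
	patch_site[0] = 0xCC; /* Step 1: 1-byte int3 over the first instruction */

	if (mprotect(page, pagesz, PROT_READ | PROT_EXEC) != 0) {
		munmap(page, pagesz);
		return -1;
	}

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = trap_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGTRAP, &sa, NULL);

	ret = ((int (*)(void))(uintptr_t)patch_site)(); /* hits int3, resumes in OLX */
	munmap(page, pagesz);
	return ret;
}
```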

Step 2 - Move threads out of the critical section

Then, we make sure that no thread is currently executing code in a patching region. We send a SIGRTMIN+n signal to each thread. The signal handler checks whether the thread's instruction pointer lies in any patching region and, if it does, redirects it to the corresponding OLX buffer.
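The check performed by that handler can be sketched as follows. The sketch assumes each OLX buffer holds a verbatim copy of its region at the same offsets, so an instruction pointer found inside a region maps to the same offset in the copy; real code must also relocate RIP-relative instructions and jump back after the region. The names `patch_region`, `move_out_of_regions`, `move_out_handler` and `install_move_out_handler` are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <ucontext.h>

/* Hypothetical description of one region in the current batch. */
struct patch_region {
	uintptr_t start; /* first byte of the patching region */
	size_t len;      /* region length in bytes */
	uintptr_t olx;   /* OLX copy of the region (same layout assumed) */
};

static struct patch_region *regions; /* regions of the current batch */
static size_t nr_regions;

/* Returns where the thread should resume: unchanged unless inside a region. */
static uintptr_t move_out_of_regions(uintptr_t ip)
{
	for (size_t i = 0; i < nr_regions; i++) {
		if (ip >= regions[i].start && ip < regions[i].start + regions[i].len)
			return regions[i].olx + (ip - regions[i].start);
	}
	return ip;
}

/* SIGRTMIN+n handler: only threads caught inside a region are moved. */
static void move_out_handler(int sig, siginfo_t *info, void *ctx)
{
	ucontext_t *uc = ctx;
	uintptr_t ip = (uintptr_t)uc->uc_mcontext.gregs[REG_RIP];

	(void)sig;
	(void)info;
	uc->uc_mcontext.gregs[REG_RIP] = (greg_t)move_out_of_regions(ip);
}

/* Installs the handler for signo (for example SIGRTMIN + 1). */
static int install_move_out_handler(int signo)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = move_out_handler;
	sa.sa_flags = SA_SIGINFO | SA_RESTART;
	sigemptyset(&sa.sa_mask);
	return sigaction(signo, &sa, NULL);
}
```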

Step 3 - Patch the functions

The third step is to write the remainder of the jump instruction, that is, the bytes following the int3 that encode the jump target, into the now-safe regions. The int3 trap itself stays in place and is replaced at the end of the process.

Before doing so, the processors are synchronized (membarrier system call with MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE).
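A minimal sketch of that synchronization, calling membarrier(2) through syscall(2) as its man page shows. A process must register for the SYNC_CORE command before using it; registering lazily on first use, and the names `sys_membarrier` and `sync_all_cores`, are assumptions of this sketch rather than what PR #1274 does.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the raw system call. */
static int sys_membarrier(int cmd, unsigned int flags)
{
	return (int)syscall(__NR_membarrier, cmd, flags);
}

/*
 * On success, every running thread of this process has executed a
 * core-serializing instruction. Returns -1 (errno set) if the kernel does
 * not support the command (added in Linux 4.16).
 */
static int sync_all_cores(void)
{
	static int registered;

	if (!registered) {
		if (sys_membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0) < 0)
			return -1;
		registered = 1;
	}
	return sys_membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0);
}
```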

Step 4 - Remove the int3 trap

Finally, the trap is atomically replaced by the first byte of the jump instruction, which redirects execution to the trampoline.
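Steps 3 and 4 can be illustrated with a 5-byte `jmp rel32` (opcode 0xE9), assuming that is the jump encoding used: the 4 displacement bytes are written behind the int3 first, then the int3 is swapped for the opcode with a single 1-byte store. `finish_jmp_patch` is a hypothetical helper; real code must also make the page writable and synchronize cores as described above. It operates on a plain buffer here so it can run anywhere.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define JMP_REL32_OPCODE 0xE9
#define JMP_REL32_LEN 5

/*
 * site[0] is expected to hold the int3 (0xCC). Returns -1 without touching
 * site if target is beyond the +/-2 GiB reach of a rel32 displacement.
 */
static int finish_jmp_patch(unsigned char *site, uintptr_t target)
{
	/* rel32 is relative to the end of the 5-byte instruction */
	int64_t disp = (int64_t)target - (int64_t)((uintptr_t)site + JMP_REL32_LEN);
	int32_t rel;

	if (disp < INT32_MIN || disp > INT32_MAX)
		return -1;
	rel = (int32_t)disp;

	/* Step 3: displacement bytes; threads cannot reach them past the int3 */
	memcpy(site + 1, &rel, sizeof(rel)); /* x86 is little-endian */

	/* Step 4: replace the int3 with the jmp opcode in one atomic byte store */
	__atomic_store_n(&site[0], (unsigned char)JMP_REL32_OPCODE, __ATOMIC_RELEASE);
	return 0;
}
```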

Signals

As discussed in https://github.com/namhyung/uftrace/pull/1274#discussion_r652003950, we now send only one SIGRTMIN+n signal to each thread per batch of (un)patching operations.
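The batching idea (one signal per thread for the whole batch, not one per function) is sketched below with POSIX threads. It assumes SIGRTMIN+1 as the signal and uses demo worker threads; a real tracer has to enumerate the target's threads itself, and its handler would perform the region check rather than just acknowledging. `signal_batch`, `worker` and `NR_WORKERS` are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>
#include <time.h>

#define NR_WORKERS 4 /* hypothetical number of traced threads */

static atomic_int acks; /* threads that handled the signal for this batch */
static atomic_int stop; /* tells the demo workers to exit */
static const struct timespec tick = { 0, 1000000 }; /* 1 ms */

/* The region check would go here; the sketch only acknowledges. */
static void batch_handler(int sig)
{
	(void)sig;
	atomic_fetch_add(&acks, 1);
}

static void *worker(void *arg)
{
	(void)arg;
	while (!atomic_load(&stop))
		nanosleep(&tick, NULL); /* stand-in for traced code */
	return NULL;
}

/* Signals each thread once for the batch; returns the number of acks. */
static int signal_batch(void)
{
	pthread_t tids[NR_WORKERS];
	struct sigaction sa;
	int sig = SIGRTMIN + 1;
	int created = 0;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = batch_handler;
	sigemptyset(&sa.sa_mask);
	if (sigaction(sig, &sa, NULL) != 0)
		return -1;

	atomic_store(&acks, 0);
	atomic_store(&stop, 0);
	while (created < NR_WORKERS &&
	       pthread_create(&tids[created], NULL, worker, NULL) == 0)
		created++;

	for (int i = 0; i < created; i++)
		pthread_kill(tids[i], sig); /* exactly one signal per thread */

	while (atomic_load(&acks) < created)
		nanosleep(&tick, NULL);

	atomic_store(&stop, 1);
	for (int i = 0; i < created; i++)
		pthread_join(tids[i], NULL);
	return atomic_load(&acks);
}
```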

Testing

The current work has yet to be tested in a production environment.

However, the (un)patching mechanism has been stress-tested by continuously patching and unpatching the functions of a multi-threaded program running in a loop. No crashes were observed.

Issues

Currently, only x86_64 is covered. We need to disable the new code on other architectures, so users of these platforms won't be misled.
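One possible way to do this at the source level is a compile-time guard that makes the feature fail explicitly on other architectures instead of silently doing nothing. The sketch below is hypothetical (`dynpatch_init`, `DYNPATCH_SUPPORTED`) and is not how uftrace currently structures this code; the same effect could also be achieved in the Makefile.

```c
#include <assert.h>
#include <errno.h>

#if defined(__x86_64__)
#define DYNPATCH_SUPPORTED 1

/* x86_64: the int3/signal/membarrier machinery would be set up here. */
static int dynpatch_init(void)
{
	return 0;
}
#else
#define DYNPATCH_SUPPORTED 0

/* Other architectures: report clearly that runtime patching is unavailable. */
static int dynpatch_init(void)
{
	return -ENOTSUP;
}
#endif
```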

Literature

[1] NOProbe: A Fast Multi-Strategy Probing Technique for x86 Dynamic Binary Instrumentation. https://amdls.dorsal.polymtl.ca/files/progressMeetingMay2020_abalboul.pdf
[2] Instruction punning: lightweight instrumentation for x86-64. https://dl.acm.org/doi/10.1145/3062341.3062344

namhyung commented 3 years ago

Looks like a good plan. Thank you for writing this up.

To make it easier to proceed, I'd suggest removing the dependency on Meson.