mstange / samply

Command-line sampling profiler for macOS and Linux
Apache License 2.0

Running samply 'in-proc' #158

Open bruno-garcia opened 3 months ago

bruno-garcia commented 3 months ago

After a conversation earlier this week, I wonder if folks familiar with the code base know what it would take to run samply in-process, at a reduced sample rate so that the overhead stays low enough to run in production.

vvuk commented 2 months ago

This is tricky. For macOS, this is doable because samply does its own capture (by suspending threads and grabbing a call stack).

For Linux and Windows, samply relies on system-provided functionality (perf and ETW) for profiler capture. So it's not actually clear what running it "in process" really would mean there, since the in process profile capture piece would just be setting up the system facilities and processing events. But doing that in-proc is nearly identical to doing it out of process.

What you could do though is implement the macOS approach for in-proc usage for all platforms. This would be a stub that knows how to suspend its own process' threads and capture a stack, and then forward it to the rest of samply's machinery for processing and converting into a format that the front end can consume. This wouldn't be a huge amount of effort to get something basic running (this is basically what Firefox does, I believe).

The macOS code in mac/thread_profiler.rs, specifically ThreadProfiler::sample, is what the core of this would look like: capture a stack and add it to the set of unresolved_stacks/samples, which get processed and flushed to a Profile at the end.
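To illustrate the "capture, buffer, flush at the end" shape described above, here is a minimal sketch in Rust. It is not samply's actual code: `SampleBuffer`, `Sample`, `RawStack`, `add_sample` and `flush` are hypothetical names chosen for illustration, and the platform-specific capture step (suspending a thread and walking its stack) is assumed to happen elsewhere.

```rust
use std::collections::HashMap;

/// A raw stack: return addresses, innermost frame first (assumed layout).
type RawStack = Vec<u64>;

/// One recorded sample: when it was taken and which interned stack it uses.
#[derive(Debug, PartialEq)]
struct Sample {
    timestamp_ns: u64,
    stack_index: usize,
}

/// Hypothetical in-process sample buffer. Identical stacks are stored once,
/// and symbol resolution is deferred until after the buffer is flushed.
#[derive(Default)]
struct SampleBuffer {
    unresolved_stacks: Vec<RawStack>,
    stack_indices: HashMap<RawStack, usize>,
    samples: Vec<Sample>,
}

impl SampleBuffer {
    /// Record one captured stack. Capturing it is platform-specific and
    /// not shown here.
    fn add_sample(&mut self, timestamp_ns: u64, stack: RawStack) {
        let next = self.unresolved_stacks.len();
        // Reuse the index of an identical stack if we have seen it before.
        let stack_index = *self.stack_indices.entry(stack.clone()).or_insert(next);
        if stack_index == next {
            self.unresolved_stacks.push(stack);
        }
        self.samples.push(Sample { timestamp_ns, stack_index });
    }

    /// Hand all buffered data to the caller and reset the buffer.
    fn flush(&mut self) -> (Vec<RawStack>, Vec<Sample>) {
        self.stack_indices.clear();
        (
            std::mem::take(&mut self.unresolved_stacks),
            std::mem::take(&mut self.samples),
        )
    }
}
```

A sampler thread would call `add_sample` once per captured stack and call `flush` when profiling stops, passing the result on to whatever resolves addresses and writes the profile.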

mstange commented 2 months ago

What Vlad said is exactly right - for macOS, an in-process implementation would be relatively straightforward, but on Windows and Linux it would be a fully separate implementation.

That said, the Gecko profiler that is built into Firefox is an in-process implementation.

The two hard bits are:

The first one, getting information about the loaded libraries, is tricky because libraries can be unloaded in a racy manner from different threads, and getting library information often involves groveling around the library's loaded bytes, so you must be sure that those bytes don't go away while you're looking. On macOS, Firefox uses _dyld_register_func_for_add_image and _dyld_register_func_for_remove_image. On Linux, Firefox doesn't track newly-added libraries and just takes a snapshot of the libraries that have been loaded by the time the profiler is initialized, using /proc/self/maps. On Windows, Firefox also only gets the list of shared libraries once, and it has to increment their reference count using LoadLibraryExW while getting the library information, and has a hack to skip certain libraries for which incrementing the reference count by 1 isn't enough because of extra unbalanced unloads.

The second one, interrupting threads and getting their state, is done with SuspendThread / GetThreadContext / ResumeThread on Windows, and with a signal handler on Linux - the sampler thread sends a SIGPROF to the sampled thread for each sample. This post by Nikhil has more information.
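For the Linux library snapshot mentioned above, a rough sketch of reading /proc/self/maps once at profiler start-up could look like the following. This is an illustration, not Firefox's or samply's code: `LibraryRange`, `parse_maps` and `snapshot_loaded_libraries` are hypothetical names, and the parsing is simplified (paths containing spaces are cut at the first space, and a trailing " (deleted)" marker is ignored).

```rust
use std::collections::BTreeMap;

/// Address range covered by all mappings of one file (hypothetical type).
#[derive(Debug, PartialEq)]
struct LibraryRange {
    path: String,
    start: u64,
    end: u64,
}

/// Parse the text of /proc/<pid>/maps and merge the mappings of each
/// file-backed object into a single [start, end) range, sorted by path.
fn parse_maps(maps: &str) -> Vec<LibraryRange> {
    let mut ranges: BTreeMap<String, (u64, u64)> = BTreeMap::new();
    for line in maps.lines() {
        // Fields: address-range perms offset dev inode [pathname]
        let mut fields = line.split_whitespace();
        let Some(range) = fields.next() else { continue };
        // Skip perms, offset, dev and inode; anonymous mappings have no path.
        let Some(path) = fields.nth(4) else { continue };
        // Skip pseudo-entries such as [heap], [stack] and [vdso].
        if !path.starts_with('/') {
            continue;
        }
        let Some((start, end)) = range.split_once('-') else { continue };
        let (Ok(start), Ok(end)) =
            (u64::from_str_radix(start, 16), u64::from_str_radix(end, 16))
        else {
            continue;
        };
        let entry = ranges.entry(path.to_string()).or_insert((start, end));
        entry.0 = entry.0.min(start);
        entry.1 = entry.1.max(end);
    }
    ranges
        .into_iter()
        .map(|(path, (start, end))| LibraryRange { path, start, end })
        .collect()
}

/// Take a one-time snapshot of the current process (Linux only).
fn snapshot_loaded_libraries() -> std::io::Result<Vec<LibraryRange>> {
    Ok(parse_maps(&std::fs::read_to_string("/proc/self/maps")?))
}
```

Because this is a one-time snapshot, libraries loaded later with dlopen would be missing, which matches the limitation described for Firefox on Linux.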

roblabla commented 2 months ago

> For Linux and Windows, samply relies on system-provided functionality (perf and ETW) for profiler capture. So it's not actually clear what running it "in process" really would mean there, since the in process profile capture piece would just be setting up the system facilities and processing events. But doing that in-proc is nearly identical to doing it out of process.

Heyo, I'm also interested in "in-proc" functionality. However, I'm mostly interested in a way to profile my binary without having to ship a separate binary for the sampling - I don't actually care if the sampling is truly in-process or through an OS subsystem. Is it possible to use ETW/perf to self-sample, or is there some fundamental reason why that couldn't work?

vvuk commented 2 months ago

> Is it possible to use ETW/perf to self-sample, or is there some fundamental reason why that couldn't work?

It's possible -- I'm not as familiar with perf, but with ETW, any captured etl file with the same providers that samply sets up (see xperf.rs) would work. However, ETW needs an elevated process/admin privileges. I think perf does as well, but there's also a group that users can be added to.
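On the Linux side, whether an unprivileged process may use perf events is governed, as far as I understand the kernel documentation, by the kernel.perf_event_paranoid sysctl. A process could check it before trying to self-sample. The sketch below is an assumption-laden illustration: `PerfAccess`, `classify_paranoid` and `current_perf_access` are hypothetical names, and the treatment of levels above 2 reflects distribution-specific kernel patches rather than mainline behaviour.

```rust
/// Coarse view of what an unprivileged process is allowed to measure.
#[derive(Debug, PartialEq)]
enum PerfAccess {
    /// Level 1 or lower: kernel and user-space measurements are allowed.
    KernelAndUser,
    /// Level 2: only user-space measurements are allowed.
    UserSpaceOnly,
    /// Level 3 or higher (some distributions): likely denied without privileges.
    LikelyDenied,
}

/// Interpret the contents of /proc/sys/kernel/perf_event_paranoid.
/// Returns None if the contents are not an integer.
fn classify_paranoid(contents: &str) -> Option<PerfAccess> {
    let level: i32 = contents.trim().parse().ok()?;
    Some(if level <= 1 {
        PerfAccess::KernelAndUser
    } else if level == 2 {
        PerfAccess::UserSpaceOnly
    } else {
        PerfAccess::LikelyDenied
    })
}

/// Read the current setting; fails on systems without this file.
fn current_perf_access() -> std::io::Result<Option<PerfAccess>> {
    let text = std::fs::read_to_string("/proc/sys/kernel/perf_event_paranoid")?;
    Ok(classify_paranoid(&text))
}
```

For self-sampling a process's own user-space code, `UserSpaceOnly` should be sufficient. Capabilities such as CAP_PERFMON can also grant access, and this sketch does not check for them.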