
Contention monitor / wait graph detector #378

Open · goldshtn opened this issue 8 years ago

goldshtn commented 8 years ago

I'd like to propose an idea for a contention monitor / wait graph detector based on eBPF and bcc. I don't have the design all fleshed out yet, and would really appreciate feedback. In general, I would like to monitor various synchronization mechanisms (user-space and kernel-space) and provide useful diagnostic data.

From the performance perspective, we could capture contention information. Whenever a thread blocks on a lock, we could grab a stack trace and record how many times, and for how long, threads waited on that particular lock at that particular call stack.
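For the kernel-space half of this, something like the following bcc sketch could capture that data. It is a minimal illustration, not a production tool: it attaches kprobes to the kernel's mutex_lock() and times every call, contended or not, so the usual overhead caveats apply:

```python
#!/usr/bin/env python
# Sketch: time mutex_lock() calls, aggregated by lock address + call stack.
from bcc import BPF

bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/mutex.h>

struct key_t {
    u64 lock;        // lock address
    int stack_id;    // call stack at acquisition
};
struct val_t {
    u64 count;       // number of waits at this lock+stack
    u64 total_ns;    // total time spent inside mutex_lock()
};
struct entry_t {
    u64 ts;
    u64 lock;
    int stack_id;
};

BPF_HASH(entries, u32, struct entry_t);        // in-flight calls, by tid
BPF_HASH(waits, struct key_t, struct val_t);   // aggregated wait stats
BPF_STACK_TRACE(stacks, 10240);

int lock_enter(struct pt_regs *ctx, struct mutex *lock) {
    u32 tid = bpf_get_current_pid_tgid();
    struct entry_t e = {};
    e.ts = bpf_ktime_get_ns();
    e.lock = (u64)lock;
    e.stack_id = stacks.get_stackid(ctx, 0);   // kernel stack
    entries.update(&tid, &e);
    return 0;
}

int lock_exit(struct pt_regs *ctx) {
    u32 tid = bpf_get_current_pid_tgid();
    struct entry_t *e = entries.lookup(&tid);
    if (e == 0)
        return 0;
    struct key_t key = {};
    key.lock = e->lock;
    key.stack_id = e->stack_id;
    struct val_t zero = {}, *v = waits.lookup_or_try_init(&key, &zero);
    if (v) {
        v->count++;
        v->total_ns += bpf_ktime_get_ns() - e->ts;
    }
    entries.delete(&tid);
    return 0;
}
"""

b = BPF(text=bpf_text)
b.attach_kprobe(event="mutex_lock", fn_name="lock_enter")
b.attach_kretprobe(event="mutex_lock", fn_name="lock_exit")
```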

From a debugging perspective, we could capture lock acquire/release events, and keep track of which lock is owned by which thread (and also which thread is waiting for which lock). Armed with this information, we could construct a wait chain: thread 123 waits for mutex abc owned by thread 456 and so on.
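The chain walk itself could then live in user space. Here is a hypothetical sketch, assuming the BPF side maintains two maps from those acquire/release and block events: waiting_for (tid -> lock address) and owner (lock address -> tid). The names and maps are illustrative, not an existing bcc API:

```python
# Walk the wait chain starting at one thread. A repeated tid means the
# chain loops back on itself, i.e. a deadlock cycle.
def wait_chain(tid, waiting_for, owner):
    chain, seen = [tid], {tid}
    while tid in waiting_for:
        lock = waiting_for[tid]
        chain.append(hex(lock))
        tid = owner.get(lock)
        if tid is None:
            break                     # holder unknown (e.g., acquire was missed)
        chain.append(tid)
        if tid in seen:
            chain.append("<cycle>")   # deadlock detected
            break
        seen.add(tid)
    return chain

# e.g. wait_chain(123, waiting_for, owner) -> [123, '0xabc', 456, ...]
```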

goldshtn commented 8 years ago

@brendangregg What do you think?

brendangregg commented 8 years ago

Yes, we have the opportunity to redo lock analysis.

If you have a copy of Systems Performance, I summarize basic lock analysis on p183, and show the Solaris lockstat(1M) command. (As an aside: lockstat(1M) was the original inspiration for power-of-2 latency histograms as ASCII art.) lockstat didn't associate chains, but did everything else. Events included:

- contention events: threads spinning on or blocking for a lock
- hold events: how long locks were held
- error events

And for each event, it showed a stack trace with a corresponding latency histogram, as well as summary statistics. Great.

Except it was very costly to run, because it traced everything. I often had more success tracing individual mutex events (with times) separately, as I could more easily do that in production.

So be careful with overhead, but you probably already knew that. :)

What you're proposing is associating blocked stacks with held stacks, which is similar to what offwaketime does (and offwaketime should identify these). So can you do it with lower overhead, or with more lock context, or both? Some locks may have symbols, and could be resolved with b.ksym(). As for dynamically allocated locks, naming them after the ctx->ip at creation time (or a partial stack trace) might help.
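For illustration, the user-space half of the earlier sketch could resolve static lock addresses exactly that way; b.ksym() falls back to [unknown] for dynamically allocated locks, which is the gap the naming-by-allocation-IP idea would have to fill:

```python
# Continuing the earlier sketch: dump the aggregated map, resolving lock
# addresses with b.ksym(). Statically allocated kernel locks resolve to a
# symbol; dynamic ones print as [unknown].
for k, v in sorted(b["waits"].items(), key=lambda kv: kv[1].total_ns, reverse=True):
    name = b.ksym(k.lock).decode()
    print("%-32s waits=%-8d total_ms=%.3f" % (name, v.count, v.total_ns / 1e6))
    if k.stack_id >= 0:                 # negative means the stack was lost
        for addr in b["stacks"].walk(k.stack_id):
            print("    %s" % b.ksym(addr).decode())
```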

Also, how exactly are the locks implemented? In other kernels, the lock itself contains the address of the holder, so if you profile blocked events -- and read the contents of the lock -- you know both blocked + holder. Is that possible in Linux? Sounds like one way to tackle overhead.
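As a sketch of that approach on Linux: modern kernels keep the owning task_struct pointer in struct mutex's atomic owner word, with flag bits in the low three bits. That layout is kernel-version and config dependent (older kernels only have a plain owner pointer under certain configs, e.g. CONFIG_MUTEX_SPIN_ON_OWNER), so treat the mask below as an assumption to verify:

```python
#!/usr/bin/env python
# Sketch: on entry to mutex_lock(), read the holder out of the lock itself.
from bcc import BPF

bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/mutex.h>
#include <linux/sched.h>

int lock_enter(struct pt_regs *ctx, struct mutex *lock) {
    u64 owner = 0;
    // Assumption: owner word = task_struct pointer | flag bits (low 3 bits).
    bpf_probe_read_kernel(&owner, sizeof(owner), &lock->owner);
    struct task_struct *task = (struct task_struct *)(owner & ~0x7ULL);
    if (task != 0) {   // nonzero only if somebody already holds the lock
        u32 pid = 0;
        bpf_probe_read_kernel(&pid, sizeof(pid), &task->pid);
        bpf_trace_printk("lock %p held by pid %d\\n", lock, pid);
    }
    return 0;
}
"""

b = BPF(text=bpf_text)
b.attach_kprobe(event="mutex_lock", fn_name="lock_enter")
b.trace_print()
```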

For developing this, it will be golden to have some test cases that can cause a known level of lock contention. It's easy to write a user-level program to do this (and I have some somewhere), but for the kernel it might need some improvisation.
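For the user-level case, even a few Python threads hammering one lock produce a known contended load (in CPython the lock bottoms out in futex waits, so it exercises kernel-side blocking; for probing pthread_mutex_lock itself, the equivalent small C program would be needed):

```python
#!/usr/bin/env python
# Known-contention test load: N threads repeatedly take one lock and hold
# it briefly, forcing the others to block.
import threading, time

NTHREADS = 4
HOLD_SEC = 0.001      # time spent holding the lock per iteration
RUN_SEC = 10          # how long to run the contended load
lock = threading.Lock()
stop = threading.Event()

def worker():
    while not stop.is_set():
        with lock:
            time.sleep(HOLD_SEC)   # everyone else is now waiting

threads = [threading.Thread(target=worker) for _ in range(NTHREADS)]
for t in threads:
    t.start()
time.sleep(RUN_SEC)
stop.set()
for t in threads:
    t.join()
```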