leahneukirchen / extrace

trace exec() calls system-wide

pwait and multi-threaded processes #7

Open tbetker-rs opened 5 years ago

tbetker-rs commented 5 years ago

Whenever a thread of a multithreaded process terminates, pwait receives a PROC_EVENT_EXIT message where

pwait only looks at process_pid, i.e., it waits until the main thread of the process terminates (TID == PID). There are at least two issues with that, and we have been hit by both of them:

What we would like to see is that pwait exits when the PID is no longer valid. The basic idea would be to look at process_tgid, check the PID, and exit when the PID is gone.
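A minimal sketch of that idea in C, not the actual pwait code: it assumes the event has already been read from the netlink socket into a `struct proc_event` (from `<linux/cn_proc.h>`). The helper names `pid_is_gone` and `should_stop_waiting` are hypothetical, and `kill(pid, 0)` is only one possible validity check, assumed here for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <linux/cn_proc.h>

/* Hypothetical helper: treat the PID as gone only when kill() reports ESRCH.
 * EPERM means the process exists but we may not signal it. */
static int pid_is_gone(pid_t pid)
{
    return kill(pid, 0) == -1 && errno == ESRCH;
}

/* Hypothetical helper: decide whether an event ends the wait for `target`.
 * Matches on process_tgid (the thread group, i.e. the PID) rather than on
 * process_pid (the TID of the thread that exited). */
static int should_stop_waiting(const struct proc_event *ev, pid_t target)
{
    if (ev->what != PROC_EVENT_EXIT)
        return 0;
    if (ev->event_data.exit.process_tgid != target)
        return 0;
    /* Some thread of the target exited; stop only if the PID is really gone. */
    return pid_is_gone(target);
}
```

With this shape, an exiting worker thread of the target still triggers a check, but the wait ends only once the PID itself no longer resolves.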

However, it turns out that there is another problem: The PID may still be valid for some time after the last PROC_EVENT_EXIT. I tested the following conditions to check the PID, and none of them worked reliably (in fact, the error rate is about 50% on our system):

So what I am doing now is starting a timer when a PID checks out as valid, then re-checking up to three times, after 1s, 2s, and 3s. (Usually, the PID is gone after a few ms, but we saw cases where a single 100ms or even 1s timeout would not suffice, probably due to system load.) I can provide my source code if you are interested.
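The re-check workaround could look roughly like the following sketch. This is not the poster's source code, which is not shown in the thread. The names `pid_is_gone`, `sleep_ms`, and `recheck_until_gone` are hypothetical, the `kill(pid, 0)` check is an assumption, and the delays are interpreted here as intervals between consecutive checks.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical check: the PID counts as gone only on ESRCH. */
static int pid_is_gone(pid_t pid)
{
    return kill(pid, 0) == -1 && errno == ESRCH;
}

/* Sleep for `ms` milliseconds, resuming if interrupted by a signal. */
static void sleep_ms(long ms)
{
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    while (nanosleep(&ts, &ts) == -1 && errno == EINTR)
        ;
}

/* Check once, then re-check after each interval.
 * Returns 1 as soon as the PID is gone, 0 if it is still valid
 * after the last re-check. */
static int recheck_until_gone(pid_t pid, const long *intervals_ms, size_t n)
{
    if (pid_is_gone(pid))
        return 1;
    for (size_t i = 0; i < n; i++) {
        sleep_ms(intervals_ms[i]);
        if (pid_is_gone(pid))
            return 1;
    }
    return 0;
}
```

Passing intervals of `{1000, 1000, 1000}` gives re-checks roughly 1s, 2s, and 3s after the first check. A real implementation would more likely arm a timer inside the existing event loop than block in `nanosleep`.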

Yes, this is ugly, but at least it works for us. Let me add that I really liked your idea of implementing pwait by netlink sockets, and that I very much wanted it to work for us. Perhaps you or another user will come up with a better solution, but at the moment, this workaround is all I can suggest.

leahneukirchen commented 5 years ago

The thing with the tgids is a bug, and should be fixed indeed.

I'm not a fan of the timers and would like to avoid hacks like these.

In the future, extrace may use a BPF based interface instead (I haven't checked yet if these are more robust).

leahneukirchen commented 5 years ago

(The obvious problem with your approach is that the PID could become valid again for a different process in the meantime.)
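To illustrate the reuse problem: a PID that resolves again is not necessarily the same process. One way to tell the two apart, not proposed by anyone in this thread and given here only as a sketch, is to record the process start time (field 22 of `/proc/<pid>/stat` on Linux) and compare it on each re-check. The helper names `start_time_of` and `is_same_process` are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Return the start time of `pid` in clock ticks since boot
 * (field 22 of /proc/<pid>/stat), or -1 if it cannot be read. */
static long long start_time_of(pid_t pid)
{
    char path[64], buf[4096];
    snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    size_t n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';

    /* Field 2 (comm) may contain spaces and parentheses,
     * so resume parsing after the last ')'. */
    char *p = strrchr(buf, ')');
    if (!p)
        return -1;
    char *save = NULL;
    char *tok = strtok_r(p + 1, " ", &save);      /* field 3 (state) */
    for (int field = 3; tok && field < 22; field++)
        tok = strtok_r(NULL, " ", &save);         /* advance to field 22 */
    return tok ? strtoll(tok, NULL, 10) : -1;
}

/* 1 if `pid` still exists and has the recorded start time, else 0. */
static int is_same_process(pid_t pid, long long recorded_start)
{
    long long now = start_time_of(pid);
    return now != -1 && now == recorded_start;
}
```

This narrows the window but does not remove every race: the process can still exit between reading the start time and acting on the result.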