pwait and multi-threaded processes

tbetker-rs commented 5 years ago

Whenever a thread of a multithreaded process terminates, pwait receives a PROC_EVENT_EXIT message where

process_tgid is the PID (process ID), and
process_pid is the TID (thread ID, as in /proc/PID/task/TID).

pwait only looks at process_pid, i.e., it waits until the main thread of the process terminates (TID == PID). There are at least two issues with that, and we have been hit by both of them:

1. When the main thread terminates, e.g., due to a SIGSEGV, or just by calling pthread_exit(), pwait exits although the process may still be alive (because other threads are still running).
2. When the main thread terminates before pwait is started, and the process exits afterwards, pwait will hang forever (because it will never receive the message it is waiting for).

What we would like to see is that pwait exits when the PID is no longer valid. The basic idea would be to look at process_tgid, check the PID, and exit when the PID is gone.
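To make the process_tgid idea concrete, here is a minimal sketch of the event filter, assuming the struct proc_event layout from <linux/cn_proc.h>. The helper names exit_event_is_for_process and exit_event_is_main_thread are made up for illustration; this is not pwait's actual code.

```c
#include <linux/cn_proc.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical helper: true if this exit event belongs to ANY thread of
 * the watched process, i.e. its thread group ID equals the watched PID. */
bool exit_event_is_for_process(const struct proc_event *ev, pid_t watched)
{
    return ev->what == PROC_EVENT_EXIT
        && ev->event_data.exit.process_tgid == watched;
}

/* Hypothetical helper: true if the exiting thread is the main thread
 * (TID == PID). Matching only on this is the behaviour described above
 * as problematic. */
bool exit_event_is_main_thread(const struct proc_event *ev)
{
    return ev->what == PROC_EVENT_EXIT
        && ev->event_data.exit.process_pid == ev->event_data.exit.process_tgid;
}
```

With a filter like the first one, every thread exit of the watched process triggers a PID check, instead of only the exit of the main thread.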

However, it turns out that there is another problem: The PID may still be valid for some time after the last PROC_EVENT_EXIT. I tested the following conditions to check the PID, and none of them worked reliably (in fact, the error rate is about 50% on our system):

kill(pid, 0) != -1 || errno != ESRCH
getpgid(pid) > 0
access("/proc/PID", F_OK) == 0

So what I am doing now is to start a timer when a PID still checks out as valid, and then re-check it up to three times, after 1 s, 2 s, and 3 s. (Usually, the PID is gone after a few ms, but we saw cases where a single 100 ms or even 1 s timeout would not suffice, probably due to system load.) I can provide my source code if you are interested.
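Roughly, the re-check logic looks like the following sketch. This is not my actual source: the function names and parameters are made up for illustration, it blocks in nanosleep() instead of using a real timer, and it uses the kill(pid, 0) check from the list above.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <time.h>

/* Sleep for the given number of milliseconds. */
static void sleep_ms(long ms)
{
    struct timespec ts = { .tv_sec = ms / 1000,
                           .tv_nsec = (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

/* Hypothetical helper: check the PID once immediately, then up to
 * `rechecks` more times, waiting `interval_ms` before each re-check.
 * Returns true as soon as the PID is reported as gone (ESRCH),
 * false if it is still valid after the last re-check. */
bool pid_gone_after_rechecks(pid_t pid, int rechecks, long interval_ms)
{
    for (int i = 0; i <= rechecks; i++) {
        if (kill(pid, 0) == -1 && errno == ESRCH)
            return true;
        if (i < rechecks)
            sleep_ms(interval_ms);
    }
    return false;
}
```

Calling pid_gone_after_rechecks(pid, 3, 1000) corresponds to the schedule described above: one immediate check, then re-checks after 1 s, 2 s, and 3 s.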

Yes, this is ugly, but at least it works for us. Let me add that I really liked your idea of implementing pwait by netlink sockets, and that I very much wanted it to work for us. Perhaps you or another user will come up with a better solution, but at the moment, this workaround is all I can suggest.

leahneukirchen commented 5 years ago

The thing with the tgids is a bug, and should indeed be fixed.

I'm not a fan of the timers and would like to avoid hacks like these.

In the future, extrace may use a BPF-based interface instead (I haven't checked yet whether it is more robust).

leahneukirchen commented 5 years ago

(The obvious problem with your approach is that the PID could become valid again for a different process in the meantime.)
