Open tbetker-rs opened 5 years ago
The thing with the tgids is a bug, and should indeed be fixed.
I'm not a fan of the timers and would like to avoid hacks like these.
In the future, extrace may use a BPF-based interface instead (I haven't checked yet whether that would be more robust).
(The obvious problem with your approach is that the PID could become valid again for a different process in the meantime.)
Whenever a thread of a multithreaded process terminates, pwait receives a PROC_EVENT_EXIT message. pwait only looks at process_pid, i.e., it waits until the main thread of the process terminates (TID == PID). There are at least two issues with that, and we have been hit by both of them:

1. If the main thread terminates first, pwait exits although the process may still be alive (because other threads are still running).
2. If the main thread has already terminated when pwait is started, and the process exits afterwards, pwait will hang forever (because it will never receive the message it is waiting for).

What we would like to see is that pwait exits when the PID is no longer valid. The basic idea would be to look at process_tgid, check the PID, and exit when the PID is gone.

However, it turns out that there is another problem: the PID may still be valid for some time after the last PROC_EVENT_EXIT. I tested the following conditions to check the PID, and none of them worked reliably (in fact, the error rate is about 50% on our system):
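
As a minimal sketch of the first step of the basic idea (matching exit events on process_tgid rather than only on process_pid), assuming the proc connector structures from <linux/cn_proc.h>. The helper names exit_event_matches and is_main_thread_exit are hypothetical and not taken from pwait:

```c
#include <linux/cn_proc.h>   /* struct proc_event, PROC_EVENT_EXIT */
#include <sys/types.h>       /* pid_t */

/* Hypothetical helper: nonzero if ev reports the exit of ANY thread
 * belonging to the process (thread group) `target`. */
static int exit_event_matches(const struct proc_event *ev, pid_t target)
{
    if (ev->what != PROC_EVENT_EXIT)
        return 0;
    /* process_tgid is the thread group ID, i.e. the PID of the process;
     * process_pid is the ID of the individual thread that exited. */
    return ev->event_data.exit.process_tgid == target;
}

/* Hypothetical helper: nonzero if the exiting thread is the main thread
 * (TID == PID), which is the only case a process_pid comparison catches. */
static int is_main_thread_exit(const struct proc_event *ev)
{
    return ev->what == PROC_EVENT_EXIT &&
           ev->event_data.exit.process_pid == ev->event_data.exit.process_tgid;
}
```

With a filter like this, every thread exit of the watched process triggers a PID check, instead of only the exit of the main thread. It does not by itself address the delay between the last exit event and the PID becoming invalid.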
So what I am doing now is to start a timer when a PID checks out as valid, then re-check three times, after 1s, 2s, and 3s. (Usually, the PID is gone after a few ms, but we saw cases where a single 100ms or even 1s timeout would not suffice, probably due to system load.) I can provide my source code if you are interested.
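
The re-check workaround could be sketched roughly as follows. This is not the source code mentioned above. It assumes that kill(pid, 0) is used as the validity check (one common choice, which may or may not be among the conditions tested) and that the three re-checks happen 1s, 2s, and 3s after the first check. The names pid_exists and pid_gone_with_retries are hypothetical:

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <time.h>

/* Assumed validity check: signal 0 performs error checking only.
 * EPERM means the process exists but we may not signal it.
 * Note that a zombie that has not been reaped still counts as existing. */
static int pid_exists(pid_t pid)
{
    if (kill(pid, 0) == 0)
        return 1;
    return errno == EPERM;
}

/* Check the PID once, then re-check up to `retries` times with `delay_ms`
 * between checks. Returns 1 as soon as the PID is gone, 0 if it is still
 * valid after the last re-check. A signal may cut a sleep short; this
 * sketch does not resume interrupted sleeps. */
static int pid_gone_with_retries(pid_t pid, int retries, long delay_ms)
{
    struct timespec delay = { delay_ms / 1000, (delay_ms % 1000) * 1000000L };

    for (int attempt = 0; ; attempt++) {
        if (!pid_exists(pid))
            return 1;
        if (attempt == retries)
            return 0;
        nanosleep(&delay, NULL);
    }
}
```

Under these assumptions, pid_gone_with_retries(pid, 3, 1000) corresponds to the schedule described above. As with any polling approach, it cannot tell whether the PID has been reused by a different process in between.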
Yes, this is ugly, but at least it works for us. Let me add that I really liked your idea of implementing pwait with netlink sockets, and that I very much wanted it to work for us. Perhaps you or another user will come up with a better solution, but at the moment, this workaround is all I can suggest.