Closed Jake-Shadle closed 7 months ago
Wait, are the processes likely to be children of the minidump-writer
? According to the man page they can strictly only be direct child processes, otherwise ECHILD
is returned.
Oops, I didn't read that far, yah in most cases it will actually be the opposite, the process doing the minidump writing is the parent.
Yeah, my suggestion was because there's only two cases here when using WUNTRACED
: either the processes are in the same process group then the stopped process will indeed stop and waitpid()
will return (which is what we want), or it's not, in which case you get ECHILD
and have to proceed with the regular method of stopping one thread at a time. BTW I don't know how the kernel handles a race where a SIGCONT
is sent after a SIGSTOP
but before the process actually stopped, but I guess it should be handled sanely.
It'd be interesting to try and create a test case for that, but I don't have a clue how one could reliably get into such a situation.
@gabrielesvelto forgive me if I'm missing the detail, but you didn't mention the child relationship in the first case of your comment. As far as I understand, waitpid
cannot be generalized to just the process group, those processes must also still be direct children of the calling process (i.e., waitpid
won't work on process group siblings/ancestors/parents/etc). Is that still a case we think is likely enough to attempt to account for?
Wrt SIGCONT
/SIGSTOP
, I've been assuming that is handled sanely and there is no race. According to https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html:
When any stop signal (SIGSTOP, SIGTSTP, SIGTTIN, SIGTTOU) is generated for a process or thread, all pending SIGCONT signals for that process or any of the threads within that process shall be discarded. Conversely, when SIGCONT is generated for a process or thread, all pending stop signals for that process or any of the threads within that process shall be discarded.
But that doesn't guarantee that the Linux implementation meets that requirement.
@gabrielesvelto forgive me if I'm missing the detail, but you didn't mention the child relationship in the first case of your comment. As far as I understand,
waitpid
cannot be generalized to just the process group, those processes must also still be direct children of the calling process (i.e.,waitpid
won't work on process group siblings/ancestors/parents/etc). Is that still a case we think is likely enough to attempt to account for?
Yeah, waitpid()
's Linux documentation confused me a bit, I was under the impression that siblings in a process group could waitpid(..., WUNTRACED)
each other after a SIGSTOP
. I tested it and that's not the case, so that's settled.
As @Jake-Shadle said the process writing the minidumps is often a parent of the minidumped processes, but not always. We could leverage this to do the following:
SIGSTOP
to the target processwaitpid(..., WUNTRACED)
on the target process, if it succeeds we're good. We don't need to deal with a timeout here, because we know that waitpid()
will return after a certain amount of time (if the target process is a child) or will return immediately with a failer (if it isn't), but we can't get stuck on it
waitpid()
failed with ECHILD
then try polling /proc/<pid>/stat
. Here we do need the timeout as we could be racing against another process sending a SIGCONT
to the target process and we'd never see the status become stoppedThere's another upside to using waitpid()
where possible. In the case of Android we might not have to access to /proc
, but we'd still be able to waitpid()
a child. Given this is a best effort attempt I think it's good to cover all scenarios. Unless there's a downside I'm missing.
Wrt
SIGCONT
/SIGSTOP
, I've been assuming that is handled sanely and there is no race. According to https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html:When any stop signal (SIGSTOP, SIGTSTP, SIGTTIN, SIGTTOU) is generated for a process or thread, all pending SIGCONT signals for that process or any of the threads within that process shall be discarded. Conversely, when SIGCONT is generated for a process or thread, all pending stop signals for that process or any of the threads within that process shall be discarded.
But that doesn't guarantee that the Linux implementation meets that requirement.
I've checked the kernel just in case, it implements this precisely this way. Once SIGCONT
arrives all pending SIGSTOP
signals are cleared from the process' threads and they're all restarted regardless of whether they had already stopped or not.
As mentioned in this PR, we could use
waitpid
first and then fallback to the current implementation of reading the process status from/proc/<pid>/stat
if thewaitpid
fails for some reason. However, we currently have a timeout (defaulting to 100ms) waiting for the state to change, before going ahead with the process dumping, meaning if we wanted to keep a timeout/best effort stopping, we'd need to do something similar to this to be able to timeout the waitpid if it takes too long. In most cases I'd assume the waitpid would succeed/fail quickly and the timeout wouldn't matter, but that's not guaranteed, and the current polling solution, while less efficient, is quite simple.