Closed 0xkag closed 3 years ago
Applied, thanks!
On Fri 09 Apr 2021, Nicholas Marriott wrote:
Applied, thanks!
Nixpkgs maintainer for fdm here. This seems to be the first “serious” issue to be addressed since 2.0 was released. Is a 2.1 release on the horizon? No worries if not, I am more than happy to apply this patch over in Nixpkgs.
@nicm I think you might want to reopen. After fixing this and running for a few days, I think it uncovered a race between terminating children in the event loop. `wait_children()` sends SIGTERM to children of children of the parent that have terminated, before waiting for all children that might have already terminated. I think. I tried to fix it but haven't been able to, and haven't had the time to dive deeper.
To elaborate: suppose there's a regular child that is related to a delivery child. (The exact process tree isn't quite clear to me; I'm not sure if the child has children or if they're all really just children of the parent, so I'm just saying "related".) The delivery child delivers the message and waits for its exit message from the parent. Meanwhile, the regular child has already been sent, and has received, its exit message. The wait loop and event loop will then try to kill the delivery child, because it's related to the regular child and the regular child has terminated. Phew. Hope that helps describe the issue.
Just want to mention this again. I think fixing this bug has likely uncovered another bug, which I described in the last two updates, and without fixing that, `master` is likely unstable.
I am not sure how this change could make any difference; it is just returning the correct return code when we run out of children. The behaviour when we find a child is the same as it always has been. And if the delivery child has failed, we want to kill any other children. There is no point in them continuing, right?
Hi,
I have encountered the same issue reported by @0xkag after this issue had been closed (i.e., the exit code being intermittently non-zero because child processes are killed prematurely). To work around it, I have been using a binary built with the following patch over the past months without problems. I hope that this will be helpful.
0001-Fix-race-condition-involving-child-EXIT-messages.patch.txt
I have applied this and we can see how it goes.
If a `child_fetch` child exits non-zero (or any child waited on by `wait_children()` in `fdm.c`) this exit status is lost. This is because the `return (retcode)` on fdm.c:262 is unreachable code. The two `return (0)` statements on lines 206 and 209 or the `fatal()` on line 210 will always be what terminates the `for (;;)` loop. (The `fatal()` is not the problem, it's the other two.) The fix is to change lines 206 and 209 to `return (retcode)`. `retcode` starts as 0 and only mutates non-zero (never back to 0) for the duration of the `wait_children()` call.

Losing a child error can result in the apparent delivery of a message, especially when fdm is used as a local delivery agent.
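To illustrate the pattern described above, here is a minimal standalone sketch. It is not fdm's actual code: it assumes a plain `waitpid()` loop rather than fdm's own child bookkeeping, and the names `spawn_exit`, `reap_children` and `run_demo` are hypothetical. The `fixed` flag switches between returning a literal 0 when no children remain (the reported behaviour) and returning the accumulated `retcode` (the proposed fix).

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits at once with `code`. */
pid_t
spawn_exit(int code)
{
	pid_t pid = fork();

	if (pid == 0)
		_exit(code);
	return (pid);
}

/*
 * Reap every child of the calling process.
 * fixed == 0 mimics the reported bug: the "no children left" path
 * returns a literal 0, discarding any failure seen earlier.
 * fixed != 0 returns the accumulated retcode instead.
 */
int
reap_children(int fixed)
{
	int status, retcode = 0;

	for (;;) {
		if (waitpid(-1, &status, 0) == -1) {
			if (errno == EINTR)
				continue;
			if (errno != ECHILD)
				return (1);	/* unexpected waitpid failure */
			/* ECHILD: nothing left to wait for. */
			return (fixed ? retcode : 0);
		}
		/* Only ever moves from 0 to 1, never back. */
		if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
			retcode = 1;
	}
}

/* Hypothetical driver: one child succeeds, one fails with status 3. */
int
run_demo(int fixed)
{
	spawn_exit(0);
	spawn_exit(3);
	return (reap_children(fixed));
}
```

Under these assumptions, `run_demo(0)` returns 0 even though one child failed, while `run_demo(1)` returns 1. The only difference between the two paths is which value the loop's exit point returns.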
I noticed this because I recently switched from procmail to fdm and saw a large message get accepted for delivery but never delivered. The MTA logs showed no delivery deferral, and mail.err showed lines like this:
(Or similar; in reproduction sometimes it's the `deliver-rewrite.c` variant of that message instead of the `fetch-stdin.c` variant cited above, depending upon config.)

And the mail was lost when an error was hit at this stage because the MTA considered it delivered.
Now, in my case, this was due to a very low memory soft limit. But that shouldn't matter: if the message can't be delivered, the LDA must exit non-zero.
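The reason the exit status matters can be sketched from the caller's side. The following is a simplified stand-in for whatever invokes the LDA, not how any particular MTA is implemented; `lda_reported_success` is a hypothetical name, and real MTAs typically distinguish more cases than "zero" and "non-zero". The point is only that a caller looking at the exit status has no other way to learn that delivery failed.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for a caller invoking a delivery agent:
 * run `cmd` through /bin/sh and report whether it claimed success.
 * Returns 1 if the command exited with status 0, otherwise 0.
 */
int
lda_reported_success(const char *cmd)
{
	int status;
	pid_t pid = fork();

	if (pid == -1)
		return (0);
	if (pid == 0) {
		execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
		_exit(127);	/* exec failed */
	}
	while (waitpid(pid, &status, 0) == -1) {
		if (errno != EINTR)
			return (0);
	}
	/* A signal death or any non-zero exit counts as failure. */
	return (WIFEXITED(status) && WEXITSTATUS(status) == 0);
}
```

With this sketch, `lda_reported_success("exit 0")` is 1 and `lda_reported_success("exit 1")` is 0, so an agent that logs an error but still exits 0 is indistinguishable from one that delivered the message.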
To reproduce:
Running this shows that even though there's an error it exits 0:
With this patch:
It exits 1:
With this patch in place, and fdm used as an LDA, delivery of the large mail that was previously dropped is now deferred, which is the desired behaviour.
This is on Linux i686, but I doubt that matters:
And `master` HEAD:

It appears this bug has been present since 2009 according to history.
I did not examine other types of children for exit code propagation issues.