fdm eats child_fetch exit codes because of wait_children unreachable code, which can lead to mail data loss

0xkag commented 3 years ago

If a child_fetch child exits non-zero (or any child waited on by wait_children() in fdm.c) this exit status is lost. This is because the return (retcode) on fdm.c:262 is unreachable code. The two return (0) statements on lines 206 and 209 or the fatal() on line 210 will always be what terminates the for (;;) loop. (The fatal() is not the problem, it's the other two.) The fix is to change lines 206 and 209 to return (retcode). retcode starts as 0 and only mutates non-zero (never back to 0) for the duration of the wait_children() call.

Losing a child error can result in the apparent delivery of a message, especially when fdm is used as a local delivery agent.

I noticed this because I recently switched from procmail to fdm and saw a large message get accepted for delivery, was not delivered, and the MTA logs showed no delivery deferral and mail.err showed lines like this:

Apr  5 00:00:24 mymachine fdm[24074]: stdin: failed to resize mail: Cannot allocate memory
Apr  5 00:00:24 mymachine fdm[24074]: stdin: fetching error. aborted

(Or similar; in reproduction sometimes it's the deliver-rewrite.c variant of that message instead of the fetch-stdin.c variant cited above, depending upon config.)

And the mail was lost when an error was hit at this stage because the MTA considered it delivered.

Now, in my case, this was due to a very low memory soft limit. But that shouldn't matter, if it can't be delivered the LDA must exit non-zero.

To reproduce:

% cat genmessage
from email.message import EmailMessage
import smtplib
msg = EmailMessage()
msg.set_content('\n'.join(['a' * 100 for i in range(20000)]))
msg['Subject'] = 'Test'
msg['From'] = 'me@example.com'
msg['To'] = 'me@example.com'
msg.set_content('\n'.join(['a' * 100 for i in range(40000)]))
print(msg)
% python3 genmessage | prlimit --stack=$((2**24)) --data=$((2**24)) --as=$((2**24)) /usr/bin/fdm -vvv -h fetch; echo $?

Running this shows that even though there's an error it exits 0:

...
parent: got message type 2, id 0 from child 3778
parent: sending exit message to child 3778
parent: 2 children, 0 dead children
stdin: fetch loop start
stdin: got message type 2, id 1
stdin: trying (deliver) message 1
stdin: deliver message 1, error
stdin: fetching error. aborted
stdin: 6 file descriptors in use
stdin: 3778: allocated=4526358, freed=4526102, difference=256, peak=4227654
stdin: 3778: mallocs=10420, reallocs=8, frees=10419
stdin: 3778: 1, deliver_rewrite_deliver+0x172: [0x1d0bc08 256: aaaaaaaa]
stdin: 3778: 1 unfreed blocks
stdin: 0 messages processed in 0.542 seconds
stdin: polled for 0.194 seconds (0.194 blocked)
stdin: finished processing. exiting
stdin: sending exit message to parent
stdin: waiting for exit message from parent
parent: got message type 1, id 0 from child 3774
parent: sending exit message to child 3774
parent: 2 children, 0 dead children
parent: child 3778 socket error
parent: 2 children, 0 dead children
stdin: 3 file descriptors in use
stdin: 3774: allocated=284468, freed=284468, difference=0, peak=280964
stdin: 3774: mallocs=92, reallocs=6, frees=92
stdin: 3774: 0 unfreed blocks
parent: waiting for children
parent: child 3774 returned 1
parent: child 3774 died: killing 3778
parent: waiting for children
parent: 1 children, 1 dead children
parent: waiting for children
parent: child 3778 returned 0
parent: waiting for children
parent: waiting for children
parent: finished, total time 0.547 seconds
parent: 3 file descriptors in use
parent: 3773: allocated=104329, freed=104329, difference=0, peak=80876
parent: 3773: mallocs=532, reallocs=64, frees=532
parent: 3773: 0 unfreed blocks
0

With this patch:

diff --git a/fdm.c b/fdm.c
index a35393a..c09edf9 100644
--- a/fdm.c
+++ b/fdm.c
@@ -203,10 +203,10 @@ wait_children(
                /* Wait for a child. */
                switch (pid = waitpid(WAIT_ANY, &status, flags)) {
                case 0:
-                       return (0);
+                       return (retcode);
                case -1:
                        if (errno == ECHILD)
-                               return (0);
+                               return (retcode);
                        fatal("waitpid failed");
                }

@@ -947,5 +947,6 @@ out:
        free_strings(&conf.excl);
 #endif

+       log_debug2("parent: exit code %d", res);
        exit(res);
 }

It exits 1:

...
parent: got message type 2, id 0 from child 3913
parent: sending exit message to child 3913
parent: 2 children, 0 dead children
stdin: fetch loop start
stdin: got message type 2, id 1
stdin: trying (deliver) message 1
stdin: deliver message 1, error
stdin: fetching error. aborted
stdin: 0 messages processed in 0.606 seconds
stdin: finished processing. exiting
stdin: sending exit message to parent
stdin: waiting for exit message from parent
parent: got message type 1, id 0 from child 3912
parent: sending exit message to child 3912
parent: 2 children, 0 dead children
parent: waiting for children
parent: child 3912 returned 1
parent: child 3912 died: killing 3913
parent: waiting for children
parent: 1 children, 1 dead children
parent: 1 children, 1 dead children
parent: child 3913 socket error
parent: 1 children, 1 dead children
parent: waiting for children
parent: child 3913 returned 1
parent: waiting for children
parent: waiting for children
parent: finished, total time 0.610 seconds
parent: exit code 1
1

With this patch in place, and used as a LDA, that large mail that got dropped before gets delivery deferred which is what is desired.

This is on Linux i686, but I doubt that matters:

% uname -a
Linux <redacted> 5.10.13-x86-linode161 #1 SMP Thu Feb 4 13:51:28 EST 2021 i686 GNU/Linux

And master HEAD:

% git rev-parse HEAD
d911b0c1121fbdee527d8786cb68f4959d785cff

It appears this bug has been present since 2009 according to history.

I did not examine other types of children for exit code propagation issues.

nicm commented 3 years ago

Applied, thanks!

ghost commented 3 years ago

On Fri 09 Apr 2021, Nicholas Marriott wrote:

Applied, thanks!

Nixpkg maintainer for fdm here. This seems to be the first “serious” issue to be addressed since 2.0 was released. Is a 2.1 release on the horizon? No worries if not, I am more than happy to apply this patch over in Nixpkg.

0xkag commented 3 years ago

@nicm I think you might want to reopen. After fixing this, and running for a few days, I think it uncovered a race between terminating children in the event loop. wait_children() sends SIGTERM children of children of the parent that have terminated, before waiting for all children that might have already terminated. I think. I tried to fix it but haven't been able to / had the time to dive deeper.

0xkag commented 3 years ago

To elaborate, if there's a regular child that is related (the exact process tree isn't quite clear to me; I'm not sure if the child has children or if they're all really just children of the parent, so I'm just saying "related") to a delivery child, and the delivery child delivers the message, and waits for its "exit message" from the parent, meanwhile the regular child has already been sent and received its message exit, the wait loop & event loop will try to kill the delivery child because it's related to the regular child and the regular child has terminated. Phew. Hope that helps describe the issue.

0xkag commented 3 years ago

Just want to mention this again. I think fixing this bug here has likely uncovered another bug that I described in the last two updates, and without fixing that master is likely unstable.

nicm commented 3 years ago

I am not sure how this change could make any difference, it is just returning the correct return code when we run out of children. The behaviour when we find a child is the same as it always has. And if the delivery child has failed, we want to kill any other children, there is no point in them continuing, right?

m-v-b commented 2 years ago

Hi,

I have encountered the same issue (i.e., exit code being intermittently non-zero due to prematurely killing child processes) reported by @0xkag after this issue had been closed. I have been using a binary built with the following patch without issues in the past months to work around this issue. I hope that this will be helpful.

0001-Fix-race-condition-involving-child-EXIT-messages.patch.txt

nicm commented 2 years ago

I have applied this and we can see how it goes.

nicm / fdm

fdm eats child_fetch exit codes because of wait_children unreachable code, which can lead to mail data loss #94