trusteddomainproject / OpenDKIM

Other
97 stars 52 forks source link

opendkim/opendkim.c:dkimf_config_free(): don't assert conf->refcnt == 0 #22

Open dilyanpalauzov opened 6 years ago

dilyanpalauzov commented 6 years ago

xxfi_close() is not necessary called, when SIGINT, SIGHUP or SIGTERM are sent to the signal handler thread established by smfi_main(), before smfi_main() returns. For this to happen the underlaying select/poll calls need to return. On some UNIX the poll/select calls exist, when libmilter closes the connection with the MTA, on Linux select/poll exit, when the timeout set by smfi_settimeout() expires.

In turn, when SIGTERM is sent to opendkim, the opened connections are not xxfi_closed() for each previously established xxfi_negotiate(), the reference counter is not decreased to zero and the assert fails.

diff --git a/opendkim/opendkim.c b/opendkim/opendkim.c
--- a/opendkim/opendkim.c
+++ b/opendkim/opendkim.c
@@ -5905,7 +5905,6 @@ static void
 dkimf_config_free(struct dkimf_config *conf)
 {
        assert(conf != NULL);
-       assert(conf->conf_refcnt == 0);

        dkimf_zapkey(conf);
mskucherawy commented 6 years ago

The assertion in there is to detect and report conditions where we call dkimf_config_free() improperly, so I'd prefer to leave it in if possible and fix the underlying problem.

Off the top of my head, we could try moving the refcount increment to mlfi_connect(). It looks like it's possible to make this safe by having mlfi_negotiate() use curconf, and only actually copy a reference to that object and increment the refcount in mlfi_connect(). Does this make sense?

dilyanpalauzov commented 6 years ago

The problem happens on Ctrl+C: the xxfi_close() is not called for existing connections, the counter does not decrease and during shutdown opendkim crashes as the assert fails.

It does not matter if the increment is done is xxfi_connect() or xxfi_negotiate(), the trouble comes from skipping the decrement in xxfi_close().

mskucherawy commented 6 years ago

I'm trying to understand how the call to mlfi_close() could get skipped. In libmilter, every thread calls xxfi_close() on shutdown if one is set, unless the thread is already in ST_QUIT state. Their documentation also says: "xxfi_close is always called once at the end of each connection."

...so it should never be skipped. We will only skip the decrement intentionally in mlfi_close() if the private per-connection status pointer is NULL, but if that's NULL then presumably the decrement already happened (or the increment never did).

Still looking.

dilyanpalauzov commented 6 years ago

On 30 Oct 2017 I proposed per email to Sendmail to update the documentation as follows:

diff --git a/docs/xxfi_close.html b/docs/xxfi_close.html
--- a/docs/xxfi_close.html
+++ b/docs/xxfi_close.html
@@ -60,6 +60,13 @@ In particular, it is incorrect to assume the private context pointer
  will be something other than NULL in this callback.
  <LI>xxfi_close is called on close even if the previous mail
  transaction was aborted.
+<LI>xxfi_close is not necessary called after SIGHUP, SIGTERM or SIGINT are sent
+to the signal handler thread established by <A
+href="smfi_main.html">smfi_main</A>, for each call of xxfi_negotiate before
+smfi_main returns.  For this to happen the underlying select/poll calls need to exit.  
+On some UNIX the poll/select calls exit when libmilter closes the connection
+to the MTA, on Linux select/poll exit when the timeout set by <A
+href="smfi_settimeout.html">smfi_settimeout</A> expires.
  <LI>xxfi_close is responsible for freeing any resources allocated on a
  per-connection basis.
  <LI>Since the connection is already closing, the return value is

and I do not find an answer in my archives.

If I remember correctly on Ctrl+C poll/select() does not return, but waits for a timeout, and xxfi_close() is called only after poll/select() returns.

dilyanpalauzov commented 5 years ago

Compile m1.c with -lpthread -lmilter and start it in a terminal. In another terminal start miltertest by passing this script:

conn = mt.connect("/tmp/zz", 40, 0.25)
mt.conninfo(conn, "aaa.bbb.ccc", "1.2.3.4")

The second terminal prints "miltertest: select(): timeout on fd 3", the first terminal prints " NEGOTIATE NEG 1 NEG 2 ". If now Ctrl+C is pressed in the first terminal, CLOSE is not printed, therefore xxfi_close is not called for each xxfi_negotiate. If Ctrl+C is not pressed CLOSE is printed.

dilyanpalauzov commented 5 years ago

I was told the documentation of libmilter will be updated as follows:

diff --git a/libmilter/docs/overview.html b/libmilter/docs/overview.html
index 78676d3..ff0a69f 100644
--- a/libmilter/docs/overview.html
+++ b/libmilter/docs/overview.html
@@ -203,11 +203,21 @@ communication with the MTA happens.
 Filters are not terminated asynchronously
 (except by signals that can't be caught).
 In the case of <TT>Abort</TT> the
-<A HREF="xxfi_abort.html">xxfi_abort</A> callback is invoked.
+<A HREF="xxfi_abort.html">xxfi_abort</A> callback is usually invoked if there is an active transaction.  However, if an invoked callback takes too long to execute (the maximum time <TT>Abort</TT> waits is currently 5s)
+<!-- XREF: MI_CHK_TIME -->
+then the filter is simply terminated, i.e.,
+neither the
+<A HREF="xxfi_abort.html">xxfi_abort</A> callback
+nor the
+<A HREF="xxfi_close.html">xxfi_close</A> callback
+is invoked.

 <HR size="1">
 <FONT size="-1">
-Copyright (c) 2000, 2001, 2003, 2006 Proofpoint, Inc. and its suppliers.
+Copyright (c) 2000, 2001, 2003, 2006, 2018 Proofpoint, Inc. and its suppliers.
 All rights reserved.
 <BR>
 By using this file, you agree to the terms and conditions set
mskucherawy commented 5 years ago

Wow, that's been waiting 12 years to sting. That means we can leak state because we don't know the session referencing it went away.

Then I guess you're right, we have to stop asserting that this is a closed system.

dilyanpalauzov commented 5 years ago

In fact conf_refcnt should be removed from everywhere, not only from the assert.

glts commented 4 years ago

I’m running into this with the current 2.11.0-Beta2 tarball. Frequently I’m getting a core dump on shutdown:

opendkim: opendkim.c:5905: dkimf_config_free: Assertion `conf->conf_refcnt == 0' failed.
Aborted

This happens both with Ctrl-C as well as when stopped via systemd (eg systemctl restart).

dilyanpalauzov commented 4 years ago

Yes, remove all ref_cnt and recompile, or use https://mail.aegee.org/cgit/OpenDKIM/