Open dilyanpalauzov opened 6 years ago
The assertion in there is to detect and report conditions where we call dkimf_config_free() improperly, so I'd prefer to leave it in if possible and fix the underlying problem.
Off the top of my head, we could try moving the refcount increment to mlfi_connect(). It looks like it's possible to make this safe by having mlfi_negotiate() use curconf, and only actually copy a reference to that object and increment the refcount in mlfi_connect(). Does this make sense?
The problem happens on Ctrl+C: the xxfi_close() is not called for existing connections, the counter does not decrease and during shutdown opendkim crashes as the assert fails.
It does not matter if the increment is done is xxfi_connect() or xxfi_negotiate(), the trouble comes from skipping the decrement in xxfi_close().
I'm trying to understand how the call to mlfi_close() could get skipped. In libmilter, every thread calls xxfi_close() on shutdown if one is set, unless the thread is already in ST_QUIT state. Their documentation also says: "xxfi_close is always called once at the end of each connection."
...so it should never be skipped. We will only skip the decrement intentionally in mlfi_close() if the private per-connection status pointer is NULL, but if that's NULL then presumably the decrement already happened (or the increment never did).
Still looking.
On 30 Oct 2017 I proposed per email to Sendmail to update the documentation as follows:
diff --git a/docs/xxfi_close.html b/docs/xxfi_close.html
--- a/docs/xxfi_close.html
+++ b/docs/xxfi_close.html
@@ -60,6 +60,13 @@ In particular, it is incorrect to assume the private context pointer
will be something other than NULL in this callback.
<LI>xxfi_close is called on close even if the previous mail
transaction was aborted.
+<LI>xxfi_close is not necessary called after SIGHUP, SIGTERM or SIGINT are sent
+to the signal handler thread established by <A
+href="smfi_main.html">smfi_main</A>, for each call of xxfi_negotiate before
+smfi_main returns. For this to happen the underlying select/poll calls need to exit.
+On some UNIX the poll/select calls exit when libmilter closes the connection
+to the MTA, on Linux select/poll exit when the timeout set by <A
+href="smfi_settimeout.html">smfi_settimeout</A> expires.
<LI>xxfi_close is responsible for freeing any resources allocated on a
per-connection basis.
<LI>Since the connection is already closing, the return value is
and I do not find an answer in my archives.
If I remember correctly on Ctrl+C poll/select() does not return, but waits for a timeout, and xxfi_close() is called only after poll/select() returns.
Compile m1.c with -lpthread -lmilter and start it in a terminal. In another terminal start miltertest by passing this script:
conn = mt.connect("/tmp/zz", 40, 0.25)
mt.conninfo(conn, "aaa.bbb.ccc", "1.2.3.4")
The second terminal prints "miltertest: select(): timeout on fd 3", the first terminal prints " NEGOTIATE NEG 1 NEG 2 ". If now Ctrl+C is pressed in the first terminal, CLOSE is not printed, therefore xxfi_close is not called for each xxfi_negotiate. If Ctrl+C is not pressed CLOSE is printed.
I was told the documentation of libmilter will be updated as follows:
diff --git a/libmilter/docs/overview.html b/libmilter/docs/overview.html
index 78676d3..ff0a69f 100644
--- a/libmilter/docs/overview.html
+++ b/libmilter/docs/overview.html
@@ -203,11 +203,21 @@ communication with the MTA happens.
Filters are not terminated asynchronously
(except by signals that can't be caught).
In the case of <TT>Abort</TT> the
-<A HREF="xxfi_abort.html">xxfi_abort</A> callback is invoked.
+<A HREF="xxfi_abort.html">xxfi_abort</A> callback is usually invoked if there is an active transaction. However, if an invoked callback takes too long to execute (the maximum time <TT>Abort</TT> waits is currently 5s)
+<!-- XREF: MI_CHK_TIME -->
+then the filter is simply terminated, i.e.,
+neither the
+<A HREF="xxfi_abort.html">xxfi_abort</A> callback
+nor the
+<A HREF="xxfi_close.html">xxfi_close</A> callback
+is invoked.
<HR size="1">
<FONT size="-1">
-Copyright (c) 2000, 2001, 2003, 2006 Proofpoint, Inc. and its suppliers.
+Copyright (c) 2000, 2001, 2003, 2006, 2018 Proofpoint, Inc. and its suppliers.
All rights reserved.
<BR>
By using this file, you agree to the terms and conditions set
Wow, that's been waiting 12 years to sting. That means we can leak state because we don't know the session referencing it went away.
Then I guess you're right, we have to stop asserting that this is a closed system.
In fact conf_refcnt should be removed from everywhere, not only from the assert.
I’m running into this with the current 2.11.0-Beta2 tarball. Frequently I’m getting a core dump on shutdown:
opendkim: opendkim.c:5905: dkimf_config_free: Assertion `conf->conf_refcnt == 0' failed.
Aborted
This happens both with Ctrl-C as well as when stopped via systemd (eg systemctl restart
).
Yes, remove all ref_cnt and recompile, or use https://mail.aegee.org/cgit/OpenDKIM/
xxfi_close() is not necessary called, when SIGINT, SIGHUP or SIGTERM are sent to the signal handler thread established by smfi_main(), before smfi_main() returns. For this to happen the underlaying select/poll calls need to return. On some UNIX the poll/select calls exist, when libmilter closes the connection with the MTA, on Linux select/poll exit, when the timeout set by smfi_settimeout() expires.
In turn, when SIGTERM is sent to opendkim, the opened connections are not xxfi_closed() for each previously established xxfi_negotiate(), the reference counter is not decreased to zero and the assert fails.