trusteddomainproject / OpenDKIM

Other
93 stars 51 forks source link

Large number of file descriptors used in RHEL 9 #198

Open razorbladex401 opened 7 months ago

razorbladex401 commented 7 months ago

Currently experiencing an interesting issue where opendkim is exhausting all its file descriptors leading to postfix milter reject errors. This recently started happening after having the systems running for several months in RHEL 9. From what I've been able to tell there seems to be some issue with opendkim when on a system that's using sssd 2.9.1-4.el9_3.1 or greater. I have already opened a case with Red Hat but since opendkim is not part of their repo's they can't say if it's a bug with sssd or opendkim.

For some background: We're using several domains to send from with around 50+ keys. Using, ss -anp | grep 'opendkim' | sed 's/ *$//', we can see streams grow after each email that get's signed. Below is a small example. Generally, we only really see maybe 11 streams at a time and they don't stay open long.

` u_str ESTAB 0 0 53817391 0 users:(("opendkim",pid=1856103,fd=77)) u_str ESTAB 0 0 53832275 0 users:(("opendkim",pid=1856103,fd=218)) u_str ESTAB 0 0 53824876 0 users:(("opendkim",pid=1856103,fd=124)) u_str ESTAB 0 0 53824756 0 users:(("opendkim",pid=1856103,fd=133)) u_str ESTAB 0 0 53836020 0 users:(("opendkim",pid=1856103,fd=255)) u_str ESTAB 0 0 53838661 0 users:(("opendkim",pid=1856103,fd=285)) u_str ESTAB 0 0 53808986 0 users:(("opendkim",pid=1856103,fd=30)) u_str ESTAB 0 0 53835631 0 users:(("opendkim",pid=1856103,fd=270)) u_str ESTAB 0 0 53832887 0 users:(("opendkim",pid=1856103,fd=245)) u_str ESTAB 0 0 53838157 0 users:(("opendkim",pid=1856103,fd=283)) u_str ESTAB 0 0 53840434 0 users:(("opendkim",pid=1856103,fd=305)) u_str ESTAB 0 0 53830151 0 users:(("opendkim",pid=1856103,fd=182)) u_str ESTAB 0 0 53830225 0 users:(("opendkim",pid=1856103,fd=187)) u_str ESTAB 0 0 53830723 0 users:(("opendkim",pid=1856103,fd=191)) u_str ESTAB 0 0 53828693 0 users:(("opendkim",pid=1856103,fd=176)) u_str ESTAB 0 0 53827800 0 users:(("opendkim",pid=1856103,fd=171)) u_str ESTAB 0 0 53825776 0 users:(("opendkim",pid=1856103,fd=146)) u_str ESTAB 0 0 53834975 0 users:(("opendkim",pid=1856103,fd=259)) u_str ESTAB 0 0 53808916 0 users:(("opendkim",pid=1856103,fd=23)) u_str ESTAB 0 0 53810481 0 users:(("opendkim",pid=1856103,fd=33)) u_str ESTAB 0 0 53832456 0 users:(("opendkim",pid=1856103,fd=234)) u_str ESTAB 0 0 53810634 0 users:(("opendkim",pid=1856103,fd=47)) u_str ESTAB 0 0 53835245 0 users:(("opendkim",pid=1856103,fd=260)) u_str ESTAB 0 0 53836391 0 users:(("opendkim",pid=1856103,fd=265)) u_str ESTAB 0 0 53827932 0 users:(("opendkim",pid=1856103,fd=175)) u_str ESTAB 0 0 53826487 0 users:(("opendkim",pid=1856103,fd=163)) tcp LISTEN 0 4096 127.0.0.1:8891 0.0.0.0:* users:(("opendkim",pid=1856103,fd=3))

`

After some time there will be several hundred (if not more) open streams. Eventually it'll grow and we'll start seeing milter-rejects in postfix logs. We can get things working again by restarting opendkim.

warning: milter inet:127.0.0.1:8891: can't read SMFIC_OPTNEG reply packet header: Connection reset by peer warning: milter inet:127.0.0.1:8891: read error in initial handshake NOQUEUE: milter-reject: CONNECT from server [IP]: 451 4.7.1 Service unavailable - try again later; proto=SMTP

This only started after we updated sssd to version 2.9.1-4.el9_3.1. When we downgrade to 2.9.1-4.el9_3 it doesn't happen. Like stated before Redhat is not able to confirm if this is a bug with sssd or opendkim since opendkim is not from their repos. We use the version from EPEL.

There is a similar post in the Rocky Linux forums with some additional information that seems to match the same issue we are seeing, https://forums.rockylinux.org/t/rocky-9-3-opendkim-latest-sssd-packages/12327.

I believe this is tied to nss. When enabling debug logging with sssd I can see that opendkim shows up as a client.

(2024-01-31 10:16:01): [nss] [accept_fd_handler] (0x0400): [CID#3] Client [cmd /usr/sbin/opendkim][uid 34188][0x562038d3c500][23] connected! (2024-01-31 10:16:01): [nss] [sss_cmd_get_version] (0x0200): [CID#3] Received client version [1]. (2024-01-31 10:16:01): [nss] [sss_cmd_get_version] (0x0200): [CID#3] Offered version [1]. (2024-01-31 10:16:01): [nss] [cache_req_set_plugin] (0x2000): [CID#3] CR #2: Setting "Enumerate users" plugin (2024-01-31 10:16:01): [nss] [cache_req_send] (0x0400): [CID#3] CR #2: REQ_TRACE: New request [CID #3] 'Enumerate users' (2024-01-31 10:16:01): [nss] [cache_req_select_domains] (0x0400): [CID#3] CR #2: Performing a multi-domain search (2024-01-31 10:16:01): [nss] [cache_req_search_domains] (0x0400): [CID#3] CR #2: Search will bypass the cache and check the data provider (2024-01-31 10:16:01): [nss] [cache_req_validate_domain_enumeration] (0x0400): [CID#3] CR #2: Domain mydomain.com does not support enumeration, skipping... (2024-01-31 10:16:01): [nss] [cache_req_validate_domain_enumeration] (0x0400): [CID#3] CR #2: Enumeration requested but not enabled (2024-01-31 10:16:01): [nss] [cache_req_global_ncache_add] (0x2000): [CID#3] CR #2: This request type does not support global negative cache (2024-01-31 10:16:01): [nss] [cache_req_process_result] (0x0400): [CID#3] CR #2: Finished: Not found (2024-01-31 10:16:01): [nss] [sss_nss_protocol_done] (0x4000): [CID#3] Sending reply: success (2024-01-31 10:16:01): [nss] [cache_req_set_plugin] (0x2000): [CID#3] CR #3: Setting "Enumerate users" plugin (2024-01-31 10:16:01): [nss] [cache_req_send] (0x0400): [CID#3] CR #3: REQ_TRACE: New request [CID #3] 'Enumerate users' (2024-01-31 10:16:01): [nss] [cache_req_select_domains] (0x0400): [CID#3] CR #3: Performing a multi-domain search (2024-01-31 10:16:01): [nss] [cache_req_search_domains] (0x0400): [CID#3] CR #3: Search will bypass the cache and check the data provider (2024-01-31 10:16:01): [nss] [cache_req_validate_domain_enumeration] (0x0400): [CID#3] CR #3: Domain mydomain.com does not support enumeration, skipping... (2024-01-31 10:16:01): [nss] [cache_req_global_ncache_add] (0x2000): [CID#3] CR #3: This request type does not support global negative cache (2024-01-31 10:16:01): [nss] [cache_req_process_result] (0x0400): [CID#3] CR #3: Finished: Not found (2024-01-31 10:16:01): [nss] [sss_nss_protocol_done] (0x4000): [CID#3] Sending reply: not found (2024-01-31 10:16:01): [nss] [sss_nss_endent] (0x0100): [CID#3] Resetting enumeration state (2024-01-31 10:16:01): [nss] [sss_nss_protocol_done] (0x4000): [CID#3] Sending reply: success (2024-01-31 10:16:02): [nss] [get_client_cred] (0x4000): Client [0x562038cec2c0][24] creds: euid[34188] egid[34188] pid[1884642] cmd_line['/usr/sbin/opendkim']. (2024-01-31 10:16:02): [nss] [setup_client_idle_timer] (0x4000): Idle timer re-set for client [0x562038cec2c0][24]

kivengh commented 3 weeks ago

Hi, I have similar issue on Amazon Linux 2023 and SSSD 2.9.1 abd 2.9.4. I've got to downgrade to SSSD 2.5.0 to get it works fine.