haraka / Haraka

A fast, highly extensible, and event driven SMTP server
https://haraka.github.io
MIT License
5.08k stars 661 forks source link

Haraka Shutting Down #2438

Closed kkr78 closed 6 months ago

kkr78 commented 6 years ago

system info

Haraka Haraka.js — Version: 2.8.18
Node v9.2.1
OS Linux ip-20-4-xx-xx 4.9.62-21.56.amzn1.x86_64 #1 SMP Thu Nov 16 05:37:08 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
openssl OpenSSL 1.0.2k-fips 26 Jan 2017

Observed behavior

2018-05-29T11:16:09.767Z [NOTICE] [74E36277-C785-482D-989B-00A1CD3826DF.16] [core] queue code=DENYSOFT msg="4.4.1 Connection timed out. Total session duration: 00:20:05.3286439 [CWLGBR01FT006.eop-gbr01.prod.protection.outlook.com] (74E36277-C785-482D-989B-00A1CD3826DF.16)"
2018-05-29T11:16:09.781Z [NOTICE] [74E36277-C785-482D-989B-00A1CD3826DF.16] [core] disconnect ip=20.4.xx.xx rdns=NXDOMAIN helo=[127.0.0.1] relay=N early=N esmtp=Y tls=N pipe=N errors=0 txns=16 rcpts=16/0/0 msgs=15/1/0 bytes=1631558 lr="450 4.4.1 Connection timed out. Total session duration: 00:20:05.3286439 [CWLGBR01FT006.eop-gbr01.prod.protection.outlook.com] (74E36277-C785-482D-989B-00A1CD3826DF.16)" time=43.986

2018-05-29T11:16:17.327Z [CRIT] [-] [core] Error: read ECONNRESET
2018-05-29T11:16:17.327Z [CRIT] [-] [core]     at _errnoException (util.js:1031:13)
2018-05-29T11:16:17.327Z [CRIT] [-] [core]     at TCP.onread (net.js:619:25)
2018-05-29T11:16:17.328Z [NOTICE] [-] [core] Shutting down

This issue occurred only two or three times after upgraded to 2.8.18 - one last week and other today. I was not able to capture more traces. is there any way to identify based on the above error stack? BTW, Office 365 closes the connection every 20 mins so it's pretty common to see "DENYSOFT" and followed by "disconnect" in the logs very often. Not sure if that's related.

kkr78 commented 6 years ago

Here are some additional details in case if it helps to find the root cause

1) Haraka primarily configured to forward the emails to office 365. But for a small number of emails, I set connection.relay=true based on the sender which delivers mail using outbound MX based delivery.

#smtp_forward.ini

enable_outbound=false

[xx.net]
host=xx.mail.protection.outlook.com
port=25
enable_tls=true
timeout=300
connect_timeout=320

2) Find the errors below logged before one hour of the crash. These errors occurred other times as well so may not be related.

2018-05-29T10:36:41.935Z [ERROR] [502FF71D-0ECB-489C-8BB6-5F9F6B60F8C3.1.1] [outbound] Ongoing connection failed to 165.251.51.28:25 : Error: socket hang up
2018-05-29T10:36:42.412Z [ERROR] [502FF71D-0ECB-489C-8BB6-5F9F6B60F8C3.1.1] [outbound] Ongoing connection failed to 165.251.51.29:25 : Error: socket hang up
2018-05-29T10:37:12.414Z [ERROR] [-] [core] [outbound] Failed to get pool entry: Outbound connection timed out to 194.202.133.35:25
2018-05-29T10:37:42.417Z [ERROR] [-] [core] [outbound] Failed to get pool entry: Outbound connection timed out to 194.202.133.42:25
2018-05-29T10:38:12.419Z [ERROR] [-] [core] [outbound] Failed to get pool entry: Outbound connection timed out to 194.202.133.36:25
2018-05-29T10:39:15.145Z [ERROR] [377D3C55-3839-4D83-9FF7-6041ADFE0246.8] [queue/smtp_forward] dead sender
2018-05-29T10:40:44.387Z [ERROR] [91A45D89-7FEB-40E4-8FCE-46214C4C62D7.1.1] [outbound] Ongoing connection failed to 194.214.201.8:25 : Error: 139744362600256:error:14077410:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure:../deps/openssl/openssl/ssl/s23_clnt.c:802:
2018-05-29T10:40:44.912Z [ERROR] [91A45D89-7FEB-40E4-8FCE-46214C4C62D7.1.1] [outbound] Ongoing connection failed to 194.214.200.9:25 : Error: 139744362600256:error:14077410:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure:../deps/openssl/openssl/ssl/s23_clnt.c:802:
2018-05-29T10:40:45.435Z [ERROR] [91A45D89-7FEB-40E4-8FCE-46214C4C62D7.1.1] [outbound] Ongoing connection failed to 194.214.201.9:25 : Error: 139744362600256:error:14077410:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure:../deps/openssl/openssl/ssl/s23_clnt.c:802:
2018-05-29T10:40:45.960Z [ERROR] [91A45D89-7FEB-40E4-8FCE-46214C4C62D7.1.1] [outbound] Ongoing connection failed to 194.214.200.8:25 : Error: 139744362600256:error:14077410:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure:../deps/openssl/openssl/ssl/s23_clnt.c:802:
2018-05-29T10:50:39.880Z [ERROR] [-] [core] [outbound] Failed to get pool entry: Outbound connection timed out to 151.193.65.92:25
2018-05-29T10:55:17.379Z [ERROR] [502FF71D-0ECB-489C-8BB6-5F9F6B60F8C3.1.1] [outbound] Ongoing connection failed to 165.251.51.28:25 : Error: socket hang up
2018-05-29T10:55:17.857Z [ERROR] [502FF71D-0ECB-489C-8BB6-5F9F6B60F8C3.1.1] [outbound] Ongoing connection failed to 165.251.51.29:25 : Error: socket hang up
2018-05-29T10:55:47.859Z [ERROR] [-] [core] [outbound] Failed to get pool entry: Outbound connection timed out to 194.202.133.35:25
2018-05-29T10:56:17.894Z [ERROR] [-] [core] [outbound] Failed to get pool entry: Outbound connection timed out to 194.202.133.36:25
2018-05-29T10:56:47.926Z [ERROR] [-] [core] [outbound] Failed to get pool entry: Outbound connection timed out to 194.202.133.42:25
2018-05-29T11:10:59.275Z [ERROR] [-] [core] [outbound] Failed to get pool entry: Outbound connection timed out to 37.18.145.120:25
2018-05-29T11:15:24.530Z [ERROR] [-] [core] [outbound] Failed to get pool entry: Outbound connection timed out to 194.214.200.8:25
2018-05-29T11:15:54.619Z [ERROR] [-] [core] [outbound] Failed to get pool entry: Outbound connection timed out to 194.214.200.9:25
celesteking commented 6 years ago

Hi. Thanks for the report. Are you running from master branch? Could #2375 be related?

kkr78 commented 6 years ago

I installed using npm ( npm install -g Haraka@2.8.18)

2375 is different as I don't see hmail traces in the stack and error messages not matching other than the last line with the message Shutting Down.

celesteking commented 6 years ago

Could you check out from master branch and see if it solves the issue? Also, it would be wonderful if you run with DEBUG log level as that error backtrace is not helpful at all. What plugins are you running? Do you have any custom stuff?

kkr78 commented 6 years ago

I will definitely try. Can I simply download and override the Haraka installed on the box or do I need to follow any process to build and install it?

Enabling it on Debug level is challenging as we need to run it for at least week to reproduce it, it will fill up the logs as we deal with huge traffic. If the issue still persists after the update then I will try to enable at DEBUG and provide the corresponding logs.

Yes, I have a custom rcpt plugin decides whether to use the outbound (MX Based delivery) or smtp_forward (forward to Office 365 SMTP endpoint). Only 3 to 5% uses outbound.

exports.hook_rcpt = function (next, connection, params) {
    var rcpt_to=connection.transaction.rcpt_to
    var isRelay = isRelayRequired(rcpt_to)
    if(!isRelay){
       connection.relaying=false
       next(OK);
    }else{
       connection.relaying=true;
       next(OK);
    }
}
pvagner commented 6 years ago

Hello,

I am also getting this very very randomly. My instance only accepts mail on port 587 submission via TLS, then verifies users using haraka-plugin-ldap also using rspamd and relaying to the outside world. Apparently some kind of malicious botnet was hammering this instance at port 587 and it was able to somehow break the connection in an unexpected way like this. I am sorry this is only log level info. To me this looks as if nothing serious has happened during the SMTP conversations and perhaps this exception should get caught and logged rather causing haraka to exit. Or is there something else that might go wrong?

[NOTICE] [0A8C8D6E-61E5-4470-85E3-F44C0A1D9FF2] [core] connect ip=103.89.89.236 
port=10491 local_ip=:: local_port=587
[INFO] [0A8C8D6E-61E5-4470-85E3-F44C0A1D9FF2] [tls] secured: cipher=AES256-SHA v
ersion=TLSv1/SSLv3 verified=false
[INFO] [0A8C8D6E-61E5-4470-85E3-F44C0A1D9FF2] [core]  hook=unrecognized_command 
plugin=tls function=upgrade_connection params=STARTTLS retval=OK msg=""
[INFO] [0A8C8D6E-61E5-4470-85E3-F44C0A1D9FF2] [core]  hook=unrecognized_command 
plugin=ldap function=handle_authn params=AUTH retval=OK msg=""
[INFO] [0A8C8D6E-61E5-4470-85E3-F44C0A1D9FF2] [core]  hook=unrecognized_command 
plugin=ldap function=handle_authn params="c2FtcGxlQHB2YWduZXIudGs=" retval=OK ms
g=""
[INFO] [0A8C8D6E-61E5-4470-85E3-F44C0A1D9FF2] [core] client [103.89.89.236] half
 closed connection
[NOTICE] [0A8C8D6E-61E5-4470-85E3-F44C0A1D9FF2] [core] disconnect ip=103.89.89.2
36 rdns=NXDOMAIN helo=win-67fnmqrvgi3.domain relay=N early=N esmtp=Y tls=Y pipe=
N errors=0 txns=0 rcpts=0/0/0 msgs=0/0/0 bytes=0 lr="" time=20.179
[NOTICE] [F7AE40A8-3EBC-400B-9110-8E763FDF2B5D] [core] connect ip=103.89.89.236 
port=11876 local_ip=:: local_port=587
[INFO] [F7AE40A8-3EBC-400B-9110-8E763FDF2B5D] [tls] secured: cipher=AES256-SHA v
ersion=TLSv1/SSLv3 verified=false
[INFO] [F7AE40A8-3EBC-400B-9110-8E763FDF2B5D] [core]  hook=unrecognized_command 
plugin=tls function=upgrade_connection params=STARTTLS retval=OK msg=""
[INFO] [F7AE40A8-3EBC-400B-9110-8E763FDF2B5D] [core]  hook=unrecognized_command 
plugin=ldap function=handle_authn params=AUTH retval=OK msg=""
[CRIT] [-] [core] Error: read ECONNRESET
[CRIT] [-] [core]     at TCP.onread (net.js:660:25)
[NOTICE] [-] [core] Shutting down
baudehlo commented 6 years ago

It's definitely a bug. The problem is the error handler isn't created soon enough. tls_socket.js:createServer needs an initial .on(error) added to the cryptoSocket returned immediately:

const server = net.createServer(cryptoSocket => {
    cryptoSocket.on('error', (err) => ... )

On Thu, Sep 6, 2018 at 4:02 AM pvagner notifications@github.com wrote:

Hello,

I am also getting this very very randomly. My instance only accepts mail on port 587 submission via TLS, then verifies users using haraka-plugin-ldap also using rspamd and relaying to the outside world. Apparently some kind of malicious botnet was hammering this instance at port 587 and it was able to somehow break the connection in an unexpected way like this. I am sorry this is only log level info. To me this looks as if nothing serious has happened during the SMTP conversations and perhaps this exception should get caught and logged rather causing haraka to exit. Or is there something else that might go wrong?

[NOTICE] [0A8C8D6E-61E5-4470-85E3-F44C0A1D9FF2] [core] connect ip=103.89.89.236 port=10491 local_ip=:: local_port=587 [INFO] [0A8C8D6E-61E5-4470-85E3-F44C0A1D9FF2] [tls] secured: cipher=AES256-SHA v ersion=TLSv1/SSLv3 verified=false [INFO] [0A8C8D6E-61E5-4470-85E3-F44C0A1D9FF2] [core] hook=unrecognized_command plugin=tls function=upgrade_connection params=STARTTLS retval=OK msg="" [INFO] [0A8C8D6E-61E5-4470-85E3-F44C0A1D9FF2] [core] hook=unrecognized_command plugin=ldap function=handle_authn params=AUTH retval=OK msg="" [INFO] [0A8C8D6E-61E5-4470-85E3-F44C0A1D9FF2] [core] hook=unrecognized_command plugin=ldap function=handle_authn params="c2FtcGxlQHB2YWduZXIudGs=" retval=OK ms g="" [INFO] [0A8C8D6E-61E5-4470-85E3-F44C0A1D9FF2] [core] client [103.89.89.236] half closed connection [NOTICE] [0A8C8D6E-61E5-4470-85E3-F44C0A1D9FF2] [core] disconnect ip=103.89.89.2 36 rdns=NXDOMAIN helo=win-67fnmqrvgi3.domain relay=N early=N esmtp=Y tls=Y pipe= N errors=0 txns=0 rcpts=0/0/0 msgs=0/0/0 bytes=0 lr="" time=20.179 [NOTICE] [F7AE40A8-3EBC-400B-9110-8E763FDF2B5D] [core] connect ip=103.89.89.236 port=11876 local_ip=:: local_port=587 [INFO] [F7AE40A8-3EBC-400B-9110-8E763FDF2B5D] [tls] secured: cipher=AES256-SHA v ersion=TLSv1/SSLv3 verified=false [INFO] [F7AE40A8-3EBC-400B-9110-8E763FDF2B5D] [core] hook=unrecognized_command plugin=tls function=upgrade_connection params=STARTTLS retval=OK msg="" [INFO] [F7AE40A8-3EBC-400B-9110-8E763FDF2B5D] [core] hook=unrecognized_command plugin=ldap function=handle_authn params=AUTH retval=OK msg="" [CRIT] [-] [core] Error: read ECONNRESET [CRIT] [-] [core] at TCP.onread (net.js:660:25) [NOTICE] [-] [core] Shutting down

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/haraka/Haraka/issues/2438#issuecomment-419002616, or mute the thread https://github.com/notifications/unsubscribe-auth/AAobY_O4w9TSIG97UN8fOvX50dXX-yqdks5uYNZ9gaJpZM4URaO7 .

kkr78 commented 6 years ago

I still see this issue with Haraka 2.8.21. This is the only issue that I have with Haraka. Otherwise its running smoothly in production. How can we catch these type of errors with out crashing? Please find the log with Debug below

2018-09-07T15:53:16.198Z [INFO] [-] [core] [outbound] [outbound::25:185.15.50.11:undefined:50] dispense() clients=1 available=0
2018-09-07T15:53:16.198Z [DEBUG] [-] [core] [outbound] [outbound::25:185.15.50.11:undefined:50] createResource() - creating obj - count=1 min=0 max=10
2018-09-07T15:53:16.199Z [DEBUG] [23A29ACF-535D-439B-93AE-F1411A78F77C] [core] [outbound] created
2018-09-07T15:53:20.753Z [INFO] [DA135BB2-AF3B-409A-B933-C5A44976D0CC.1] [core] client [20.4.1.179] half closed connection
2018-09-07T15:53:20.753Z [DEBUG] [DA135BB2-AF3B-409A-B933-C5A44976D0CC.1] [core] client has disconnected
2018-09-07T15:53:20.753Z [DEBUG] [DA135BB2-AF3B-409A-B933-C5A44976D0CC.1] [core] running disconnect hooks
2018-09-07T15:53:20.753Z [DEBUG] [DA135BB2-AF3B-409A-B933-C5A44976D0CC.1] [core] client has disconnected
2018-09-07T15:53:20.753Z [DEBUG] [DA135BB2-AF3B-409A-B933-C5A44976D0CC.1] [core] client has disconnected
2018-09-07T15:53:20.753Z [NOTICE] [DA135BB2-AF3B-409A-B933-C5A44976D0CC.1] [core] disconnect ip=20.4.1.179 rdns=NXDOMAIN helo=[127.0.0.1] relay=N early=N esmtp=Y tls=N pipe=N errors=0 txns=1 rcpts=1/0/0 msgs=1/0/0 bytes=7864 lr="" time=60.028
2018-09-07T15:53:44.155Z [INFO] [-] [core] [outbound] [outbound::25:194.202.133.35:undefined:50] dispense() clients=0 available=0
2018-09-07T15:53:46.199Z [NOTICE] [-] [core] [outbound] Failed to get pool entry: Outbound connection timed out to 185.15.50.11:25
2018-09-07T15:53:46.199Z [INFO] [4ECA8AF9-B4B4-4565-A7D3-8E83CA5F01DA.1.1] [outbound] Looking up A records for: b.mx.p50.neolane.net
2018-09-07T15:53:46.199Z [INFO] [-] [core] [outbound] [outbound::25:185.15.50.11:undefined:50] dispense() clients=0 available=0
2018-09-07T15:53:46.220Z [INFO] [4ECA8AF9-B4B4-4565-A7D3-8E83CA5F01DA.1.1] [outbound] Attempting to deliver to: 185.15.50.12:25 (0) (57)
2018-09-07T15:53:46.220Z [INFO] [-] [core] [outbound] [outbound::25:185.15.50.12:undefined:50] dispense() clients=1 available=0
2018-09-07T15:53:46.220Z [DEBUG] [-] [core] [outbound] [outbound::25:185.15.50.12:undefined:50] createResource() - creating obj - count=1 min=0 max=10
2018-09-07T15:53:46.220Z [DEBUG] [D89E07D4-BFC8-4E24-B3C5-CB38C7540AB6] [core] [outbound] created
2018-09-07T15:53:46.303Z [INFO] [-] [core] [outbound] acquired socket D89E07D4-BFC8-4E24-B3C5-CB38C7540AB6 for outbound::25:185.15.50.12:undefined:50
2018-09-07T15:53:46.957Z [NOTICE] [4ECA8AF9-B4B4-4565-A7D3-8E83CA5F01DA.1.1] [outbound]  delivered file=1536335596066_1536335596066_0_21752_UWDuv9_9720_ip-20-4-11-95 domain=reservation.accor-mail.com host=b.mx.p50.neolane.net ip=185.15.50.12 port=25 mode=SMTP tls=N auth=N response="Ok: queued as 94F101AE2B23" delay=30.891 fails=0 rcpts=1/0/0
2018-09-07T15:53:46.957Z [DEBUG] [4ECA8AF9-B4B4-4565-A7D3-8E83CA5F01DA.1.1] [outbound] running delivered hooks
2018-09-07T15:53:47.040Z [DEBUG] [-] [core] [outbound] release_client: D89E07D4-BFC8-4E24-B3C5-CB38C7540AB6 185.15.50.12:25 to undefined
2018-09-07T15:53:47.040Z [INFO] [-] [core] [outbound] [outbound::25:185.15.50.12:undefined:50] dispense() clients=0 available=1
2018-09-07T15:53:47.822Z [INFO] [-] [core] [outbound] [outbound::25:185.15.50.11:undefined:50] dispense() clients=0 available=1
2018-09-07T15:54:06.510Z [CRIT] [-] [core] Error: read ECONNRESET
2018-09-07T15:54:06.510Z [CRIT] [-] [core]     at _errnoException (util.js:1031:13)
2018-09-07T15:54:06.510Z [CRIT] [-] [core]     at TCP.onread (net.js:619:25)
2018-09-07T15:54:06.511Z [NOTICE] [-] [core] Shutting down
msimerson commented 5 years ago

You could try the patch in #2501

msimerson commented 6 months ago

No activity and no reports for 5 years. A fair number of changes (esp the discarding of the pooling code) have happened since and I'm thinking this may be a solved problem. Please reopen if this error can be replicated on a modern version of Haraka.