haraka / Haraka

A fast, highly extensible, and event driven SMTP server
https://haraka.github.io
MIT License
5.02k stars 662 forks source link

Haraka crashes when handling bounces in v2.2.1 #370

Closed krootee closed 10 years ago

krootee commented 10 years ago

It seems that bounce processing is broken in Haraka since version 2.1.0 (or so) when using custom bounce processing plugin - we just did upgrade to latest version from 2.1.0 (since its impossible to install it anymore - broken iconv dependency). Haraka fails with following error: [2013-10-01 09:24:27] [ERROR] [4AFE87CB-BA59-41D2-AD73-751C8303A036.12] [outbound] Double bounce: Error populating bounce message: Error: ENOENT, open '/var/haraka/outbound/queue/1380618904311_0_13792_3358.mailprocessing-server2' [2013-10-01 09:24:27] [CRIT] [-] [core] Error: Callback was already called. [2013-10-01 09:24:27] [CRIT] [-] [core] at /usr/lib/node_modules/Haraka/node_modules/async/lib/async.js:22:31 [2013-10-01 09:24:27] [CRIT] [-] [core] at HMailItem.hmail.next_cb (/usr/lib/node_modules/Haraka/outbound.js:51:9) [2013-10-01 09:24:27] [CRIT] [-] [core] at HMailItem.double_bounce (/usr/lib/node_modules/Haraka/outbound.js:1207:10) [2013-10-01 09:24:27] [CRIT] [-] [core] at /usr/lib/node_modules/Haraka/outbound.js:1191:25 [2013-10-01 09:24:27] [CRIT] [-] [core] at ReadStream. (/usr/lib/node_modules/Haraka/outbound.js:1145:9) [2013-10-01 09:24:27] [CRIT] [-] [core] at ReadStream.EventEmitter.emit (events.js:95:17) [2013-10-01 09:24:27] [CRIT] [-] [core] at fs.js:1500:12 [2013-10-01 09:24:27] [CRIT] [-] [core] at Object.oncomplete (fs.js:107:15) [2013-10-01 09:24:27] [NOTICE] [-] [core] Shutting down

We have custom bounce processing plugin that registers such failures in central system and then returns from plugin STOP. But code in outbound.js seems to have missing return when calling "this.discard()": .... HMailItem.prototype.bounce_respond = function (retval, msg) { if (retval != constants.cont) { this.loginfo("plugin responded with: " + retval + ". Not sending bounce."); if (retval === constants.stop) { this.discard(); // calls next_cb } else { return this.next_cb(); } } ...

krootee commented 10 years ago

Changing to "return this.discard()" helps, but now we need to use different version of "discard", since current version tries to call next_cb which is not required and if done causes internal plugin callback loop.

krootee commented 10 years ago

More crashes were observed - after investigating code i think problem is inside "discard" function itself, since looking at it's code usage in outbound.js it should not call "this.next_cb()" inside of it. outbound.js: HMailItem.prototype.discard = function () { this.refcount--; if (this.refcount === 0) { // Remove the file. fs.unlink(this.path, function () {}); this.next_cb(); } }

baudehlo commented 10 years ago

Hmm, I'll have to have a further look at this. Do you have a bit more of logs I can look at?

krootee commented 10 years ago

Not much logs now (made workaround and pushed to PROD), but it was reproducing easily with every bounce back - below is simple plugin that helps to reproduce the problem.

exports.hook_bounce = function (next, hmail) { return next(STOP); };

baudehlo commented 10 years ago

OK so with the above simple plugin, adding the "return" in fixes the problem.

I'm not sure what the other problem you're seeing is - the next_cb needs to be called or the queue will never properly flush.

krootee commented 10 years ago

I think you are right after checking logs. But leave next_cb() call in discard generates a lot of crashes with message like this: [2013-10-01 10:26:10] [CRIT] [-] [core] Error: Callback was already called. [2013-10-01 10:26:10] [CRIT] [-] [core] at /usr/lib/node_modules/Haraka/node_modules/async/lib/async.js:22:31 [2013-10-01 10:26:10] [CRIT] [-] [core] at HMailItem.hmail.next_cb (/usr/lib/node_modules/Haraka/outbound.js:51:9) [2013-10-01 10:26:10] [CRIT] [-] [core] at HMailItem.discard (/usr/lib/node_modules/Haraka/outbound.js:1230:14) [2013-10-01 10:26:10] [CRIT] [-] [core] at pluggableStream. (/usr/lib/node_modules/Haraka/outbound.js:1070:38) [2013-10-01 10:26:10] [CRIT] [-] [core] at pluggableStream.EventEmitter.emit (events.js:95:17) [2013-10-01 10:26:10] [CRIT] [-] [core] at pluggableStream.self.process_data (/usr/lib/node_modules/Haraka/line_socket.js:24:18) [2013-10-01 10:26:10] [CRIT] [-] [core] at pluggableStream. (/usr/lib/node_modules/Haraka/line_socket.js:34:44) [2013-10-01 10:26:10] [CRIT] [-] [core] at pluggableStream.EventEmitter.emit (events.js:95:17) [2013-10-01 10:26:10] [CRIT] [-] [core] at Socket. (/usr/lib/node_modules/Haraka/tls_socket.js:47:14) [2013-10-01 10:26:10] [CRIT] [-] [core] at Socket.EventEmitter.emit (events.js:95:17) [2013-10-01 10:26:10] [NOTICE] [-] [core] Shutting down

baudehlo commented 10 years ago

I need to see the lines above that

baudehlo commented 10 years ago

Specifically this looks like an error coming from populate_bounce_message ?

baudehlo commented 10 years ago

And your line numbers don't line up with Haraka 2.2.1 - have you modified your own outbound.js?

krootee commented 10 years ago

Probably it was already with our internal fixes - I will run clean version with "return" fix only and provide logs.

krootee commented 10 years ago

I ran server with "return" fix only for 12h under the load - no "next_cb()" crashes were registered. So probably it was one of our other changes that triggered crashes.

krootee commented 10 years ago

Checked again and found few crashes - it seems like edge case that is not handled inside Haraka. In outbound.js in "HMailItem.prototype.try_deliver_host" we set timeout to 15sec vs 300sec "socket.setTimeout(15 * 1000);" and that triggers problems sometimes (under the load having 300sec is not very good default setting) - you get timeout from the client connection, but after it you receive 550 from the server. If you set timeout to very small value then bug should be easy to reproduce.

Below is log with sensetive data removed from it: [2013-10-07 14:16:32] [DEBUG] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [core] running mail hooks [2013-10-07 14:16:32] [NOTICE] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [core] sender code=CONT msg="" [2013-10-07 14:16:32] [PROTOCOL] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [core] S: 250 sender OK [2013-10-07 14:16:32] [PROTOCOL] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [core] C: RCPT TO: state=1 [2013-10-07 14:16:32] [DEBUG] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [core] running rcpt hooks [2013-10-07 14:16:32] [DEBUG] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [core] running rcpt_ok hooks [2013-10-07 14:16:32] [NOTICE] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [core] recipient code=CONT msg="" [2013-10-07 14:16:32] [PROTOCOL] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [core] S: 250 recipient OK [2013-10-07 14:16:32] [PROTOCOL] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [core] C: DATA state=1 [2013-10-07 14:16:32] [DEBUG] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [core] running data hooks [2013-10-07 14:16:32] [DEBUG] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [core] running data hook in process_title plugin [2013-10-07 14:16:32] [DEBUG] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [core] hook=data plugin=process_title function=hook_data params="" retval=CONT msg="" [2013-10-07 14:16:32] [PROTOCOL] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [core] S: 354 go ahead, make my day [2013-10-07 14:16:32] [DEBUG] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [core] running data_post hooks [2013-10-07 14:16:32] [NOTICE] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [core] message mid="<ID>@<server-name>" size=9666 rcpts=1/0/0 delay=0 code=CONT msg="" [2013-10-07 14:16:32] [DEBUG] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [core] running queue_outbound hooks [2013-10-07 14:16:32] [DEBUG] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [core] running queue_outbound hook in dkim_sign plugin [2013-10-07 14:16:32] [INFO] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [dkim_sign] [2013-10-07 14:16:32] [DEBUG] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [core] hook=queue_outbound plugin=dkim_sign function=hook_queue_outbound params="" retval=CONT msg="" [2013-10-07 14:16:32] [NOTICE] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [core] queue code=CONT msg="" [2013-10-07 14:16:32] [DEBUG] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [core] running queue_ok hooks [2013-10-07 14:16:32] [NOTICE] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [core] queue code=CONT msg="Mail Queued" [2013-10-07 14:16:32] [PROTOCOL] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [core] S: 250 Mail Queued [2013-10-07 14:16:32] [DEBUG] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [core] running reset_transaction hooks [2013-10-07 14:16:32] [PROTOCOL] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [core] C: MAIL FROM: state=1 [2013-10-07 14:16:32] [DEBUG] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [outbound] running send_email hooks [2013-10-07 14:16:32] [DEBUG] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [outbound] Sending mail: 1381155392594_0_29015_9411. [2013-10-07 14:16:32] [DEBUG] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [outbound] running get_mx hooks [2013-10-07 14:16:32] [INFO] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [outbound] Looking up A records for: mx2..com [2013-10-07 14:16:32] [INFO] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [outbound] Attempting to deliver to: :25 (1) (12) [2013-10-07 14:16:32] [PROTOCOL] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [outbound] S: 220 MX2..com ESMTP MailMarshal (v6.4.6.5922) Ready\r\n [2013-10-07 14:16:32] [PROTOCOL] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [outbound] C: EHLO mail7. [2013-10-07 14:16:32] [PROTOCOL] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [outbound] S: 250-MX2..com Hello mail7. ()\r\n [2013-10-07 14:16:32] [PROTOCOL] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [outbound] S: 250 SIZE\r\n [2013-10-07 14:16:32] [PROTOCOL] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [outbound] C: MAIL FROM: [2013-10-07 14:16:32] [PROTOCOL] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [outbound] S: 250 sender ok \r\n [2013-10-07 14:16:32] [PROTOCOL] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [outbound] C: RCPT TO: [2013-10-07 14:16:47] [ERROR] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [outbound] Outbound connection timed out to :25 [2013-10-07 14:16:47] [INFO] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [outbound] Looking up A records for: mx1..com [2013-10-07 14:16:47] [INFO] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [outbound] Attempting to deliver to: :25 (0) (38) [2013-10-07 14:16:47] [ERROR] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [outbound] Ongoing connection failed to :25 : Error: connect ECONNREFUSED [2013-10-07 14:16:47] [INFO] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [outbound] Temp failing 1381155392594_0_29015_9411. for 1500 seconds: Tried all MXs [2013-10-07 14:16:49] [PROTOCOL] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [outbound] S: 550 Internal problem\r\n [2013-10-07 14:16:49] [INFO] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [outbound] bouncing mail: Some recipients failed: [2013-10-07 14:16:49] [DEBUG] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [outbound] running bounce hooks [2013-10-07 14:16:49] [DEBUG] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [outbound] running bounce hook in email_delivery_status plugin [2013-10-07 14:16:49] [INFO] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [outbound] hook=bounce plugin=email_delivery_status function=hook_bounce params="Some recipients failed: " retval=STOP msg="" [2013-10-07 14:16:49] [INFO] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [outbound] plugin responded with: 901. Not sending bounce. [2013-10-07 14:16:49] [PROTOCOL] [F0276E6F-2832-474B-A12B-518E5B46ECA2.15] [outbound] C: QUIT [2013-10-07 14:16:49] [CRIT] [-] [core] Error: Callback was already called. [2013-10-07 14:16:49] [CRIT] [-] [core] at /usr/lib/node_modules/Haraka/node_modules/async/lib/async.js:22:31 [2013-10-07 14:16:49] [CRIT] [-] [core] at HMailItem.hmail.next_cb (/usr/lib/node_modules/Haraka/outbound.js:51:9) [2013-10-07 14:16:49] [CRIT] [-] [core] at HMailItem.discard (/usr/lib/node_modules/Haraka/outbound.js:1230:14) [2013-10-07 14:16:49] [CRIT] [-] [core] at pluggableStream. (/usr/lib/node_modules/Haraka/outbound.js:1070:38) [2013-10-07 14:16:49] [CRIT] [-] [core] at pluggableStream.EventEmitter.emit (events.js:95:17) [2013-10-07 14:16:49] [CRIT] [-] [core] at pluggableStream.self.process_data (/usr/lib/node_modules/Haraka/line_socket.js:24:18) [2013-10-07 14:16:49] [CRIT] [-] [core] at pluggableStream. (/usr/lib/node_modules/Haraka/line_socket.js:34:44) [2013-10-07 14:16:49] [CRIT] [-] [core] at pluggableStream.EventEmitter.emit (events.js:95:17) [2013-10-07 14:16:49] [CRIT] [-] [core] at Socket. (/usr/lib/node_modules/Haraka/tls_socket.js:47:14) [2013-10-07 14:16:49] [CRIT] [-] [core] at Socket.EventEmitter.emit (events.js:95:17) [2013-10-07 14:16:49] [NOTICE] [-] [core] Shutting down

baudehlo commented 10 years ago

I've committed a fix. Please test and re-open this bug if it doesn't work.

krootee commented 10 years ago

Thanks - will do.

sogko commented 10 years ago

Had a similar issue. Haraka keeps crashing due to a bounced email

Wrote a bounce plugin, always returning a next(OK).

[DEBUG] [0C3F9456-00BB-45A6-9A31-2865E0CB9418.25] [outbound] running bounce hook in email_service.hook_bounce plugin [DEBUG] [-] [email_service.hook_bounce] hook_bounce [DEBUG] [-] [email_service.hook_bounce] "Unrecognised response from upstream server: rblsmtpd: 11.11.11.11 pid 11483: 451 http://www.barracudanetworks.com/reputation/?pr=1&ip=11.11.11.11\n" [DEBUG] [-] [email_service.hook_bounce] {"domain":null,"_events":{},"_maxListeners":10,"path":"/opt/haraka/queue/1385342052540_1_9671_3918.smtp.-----.com","filename":"1385342052540_1_9671_3918.smtp.-----.com","next_process":"1385341986158","num_failures":1,"notes":{},"file_size":105499,"data_start":301,"todo":{"domain":"-----.com.sg","rcpt_to":[{"original":"admin@-----.com.sg","user":"admin","host":"-----.com.sg"}],"mail_from":{"original":"mail@-----.com","user":"mail","host":"-----.com"},"notes":{},"uuid":"0C3F9456-00BB-45A6-9A31-2865E0CB9418.25"},"hooks_to_run":[],"mxlist":[],"hostlist":[],"bounce_error":"Unrecognised response from upstream server: rblsmtpd: 11.11.11.11 pid 11483: 451 http://www.0-------.com/reputation/?pr=1&ip=11.11.11.11\n","current_hook":[{"name":"email_service.hook_bounce","base":{},"timeout":30,"full_paths":["/opt/haraka/plugins/email_service.hook_bounce.js","/usr/local/lib/node_modules/Haraka/plugins/email_service.hook_bounce.js"],"config":{},"hooks":{"queue_ok":["hook_queue_ok"],"delivered":["hook_delivered"],"bounce":["hook_bounce"]}},"hook_bounce"]} [DEBUG] [-] [email_service.hook_bounce] {"pool":{"2555:localhost:300":{}}}} [DEBUG] [-] [email_service.hook_bounce] UUID:0C3F9456-00BB-45A6-9A31-2865E0CB9418 [DEBUG] [-] [email_service.hook_bounce] filename DOES NOT EXISTS:1385342052540_1_9671_3918.smtp.-----.com [INFO] [0C3F9456-00BB-45A6-9A31-2865E0CB9418.25] [outbound] hook=bounce plugin=email_service.hook_bounce function=hook_bounce params="Unrecognised response from upstream server: rblsmtpd: 11.11.11.11 pid 11483: 451 http://www.barracudanetworks.com/reputation/?pr=1&ip=11.11.11.11 " retval=OK msg="" [INFO] [0C3F9456-00BB-45A6-9A31-2865E0CB9418.25] [outbound] plugin responded with: 906. Not sending bounce. [PROTOCOL] [0C3F9456-00BB-45A6-9A31-2865E0CB9418.25] [outbound] S: 220 rblsmtpd.local\r\n [PROTOCOL] [0C3F9456-00BB-45A6-9A31-2865E0CB9418.25] [outbound] C: EHLO smtp.-----.com [ERROR] [0C3F9456-00BB-45A6-9A31-2865E0CB9418.25] [outbound] Ongoing connection failed: Error: write after end [INFO] [0C3F9456-00BB-45A6-9A31-2865E0CB9418.25] [outbound] Temp failing 1385342052540_1_9671_3918.smtp.-----.com for 128 seconds: Tried all MXs [PROTOCOL] [301D0BF6-555C-41B0-BE53-7518828014A7.52] [outbound] S: 220 web.de (mxweb106) Nemesis ESMTP Service ready\r\n [DEBUG] [-] [email_service.hook_bounce] "OK id=1Vkkkb-000Y2k-UM" [DEBUG] [98E6D740-00FB-4ACF-B8A8-22AE9DDCF509.47] [outbound] hook=delivered plugin=email_service.hook_bounce function=hook_delivered params="OK id=1Vkkkb-000Y2k-UM" retval=CONT msg="" [CRIT] [-] [core] Error: ENOENT, open '/opt/haraka/queue/1385342052540_1_9671_3918.smtp.-----.com' [INFO] [-] [core] Shutting down

sogko commented 10 years ago

opened another issue #388