haraka / Haraka

A fast, highly extensible, and event driven SMTP server
https://haraka.github.io
MIT License
5.06k stars 661 forks source link

outbound: constant reconnect to unreachable servers #3100

Closed gramakri closed 1 year ago

gramakri commented 2 years ago

Describe the bug Haraka tries non-stop to connect to unreachable servers or servers that refuse a connection.

Expected behavior Haraka should not try to connect non-stop.

Additional context

This issue is same as https://github.com/haraka/Haraka/issues/2694 and possibly https://github.com/haraka/Haraka/issues/2507 .

I have debugged this problem and found that the core problem is this:

Issue is easy to reproduce: just enable debug logs and send mail to a random IP.

gramakri commented 2 years ago

A way to mitigate the issue is to set acquireTimeoutMillis. This will make acquire fail after acquireTimeoutMillis milli seconds (the default is to try forever). Note that this will busy loop for acquireTimeoutMillis at the minimum :-(

Not sure, what else we can do here since upstream module lacks abort feature.

msimerson commented 2 years ago

First, great job debugging the problem. I think the bits missing from outbound is an .on('factoryCreateError', .... handler. It's referenced in both of the upstream issues you cited, we have in smtp_client, be I didn't see it in a quick glance of the outbound code.

gramakri commented 2 years ago

@msimerson the factoryCreateError workaround posted in the upstream issues, just deque the last item. This won't work if we have multiple items (sockets) being created in parallel, no? There is a race and that code is not making sure that the work item being dequed, is the one that actually errored.

gramakri commented 2 years ago

It seems this is creating quite a few 100% CPU on many of our servers. @msimerson Do you have any suggestions for a workaround/fix ?

For the moment, I can submit a PR to set acquireTimeoutMillis to be a hardcoded 10 seconds. Does that sound like a reasonable timeout ? (Don't want to make this configurable since this is just a workaround till we work on a clear fix).

msimerson commented 2 years ago

Yes, anything between 10 and 60 seconds seems eminently reasonable to me.

msimerson commented 1 year ago

In looking into this further, I'm sorely tempted to just jettison all the pool code. It has been a persistent source of headaches and difficult to track down bugs for ages. And the documentation for it stinks. While debugging this and a couple other issues related to generic-pool, I've just found a place where it was throwing errors because a function in the module was removed. It wasn't mentioned in the upgrading guide.

gramakri commented 1 year ago

I would agree with the change. From my (casual) reading of the code when I was debugging this, that module offers a lot of options and feature which we don't quite need. And also the usage in Haraka is quite different from what the module is designed for. generic-pool wants to pool database connections. A database is expected to be always available. In Haraka, we try to pool connections to servers which may or may not be there. The pooling logic has to be aware of such, which it currently isn't. We can either propose a fix upstream but IMO it's easier to just write a our own simple pool (with similar API as upstream, to help migration initially).

baudehlo commented 1 year ago

I agree with this (with the caveat that it sounds like a lot of work). Thankfully I can say I didn't write the original :)

On Wed, Nov 30, 2022 at 4:12 AM Girish Ramakrishnan < @.***> wrote:

I would agree with the change. From my (casual) reading of the code when I was debugging this, that module offers a lot of options and feature which we don't quite need. And also the usage in Haraka is quite different from what the module is designed for. generic-pool wants to pool database connections. A database is expected to be always available. In Haraka, we try to pool connections to servers which may or may not be there. The pooling logic has to be aware of such, which it currently isn't. We can either propose a fix upstream but IMO it's easier to just write a our own simple pool (with similar API as upstream, to help migration initially).

— Reply to this email directly, view it on GitHub https://github.com/haraka/Haraka/issues/3100#issuecomment-1331848221, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAFBWYZZWBROZYIEXXGEQADWK4LBBANCNFSM6AAAAAAQVUSHKE . You are receiving this because you are subscribed to this thread.Message ID: @.***>