ory / kratos


SMTP worker can become slow when too many invalid emails are queued up #402

Closed - koenmtb1 closed this issue 2 years ago

koenmtb1 commented 4 years ago

Describe the bug

When the server is unable to connect to the SMTP server, it will spam the error log and use a lot of resources by retrying very frequently.

Reproducing the bug

Steps to reproduce the behavior:

  1. Set up Kratos with incorrect SMTP details.
  2. Add the verify hook to the after-registration hooks.
  3. Do a registration through the self-service API.

Server logs (Sensitive details have been changed)

5/10/2020 2:47:00 PM time="2020-05-10T12:47:00Z" level=error msg="Unable to send email using SMTP connection." error="dial tcp 0.0.0.0:465: connect: connection refused" message_from=example@example.com smtp_server="mail.example.com:465" smtp_ssl_enabled=true
5/10/2020 2:47:01 PM time="2020-05-10T12:47:01Z" level=error msg="Unable to send email using SMTP connection." error="dial tcp 0.0.0.0:465: connect: connection refused" message_from=example@example.com smtp_server="mail.example.com:465" smtp_ssl_enabled=true
5/10/2020 2:47:02 PM time="2020-05-10T12:47:02Z" level=error msg="Unable to send email using SMTP connection." error="dial tcp 0.0.0.0:465: connect: connection refused" message_from=example@example.com smtp_server="mail.example.com:465" smtp_ssl_enabled=true
5/10/2020 2:47:03 PM time="2020-05-10T12:47:03Z" level=error msg="Unable to send email using SMTP connection." error="dial tcp 0.0.0.0:465: connect: connection refused" message_from=example@example.com smtp_server="mail.example.com:465" smtp_ssl_enabled=true
5/10/2020 2:47:04 PM time="2020-05-10T12:47:04Z" level=error msg="Unable to send email using SMTP connection." error="dial tcp 0.0.0.0:465: connect: connection refused" message_from=example@example.com smtp_server="mail.example.com:465" smtp_ssl_enabled=true
5/10/2020 2:47:05 PM time="2020-05-10T12:47:05Z" level=error msg="Unable to send email using SMTP connection." error="dial tcp 0.0.0.0:465: connect: connection refused" message_from=example@example.com smtp_server="mail.example.com:465" smtp_ssl_enabled=true

This continues on and on.

Server configuration

dsn: OMITTED
identity:
  traits:
    default_schema_url: file://home/ory/user.schema.json
selfservice:
  strategies:
    password:
      enabled: true
  logout:
    redirect_to: OMITTED
  settings:
    request_lifespan: 1h
    privileged_session_max_age: 1h
    after:
      password:
        default_return_to: OMITTED
      profile:
        default_return_to: OMITTED
  verify:
    request_lifespan: 1h
    link_lifespan: 24h
  login:
    request_lifespan: 1h
    after:
      default_return_to: OMITTED
  registration:
    request_lifespan: 10m
    after:
      default_return_to: OMITTED
      password:
        hooks:
        - hook: verify
        - hook: session
courier:
  smtp:
    connection_uri: OMITTED
    from_address: OMITTED
serve:
  admin:
    host: 0.0.0.0
    port: 4434
  public:
    host: 0.0.0.0
    port: 4433
urls:
  self:
    public: OMITTED
    admin: OMITTED
  mfa_ui: OMITTED
  login_ui: OMITTED
  settings_ui: OMITTED
  registration_ui: OMITTED
  error_ui: OMITTED
  verify_ui: OMITTED
  default_return_to: OMITTED
  whitelisted_return_to_urls: OMITTED
secrets:
  session:
    - OMITTED
hashers:
  argon2:
    OMITTED

Expected behavior

For it to retry a few times, increasing the delay between attempts each time, and maybe give up after a few tries. Somehow let the administrator know that it's not working correctly, but without spamming the console.
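
For illustration, a minimal sketch of the retry behaviour the reporter asks for, assuming a hypothetical send function; the delays and the give-up threshold are made up for the example, not values Kratos uses:

```go
package courier

import (
	"fmt"
	"time"
)

// sendWithBackoff retries a send function with exponential backoff and gives
// up after maxAttempts, instead of logging one failure per second forever.
func sendWithBackoff(send func() error, maxAttempts int) error {
	delay := time.Second
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := send()
		if err == nil {
			return nil
		}
		if attempt == maxAttempts {
			return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
		}
		time.Sleep(delay)
		delay *= 2 // 1s, 2s, 4s, ... between attempts
	}
	return nil // only reached when maxAttempts < 1
}
```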


aeneasr commented 4 years ago

Yeah I think we could introduce exponential backoff here!

jakhog commented 4 years ago

I started working on this, and enabling the already in-place exponential back-off when Kratos is not able to connect to the SMTP server is fairly easy. However, I started thinking about other per-message errors (not sure what they would be, maybe malformed emails or too large bodies), and started playing with adding an exponential backoff to each message, storing state in the DB. Then I suddenly realised that Kratos is supposed to scale horizontally by sharing the DB, and that this could cause all sorts of race conditions and strange behaviour.

So, my question becomes: is it necessary to have a courier service in Kratos that queues outgoing messages before sending them to a 3rd party service?

With my current understanding, it would make more sense for the service to send it directly to the 3rd party service, and if it fails, return the error back to the flow that triggered it and subsequently fail the flow. This would probably also improve the UX if sending an email fails (right now the user would not know that no email was sent).

If the purpose of the courier queue is to keep track of the messages that were sent, most services (e.g. SendGrid) offer this feature by default. We could of course keep the log in the DB for audit purposes as well, if that is of interest.

aeneasr commented 4 years ago

Thank you for your thoughtful comment and for working towards a contribution!

With my current understanding, it would make more sense for the service to send it directly to the 3rd party service, and if it fails, return the error back to the flow that triggered it and subsequently fail the flow. This would probably also improve the UX if sending an email fails (right now the user would not know that no email was sent).

Message delivery is asynchronous; waiting for message delivery in the UX is bad because we might end up waiting 60 seconds or more to get an acknowledgement that the message was delivered (or we will never know).

Because we don't want to depend on other 3rd party services such as RabbitMQ to keep things easy, we'll have to deal with this in Kratos (which is already the case). I don't think there are possible race conditions because we use transactions. In the rare case where a race condition might still occur (e.g. a simultaneous, down-to-the-nanosecond SELECT), we'll just end up sending two emails, which isn't great, but it's also not a dealbreaker!
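
For illustration, a sketch of how transactional row locking can keep two instances from claiming the same queued message; the table, columns, and the FOR UPDATE SKIP LOCKED query (PostgreSQL syntax) are assumptions for the example, not Kratos' actual schema or queries:

```go
package courier

import (
	"database/sql"
)

// claimNextMessage locks one queued row and flips it to "processing" inside a
// transaction, so a concurrent worker running the same query skips that row
// instead of sending the same email a second time.
func claimNextMessage(db *sql.DB) (id string, err error) {
	tx, err := db.Begin()
	if err != nil {
		return "", err
	}
	defer tx.Rollback() // no-op once Commit has succeeded

	row := tx.QueryRow(`SELECT id FROM courier_messages
		WHERE status = 'queued'
		ORDER BY created_at
		LIMIT 1
		FOR UPDATE SKIP LOCKED`)
	if err := row.Scan(&id); err != nil {
		return "", err // sql.ErrNoRows means the queue is currently empty
	}

	if _, err := tx.Exec(`UPDATE courier_messages SET status = 'processing' WHERE id = $1`, id); err != nil {
		return "", err
	}
	return id, tx.Commit()
}
```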

jakhog commented 4 years ago

Oh ok, I didn't know that the SMTP message posting was that slow. But then it makes total sense to do it in the background.

However, if the sending takes more than a second, then we should expect that all Kratos instances send a copy of the same email, since they are polling the same queue every second, and the message is left as queued until the send operation is finished. The occasional double email isn't that bad, but consistently sending duplicates looks a bit strange to an end user. I guess the easiest solution for now is just to document that only one of the instances should be configured with the SMTP courier, and it's up to the admin to make sure of that? Then in the future one could look at some sort of leader election using the DB so that the config can be the same for all instances.

Another problem, however, is if sending fails for a particular message - so not the unable-to-connect failure, but a message that fails for whatever reason (email too big?). In the current implementation, that message would be retried every second, which spams the logs. The bigger problem though is that if 8 of these un-sendable messages end up at the front of the queue, Kratos would never send any more emails, always retrying those 8 every time. I have a few suggestions for a fix (or a combination), with a rough sketch of the first one below:

  1. Store retry state per message, with a backoff strategy and retry count, to reduce the log spam and give up at some point.
  2. Continue fetching groups of 8 messages until there are no more in the queue (essentially paging the queue).
  3. Mark messages that are rejected by the SMTP server as failed, and never retry.

The third one is definitely the easiest, and I don't know if there is that much point in retrying per email?
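
A rough sketch of the first option, under the assumption of hypothetical per-message retry fields (retry count, next retry time) on the message row; field names are illustrative, not Kratos' actual schema:

```go
package courier

import "time"

// Message carries the hypothetical per-message retry state of option 1.
type Message struct {
	Status      string
	RetryCount  int
	NextRetryAt time.Time
}

// handleSendFailure backs a failing message off exponentially and abandons it
// after maxRetries, so a handful of un-sendable messages cannot block the
// front of the queue forever.
func handleSendFailure(msg *Message, maxRetries int) {
	msg.RetryCount++
	if msg.RetryCount >= maxRetries {
		msg.Status = "abandoned" // give up; surface these to the admin instead of retrying
		return
	}
	// Exponential backoff: 2s, 4s, 8s, ... before the message becomes eligible again.
	msg.NextRetryAt = time.Now().Add(time.Duration(1<<msg.RetryCount) * time.Second)
	msg.Status = "queued"
}
```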

aeneasr commented 4 years ago

However, if the sending takes more than a second, then we should expect that all Kratos instances send a copy of the same email, since they are polling the same queue every second, and the message is left as queued until the send operation is finished

Nice find! This is definitely an issue. I think that we originally planned to have this as its own sub-command (e.g. kratos courier watch) but we abandoned that idea in favor of a simpler set-up. In that scenario, you would have only one cron job that watches the queue, but that's definitely no longer the case.

In my opinion, we should have three message states:

  1. queued
  2. processing (goes back to queue on error, goes to sent on success)
  3. sent

In the future, we might also keep track of errors and put the message in a permanent error state if continued delivery fails.
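
For illustration, a small sketch of those states and the transitions described above; the type and constant names are assumptions, not Kratos' real identifiers:

```go
package courier

// MessageStatus models the lifecycle described above.
type MessageStatus int

const (
	StatusQueued     MessageStatus = iota // waiting to be picked up by a worker
	StatusProcessing                      // a worker is currently trying to send it
	StatusSent                            // handed off to the SMTP server successfully
	StatusAbandoned                       // possible future state: delivery failed permanently
)

// afterSend applies the transitions from the list: back to queued on error,
// sent on success.
func afterSend(err error) MessageStatus {
	if err != nil {
		return StatusQueued
	}
	return StatusSent
}
```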

ro-tex commented 3 years ago

Are there any workarounds for this issue while we wait for a permanent fix?

We are looking to start using Kratos in production very soon but we have 8 machines right now and we're looking to double that in the next 6 months. Sending more than a dozen duplicates of each email is not really an option for us.

We are considering removing the email credentials from all but one machine. This will make it a single point of failure, so it's definitely not a great choice but it might just make it possible for us to operate. Can you advise on whether that's a viable strategy? I am not familiar with the inner workings of Kratos, so it's hard for me to anticipate the possible problems this might create e.g. multiple nodes failing to send the email and marking it as "bad" in the DB, so it never gets sent, etc.

aeneasr commented 3 years ago

The best thing would be to create a PR with a fix! That way you have a long-term solution, help out the community, and don't spend time on hacks which potentially mess up your system.

My idea here was to remove the background worker and create a new CLI command called e.g. kratos watch courier or kratos worker which basically executes the linked function in the foreground.

This would then need to be properly documented in several places - e.g. quickstart, guides, concepts. We would also need to update the docker-compose example.

The last thing is: if DSN=memory, the mail server should actually run as part of kratos serve (as it does today), and kratos watch courier should refuse to start in that case, because the data is kept in the memory of the kratos process, which other processes don't have access to!
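
For illustration, a sketch of what such a sub-command could look like using cobra (which the Kratos CLI is built on); the command wiring and the DSN check are assumptions about the idea, not the eventual implementation:

```go
package cmd

import (
	"errors"

	"github.com/spf13/cobra"
)

// newCourierWatchCmd sketches the proposed foreground worker command. It
// refuses to start when the DSN is "memory", because then the message queue
// only exists inside the `kratos serve` process and a standalone worker could
// never see it.
func newCourierWatchCmd(getDSN func() string) *cobra.Command {
	return &cobra.Command{
		Use:   "watch",
		Short: "Run the message courier worker in the foreground",
		RunE: func(cmd *cobra.Command, args []string) error {
			if getDSN() == "memory" {
				return errors.New("the standalone courier worker requires a shared DSN; with dsn=memory it runs as part of `kratos serve`")
			}
			// ... start the queue-watching loop here ...
			return nil
		},
	}
}
```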

aeneasr commented 2 years ago

@jonas-jonas this might be a good next issue for you, let's discuss it when you're in Munich :)

aeneasr commented 2 years ago

See also https://github.com/ory/kratos/issues/1598

jonas-jonas commented 2 years ago

@aeneasr sounds great, really looking forward to that!

jonas-jonas commented 2 years ago

I think this issue is already solved, coincidentally by me, 5 months ago: https://github.com/ory/kratos/pull/2257

The PR introduced a new configuration key (courier.message_ttl) that sets a "time-to-live" for failed (read: unsent) messages, so that they are removed from the queue after the specified time.

An endpoint to query these could be added by querying all messages with the status MessageStatusAbandoned. Should we do that in a follow-up ticket?
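
For reference, a sketch of how such a TTL check could behave; the field and status names are assumptions for the example, not the code from the linked PR:

```go
package courier

import "time"

// Message holds just the fields needed to illustrate the TTL check.
type Message struct {
	Status    string
	CreatedAt time.Time
}

// abandonExpired marks every message that has stayed unsent for longer than
// messageTTL as abandoned, so it is no longer retried; an admin endpoint could
// then list messages in that state.
func abandonExpired(msgs []*Message, messageTTL time.Duration, now time.Time) {
	for _, m := range msgs {
		if m.Status != "sent" && now.Sub(m.CreatedAt) > messageTTL {
			m.Status = "abandoned"
		}
	}
}
```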

aeneasr commented 2 years ago

Oh, indeed! I'm wondering what a good value would be for message_ttl? Could we also use a retry count?

Also, it seems like this setting is currently not used in Ory Cloud - if you search for the package "dogfood" in the backoffice you will find the kratos configuration - could you please set a reasonable message TTL there?