ory / kratos

Next-gen identity server replacing your Auth0, Okta, Firebase with hardened security and PassKeys, SMS, OIDC, Social Sign In, MFA, FIDO, TOTP and OTP, WebAuthn, passwordless and much more. Golang, headless, API-first. Available as a worry-free SaaS with the fairest pricing on the market!
https://www.ory.sh/kratos/?utm_source=github&utm_medium=banner&utm_campaign=kratos
Apache License 2.0

Strategies to prevent mass email sending for several flows #1835

Open aeneasr opened 2 years ago

aeneasr commented 2 years ago

Is your feature request related to a problem? Please describe.

There is a way to abuse verification e-mails sent by kratos-courier. A client can frequently request re-sending of the verification e-mail and frequently change the e-mail address. In these cases, Kratos sends an e-mail for every client request. This behaviour can turn us into a spam source from the SMTP provider's point of view, and it can also lead to overcharging.

I couldn't find a way to prevent this abuse case by configuring Kratos, so I tried to solve the problem on my app side. To do this, I need the date the last e-mail was sent for a verifiable address, and I thought the updateAt field could be used for this, since the status changes after the e-mail is sent. When I looked at the courier code, I noticed that the updateAt field isn't updated by the Pop library if the status value doesn't change (for example, when re-sending a verification mail). So in this patch, we force an update of this field.

Describe the solution you'd like

This is indeed an issue, in fact, the issue touches several parts:

  1. Creating mass new accounts if the verify hook is enabled
  2. Requesting recovery / verification for random email addresses
  3. Creating an account and continuously updating the verification email so that the verify hook triggers

Adding an indicator of last delivery is IMO not a good approach though:

  1. Some users might re-request sending the email a couple of times if it lands in the e.g. spam folder
  2. Emails will be delivered for non-existing accounts also (so you have no previous delivery time)

I'd love to hear ideas though on how to solve it. One possibility would be adding captchas to these submission forms.

aeneasr commented 2 years ago

For more context: https://github.com/ory/kratos/pull/1780

fmmoret commented 2 years ago

Opt-in deduping on identical parameters + flow ids + bucketed attempt times could create artificial stateless rate limits and could be a cheap & easy improvement. Adding opt-in captcha on top could be nice for those who use kratos' provided interface.

A lot of us users will be adding captcha manually via our custom UIs and would want to bypass kratos' use of captcha there. FWIW, I will probably do some of that deduping in my own app. Ex: I'll ignore users spamming the reset password button and let one request propagate all the way to kratos per x (10?) minutes.
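A minimal Go sketch of the deduping idea, assuming the key is built from the flow ID, the recipient, the template name and a time bucket. The function name, its parameters and the 10-minute bucket are illustrative assumptions, not part of Kratos. Because the key is derived purely from the request, no per-instance state is needed: a unique index or a cache lookup on the key is enough to drop repeats.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
	"time"
)

// Demo values, used only for illustration.
var demoStart = time.Date(2021, 10, 14, 12, 0, 0, 0, time.UTC)

const demoBucket = 10 * time.Minute

// dedupeKey derives a stateless key from the request parameters, the flow ID
// and the time bucket the attempt falls into. Identical requests inside the
// same bucket produce the same key. All names here are hypothetical.
func dedupeKey(flowID, recipient, template string, at time.Time, bucket time.Duration) string {
	// Truncate rounds the timestamp down to the start of its bucket.
	slot := at.UTC().Truncate(bucket).Unix()
	// Assumption: addresses are compared case-insensitively.
	parts := []string{flowID, strings.ToLower(recipient), template, fmt.Sprint(slot)}
	sum := sha256.Sum256([]byte(strings.Join(parts, "\x00")))
	return hex.EncodeToString(sum[:])
}

func main() {
	first := dedupeKey("flow-1", "user@example.com", "recovery", demoStart, demoBucket)
	repeat := dedupeKey("flow-1", "user@example.com", "recovery", demoStart.Add(3*time.Minute), demoBucket)
	later := dedupeKey("flow-1", "user@example.com", "recovery", demoStart.Add(demoBucket), demoBucket)
	fmt.Println("repeat in same bucket is a duplicate:", first == repeat)
	fmt.Println("request in next bucket is a duplicate:", first == later)
}
```

One caveat of fixed buckets: two requests a few seconds apart can still land in different buckets if they straddle a boundary.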

the-tunnster commented 2 years ago

What if you set up a small goroutine to check when the last request was made and how frequent the calls are? Set a rate limit, and when it is reached, just stop making calls for a set period. Use a defer to close the individual goroutines and clean up, allowing a fresh set of requests again after a set amount of time.

Not sure how effective this would be, considering server load, but in my opinion it would account for users who genuinely need it, and block the server from sending out unlimited emails.

fmmoret commented 2 years ago

I think maintaining that much state in an instance would force the courier to be a singleton in perpetuity -- whereas I imagine the kratos team would like courier to be horizontally scalable (just need man hours for locks).

the-tunnster commented 2 years ago

What if we add a variable to the user's schema/json file that updates in a way that it counts the number of times an email is delivered/sent out to that user's specific email address, and use this in conjunction with another field that checks when the last link sent was sent?

Whenever a new email request is made, the courier needs to check the current timestamp and the timestamp of when the last email was sent, as well as the count of the total number of emails sent over that time period. Then we divide the total number of emails sent so far by the time period. Then, we define a constant rate of count/timeperiod, which is used as an upper limit.

Using this, we can rate limit some of the emails sent, but allow the less frequent ones to go through. This should reduce server load, if I'm not mistaken.

mitar commented 2 years ago

I think one option to address this is to provide a way for the recipient to report spam. So e-mail should be something like:

Hi, please verify your account by clicking the following link: <link>

If you do not recognize or have not expected this e-mail, it is safe to ignore it, or report it <here>

mitar commented 2 years ago

Another option is also to include an unsubscribe link as the e-mail header. This can then be used by the mail client to automatically unsubscribe the recipient from more of those e-mails.
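For illustration, a small Go sketch of the headers this would involve: `List-Unsubscribe` (RFC 2369) and, for one-click handling by mail clients, `List-Unsubscribe-Post` (RFC 8058). The helper name, the URL and the mailbox are placeholders, and how the courier would attach these headers to outgoing mail is left open.

```go
package main

import (
	"fmt"
	"net/textproto"
)

// unsubscribeHeaders returns the List-Unsubscribe header (RFC 2369) plus the
// one-click hint (RFC 8058). reportURL and mailbox are placeholders supplied
// by the caller; this helper is hypothetical and not part of Kratos.
func unsubscribeHeaders(reportURL, mailbox string) textproto.MIMEHeader {
	h := textproto.MIMEHeader{}
	h.Set("List-Unsubscribe", fmt.Sprintf("<%s>, <mailto:%s>", reportURL, mailbox))
	h.Set("List-Unsubscribe-Post", "List-Unsubscribe=One-Click")
	return h
}

func main() {
	h := unsubscribeHeaders("https://example.com/unsubscribe?token=abc", "unsubscribe@example.com")
	for key, values := range h {
		fmt.Printf("%s: %s\n", key, values[0])
	}
}
```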

Gibheer commented 2 years ago

A configuration could be introduced to make courier only process so many messages in a given time frame (wouldn't the same problem be true for sending SMS?). By analyzing the source account, target mail address and other options multiple messages of the same kind could be grouped together to send only the last message (as suggested in https://github.com/ory/kratos/issues/1835#issuecomment-943766165).

As all messages are currently stored in the database the frequencies could be fetched from the database to determine if dropping messages becomes necessary. The API could also be extended with a separate field to represent the last time a message was sent to that user computed based on past sent messages.

Another possibility is to run a background job to analyze message frequencies and trigger custom flows which could then lead to ban users, trigger alarms and the like.

Benehiko commented 2 years ago

A configuration could be introduced to make courier only process so many messages in a given time frame (wouldn't the same problem be true for sending SMS?).

Interesting solution, however, what if the user is unfortunate enough to try a recovery flow while the courier is in a "waiting" state? That would entail that the user could be in a wait queue until the next batch and only receive their recovery email much later (instead of a few seconds~minute).

Another possible problem that could occur is the spammer could still spam emails into the message queue which will just delay any legitimate email. You might then also need to adjust the configuration each time based on the number of users you get into the system. Maybe some days you have higher signups or recovery flows? What happens then? Do we then need to dynamically adjust this limit of messages we can send?

By analyzing the source account, target mail address and other options multiple messages of the same kind could be grouped together to send only the last message (as suggested in https://github.com/ory/kratos/issues/1835#issuecomment-943766165).

Within which time-frame would it be acceptable to group messages? This would also mean the courier would need to still delay the sending of emails (see the previous problem with that).

As all messages are currently stored in the database the frequencies could be fetched from the database to determine if dropping messages becomes necessary. The API could also be extended with a separate field to represent the last time a message was sent to that user computed based on past sent messages.

This might lead to tons of unnecessary database calls for a singular email to be sent out. What if the user does not exist?

Another possibility is to run a background job to analyze message frequencies and trigger custom flows which could then lead to ban users, trigger alarms and the like.

Wouldn't this just be standard rate limiting? If we added rate limiting on the infrastructure level we wouldn't need such complexity in the courier. E.g. the user communicates too much with the verification flow and thus gets an IP ban for a couple of minutes, or is prevented from sending more than x requests per minute, and thus the courier receives no request to send out emails.

Gibheer commented 2 years ago

A configuration could be introduced to make courier only process so many messages in a given time frame (wouldn't the same problem be true for sending SMS?).

Interesting solution, however, what if the user is unfortunate enough to try a recovery flow while the courier is in a "waiting" state? That would entail that the user could be in a wait queue until the next batch and only receive their recovery email much later (instead of a few seconds~minute).

Yes, the user could end up waiting longer for their mail to arrive.

Another possible problem that could occur is the spammer could still spam emails into the message queue which will just delay any legitimate email. You might then also need to adjust the configuration each time based on the number of users you get into the system. Maybe some days you have higher signups or recovery flows? What happens then? Do we then need to dynamically adjust this limit of messages we can send?

Just sending all mails at once can land your IP or domain on block lists for excessive sending of mails. If your domain is on even one of these lists, it can take weeks or even months to get unblocked again. Depending on the requirements, available infrastructure and the mail hoster, it can make sense to raise or lower this bar until the sweet spot for sending messages is hit. My own mail server can send only a couple of mails per day without getting on any blocklist; for others this limit may be in the hundreds or thousands, or they may be one of the trusted unlimited senders. But it would be a tool in the toolbox to limit the amount of traffic.

By analyzing the source account, target mail address and other options multiple messages of the same kind could be grouped together to send only the last message (as suggested in #1835 (comment)).

Within which time-frame would it be acceptable to group messages? This would also mean the courier would need to still delay the sending of emails (see the previous problem with that).

It is not up to me to decide on the time-frame. That is an environment-specific requirement. In some cases it might not be necessary at all because only company-internal users can interact with kratos; in other cases the public internet can interact with kratos, and one is bound by traffic limits to not end up on any blocklist (see comment above).

It could be completely fine to use both the delay and grouping of mails in 30s intervals or 5min intervals. It could also be reasonable to limit the traffic only for specific flows, for example registration mails are sent in 2min intervals, while verification mails are sent in 30s intervals.

As I said above, this could be a tool not a requirement for everyone.

As all messages are currently stored in the database the frequencies could be fetched from the database to determine if dropping messages becomes necessary. The API could also be extended with a separate field to represent the last time a message was sent to that user computed based on past sent messages.

This might lead to tons of unnecessary database calls for a singular email to be sent out. What if the user does not exist?

I'm not sure what you mean with unnecessary database calls for a singular message. One can send one request to the database to fetch the last message to send and the frequency of mails sent in the last n min and the timestamp of the earliest message currently unsent. If the frequency is over a configured limit all messages up to that one would be marked as dropped because of a limit. If the frequency is below the threshold send the latest message, mark it as sent and the rest of unsent messages as deduplicated.
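The decision described above can be sketched in Go as follows. This is only an in-memory illustration under stated assumptions: the `message` type, the status names and `processRecipient` are hypothetical, and a real implementation would express the same logic as a database query against the courier's message table.

```go
package main

import (
	"fmt"
	"time"
)

type status string

const (
	queued       status = "queued"
	sent         status = "sent"
	dropped      status = "dropped"
	deduplicated status = "deduplicated"
)

// message is a hypothetical, simplified view of one courier queue row.
type message struct {
	ID     int
	Status status
	SentAt time.Time // only meaningful once Status is sent
}

// Demo values, used only for illustration.
var demoNow = time.Date(2021, 10, 14, 12, 0, 0, 0, time.UTC)

const demoWindow = 5 * time.Minute

// processRecipient decides the fate of one recipient's queued messages.
// msgs must be ordered oldest first. If limit or more mails were sent within
// window, every queued message is dropped. Otherwise only the latest queued
// message is sent and the older queued ones are marked as deduplicated.
func processRecipient(msgs []message, now time.Time, window time.Duration, limit int) {
	recentlySent := 0
	var pending []int // indexes of queued messages
	for i, m := range msgs {
		switch {
		case m.Status == sent && now.Sub(m.SentAt) <= window:
			recentlySent++
		case m.Status == queued:
			pending = append(pending, i)
		}
	}
	if len(pending) == 0 {
		return
	}
	if recentlySent >= limit {
		for _, i := range pending {
			msgs[i].Status = dropped
		}
		return
	}
	for _, i := range pending[:len(pending)-1] {
		msgs[i].Status = deduplicated
	}
	latest := pending[len(pending)-1]
	msgs[latest].Status = sent
	msgs[latest].SentAt = now
}

func main() {
	msgs := []message{
		{ID: 1, Status: sent, SentAt: demoNow.Add(-time.Minute)},
		{ID: 2, Status: queued},
		{ID: 3, Status: queued},
	}
	processRecipient(msgs, demoNow, demoWindow, 3)
	for _, m := range msgs {
		fmt.Println(m.ID, m.Status)
	}
}
```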

If someone asks for the last mail date of an account that doesn't exist, wouldn't it return just an empty result? There would be no need to go looking for mails at all.

Another possibility is to run a background job to analyze message frequencies and trigger custom flows which could then lead to ban users, trigger alarms and the like.

Wouldn't this just be standard rate limiting? If we added rate limiting on the infrastructure level we wouldn't need such complexity in the courier. E.g. the user communicates too much with the verification flow and thus gets an IP ban for a couple of minutes, or is prevented from sending more than x requests per minute, and thus the courier receives no request to send out emails.

It is true that it would be standard rate limiting and that it could be done at the frontend. But then again, it is also possible to set up rate limiting directly on the mail server.

But environments are different and sometimes complicated. Working without any limit might work in one case; in another, every possibility to reduce the amount of outgoing traffic can be of help. Even setting a maximum number of mails to send out can be a huge help.

Benehiko commented 2 years ago

I don't think the limit on emails would be the best way since this could have an impact on user registration and recovery - which is directly related to the service integrating with Kratos. It is true that some services for sending out emails have a block list for excessive sending, but with managed services such as Mailgun or SendGrid I don't think it's too much of a problem (unless we really hit a couple hundred thousand emails in a given time-frame).

From my understanding your proposal will apply limits globally (all emails, legitimate and spam). I don't think this is the right approach as this can really have an impact on user experience when interacting with these flows - since it does not separate legitimate users from spammers. And correct me if I am wrong, but it seems you mentioned the use of grouping messages to prevent spam - which seems like a more viable option - but would only work against spammers trying to send to the same email address or accounts we know of.

Maybe we should take a step back and look at the problem:

  1. Creating mass new accounts if the verify hook is enabled
  2. Requesting recovery / verification for random email addresses
  3. Creating an account and continuously updating the verification email so that the verify hook triggers

Adding an indicator of last delivery is IMO not a good approach though:

  • Some users might re-request sending the email a couple of times if it lands in the e.g. spam folder
  • Emails will be delivered for non-existing accounts also (so you have no previous delivery time)

The problem with the user conducting a registration or a recovery or a verification flow is that we don't currently collect any additional information about where the flow originated from. We don't know if the correct person is even submitting the recovery flow for an email they own. The current approach is an accept all and send the email out.

With that being said, we need to think about a way to limit this only for the person doing the spamming. Maybe we should build a layer into the courier that just does a quick check on where the request is originating from, which flow is being executed and then calculate the number of requests from that origin + flow and if this falls within the category of spam. We can even do this within an in-memory database.
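A rough Go sketch of such a check, assuming the request origin (for example the client IP) is available to the courier, which the paragraph above notes is not collected today. The `spamGuard` type and its limits are hypothetical. It is a fixed-window counter keyed on origin and flow, held in process memory, so several courier instances would each count separately.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// originFlow identifies who is asking and for what. Origin is an assumption:
// Kratos does not currently pass this to the courier.
type originFlow struct {
	Origin string // e.g. client IP
	Flow   string // e.g. "recovery", "verification"
}

type window struct {
	start time.Time
	count int
}

// spamGuard is a hypothetical in-memory, fixed-window request counter.
type spamGuard struct {
	mu     sync.Mutex
	limit  int
	period time.Duration
	seen   map[originFlow]window
}

// Demo values, used only for illustration.
var demoNow = time.Date(2021, 10, 14, 12, 0, 0, 0, time.UTC)

const demoPeriod = 10 * time.Minute

func newSpamGuard(limit int, period time.Duration) *spamGuard {
	return &spamGuard{limit: limit, period: period, seen: make(map[originFlow]window)}
}

// allow reports whether a request from origin for flow may enqueue an email.
func (g *spamGuard) allow(origin, flow string, now time.Time) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	key := originFlow{Origin: origin, Flow: flow}
	w := g.seen[key]
	if w.count == 0 || now.Sub(w.start) >= g.period {
		w = window{start: now} // start a fresh window
	}
	if w.count >= g.limit {
		return false
	}
	w.count++
	g.seen[key] = w
	return true
}

func main() {
	g := newSpamGuard(2, demoPeriod)
	for i := 1; i <= 3; i++ {
		fmt.Printf("recovery request %d allowed: %v\n", i, g.allow("203.0.113.7", "recovery", demoNow))
	}
}
```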

Gibheer commented 2 years ago

I don't think the limit on emails would be the best way since this could have an impact on user registration and recovery - which is directly related to the service integrating with Kratos. It is true that some services for sending out emails have a block list for excessive sending, but with managed services such as Mailgun or SendGrid I don't think it's too much of a problem (unless we really hit a couple hundred thousand emails in a given time-frame).

Mailgun and SendGrid may be trusted senders, but even then you can end up being blocked for some time. But please do not forget about the many that self host their stuff.

From my understanding your proposal will apply limits globally (all emails, legitimate and spam). I don't think this is the right approach as this can really have an impact on user experience when interacting with these flows - since it does not separate legitimate users from spammers. And correct me if I am wrong, but it seems you mentioned the use of grouping messages to prevent spam - which seems like a more viable option - but would only work against spammers trying to send to the same email address or accounts we know of.

In my initial comment I stated:

By analyzing the source account, target mail address and other options multiple messages of the same kind could be grouped together to send only the last message (as suggested in #1835 (comment)).

Grouping mails requires at minimum one key to go by, be it the receiver, the account or even the process type. When there is nothing to group by, then that wouldn't work. With flows open from the internet where you do not have anything to group by, a global limit is the only thing you can go by. In these cases having a global limit can be of help to stop the processing before your domain loses reputation.

Maybe we should take a step back and look at the problem:

  1. Creating mass new accounts if the verify hook is enabled
  2. Requesting recovery / verification for random email addresses
  3. Creating an account and continuously updating the verification email so that the verify hook triggers

Adding an indicator of last delivery is IMO not a good approach though:

  • Some users might re-request sending the email a couple of times if it lands in the e.g. spam folder
  • Emails will be delivered for non-existing accounts also (so you have no previous delivery time)

The problem with the user conducting a registration or a recovery or a verification flow is that we don't currently collect any additional information about where the flow originated from. We don't know if the correct person is even submitting the recovery flow for an email they own. The current approach is an accept all and send the email out.

For verification and recovery context is available in the database. So how is it not possible to know if the correct person is submitting the recovery flow? The mail address has to exist, otherwise anybody could recover any account and take it over. Even in the case of a registration a partial identity is available to build the context. Creating a couple thousand accounts with the same mail address would be identifiable. Creating the same account with thousands of mail addresses would be identifiable. Creating thousands of accounts with thousands of mail addresses, none of which are reused from thousands of source IPs all over the world, you need a limit on the amount of message traffic you send out (and yes, at the incoming border too). But in each case you need a different defense and different limits. Defending against the latter case doesn't help with the first case.

With that being said, we need to think about a way to limit this only for the person doing the spamming. Maybe we should build a layer into the courier that just does a quick check on where the request is originating from, which flow is being executed and then calculate the number of requests from that origin + flow and if this falls within the category of spam. We can even do this within an in-memory database.

This is exactly what https://github.com/ory/kratos/issues/1835#issuecomment-943766165 proposed. I don't see a need for an extra layer and in-memory database though. Collecting the data in an in-memory database would only introduce a caching layer which would skew the result depending on how much traffic the courier instance sees. Imagine a redundant courier service in different locations to guarantee interruption-free processing of message deliveries, with a couple of thousand message requests because of some event. Introducing an extra layer storing its result in an external database would result in varying results depending on the location and time the check is run. A better approach would be to run the check at the moment of sending the mail. Sending mails is already asynchronous, so taking the time to get the data correct would be okay and can be done on read-only instances of the origin database.

aeneasr commented 2 years ago

Regarding one of the comments made - the issue with recovery / verification is that we currently send out emails to people even if they do not have an account. This is to avoid account enumeration. This is a great blog article about this topic:

https://www.troyhunt.com/understanding-account-enumeration-the-video-tutorial-edition/

Basically, by not exposing whether an account exists or not in the UI, and moving that info to the out-of-band communication (so email), we can avoid leaking this information.

Having said that, Kratos currently has several other means of performing account enumeration attacks. So one possibility would be to remove this email and just show an error: "this email does not exist". That would avoid the spam problem for systems that do not have issues with account enumeration.

For all others though, this is of course problematic.

In the end there are only a few ways to protect against this:

  1. Anti-automation using CAPTCHA - this approach however is not great because it does not really work well with non-browser apps (e.g. native mobile, CLIs, ...)
  2. Some type of proof-of-work (https://stackoverflow.com/questions/50231793/proof-of-work-login-instead-of-captcha) - not a great approach either because it usually requires some type of javascript to work and it would also be incredibly slow for regular users
  3. rate limiting - also not great because it is usually IP bound - see one such example vulnerability report here: https://hackerone.com/reports/1320976
  4. rules to skip sending of emails that could be abused in spam attacks (e.g. we see 1000 emails sent out to accounts that do not exist - we skip sending those emails for the next 2 hours) - however this also isn't great because the rules need to be constantly tweaked to defend against attackers
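As an illustration of option 4, here is a minimal Go sketch of a rule that pauses emails to unknown accounts once a threshold is crossed. The `unknownBreaker` type and its fields are hypothetical, and the counting window is an assumption since the example above only names the threshold (1000) and the pause (2 hours). The sketch is not safe for concurrent use and keeps its state in one process.

```go
package main

import (
	"fmt"
	"time"
)

// unknownBreaker pauses emails to addresses without an account once too many
// were requested within a window. Hypothetical; thresholds need tuning.
type unknownBreaker struct {
	threshold int           // e.g. 1000
	window    time.Duration // counting window, an assumption
	pause     time.Duration // e.g. 2 hours

	windowStart time.Time
	count       int
	pausedUntil time.Time
}

// Demo values, used only for illustration.
var demoNow = time.Date(2021, 10, 14, 12, 0, 0, 0, time.UTC)

const (
	demoWindow = time.Hour
	demoPause  = 2 * time.Hour
)

// shouldSend reports whether an email to an unknown account should go out.
func (b *unknownBreaker) shouldSend(now time.Time) bool {
	if now.Before(b.pausedUntil) {
		return false
	}
	if b.count == 0 || now.Sub(b.windowStart) >= b.window {
		b.windowStart, b.count = now, 0
	}
	b.count++
	if b.count > b.threshold {
		// Threshold crossed: skip this email and all others until the pause ends.
		b.pausedUntil = now.Add(b.pause)
		b.count = 0
		return false
	}
	return true
}

func main() {
	b := &unknownBreaker{threshold: 3, window: demoWindow, pause: demoPause}
	for i := 1; i <= 5; i++ {
		fmt.Printf("email %d to unknown account sent: %v\n", i, b.shouldSend(demoNow))
	}
	fmt.Println("after the pause:", b.shouldSend(demoNow.Add(demoPause)))
}
```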

So yeah, those are the options I know of. None of them are optimal. I think there's a few quick wins such as disabling the sending of emails to unknown accounts, but there are still possible avenues of abuse by e.g. mass registering accounts.

If anyone has better ideas and/or experience with this, I'd highly appreciate it.

Also cross-referencing this issue #138 for more context

mitar commented 2 years ago

Regarding one of the comments made - the issue with recovery / verification is that we currently send out emails to people even if they do not have an account. This is to avoid account enumeration.

Hm, but you could just respond with something like "the recovery e-mail was sent if the user has an account" and then only send the e-mail if the user has an account. Namely, the potential attacker/enumerator who is submitting recovery forms would always get the same response (so they could not deduce if the account exists or not), while the account holder, if they do get an e-mail, receives it in a mailbox that is hopefully not compromised, so the attacker cannot really know if the e-mail really went out or not.

Or are you trying to protect against an attacker who can observe if the server sent out an e-mail or not?

mitar commented 2 years ago

One approach Google has for account recovery is that you have to provide a cell phone number, any cell phone number, not necessarily one associated with the account, to which they send you a confirmation code before they proceed with the recovery. So this is then a bit like a proof of work (the attacker has to do extra work, potentially with some monetary cost), it is harder to automate, and you also get a phone number, so if somebody is doing a massive number of recoveries using the same phone number, you block the phone number.

Of course, what about people without a phone. But it could be a strategy one could enable with Kratos, depending on the user base they are targeting.

aeneasr commented 2 years ago

Regarding one of the comments made - the issue with recovery / verification is that we currently send out emails to people even if they do not have an account. This is to avoid account enumeration.

Hm, but you could just respond with something like "the recovery e-mail was sent if the user has an account" and then only send the e-mail if the user has an account. Namely, the potential attacker/enumerator who is submitting recovery forms would always get the same response (so they could not deduce if the account exists or not), while the account holder, if they do get an e-mail, receives it in a mailbox that is hopefully not compromised, so the attacker cannot really know if the e-mail really went out or not.

Or are you trying to protect against an attacker who can observe if the server sent out an e-mail or not?

That’s true, the issue here though is one of UX. A lot of people have more than one email address or use lists (eg foo+ory@bar.com) and forget what email they used to sign up. Sending an email confirming that the user does not exist helps understand what went wrong. But of course, to prevent email flooding, it would be a viable alternative.

One approach Google has for account recovery is that you have to provide a cell phone number, any cell phone number, not necessarily one associated with the account, to which they send you a confirmation code before they proceed with the recovery. So this is then a bit like a proof of work (the attacker has to do extra work, potentially with some monetary cost), it is harder to automate, and you also get a phone number, so if somebody is doing a massive number of recoveries using the same phone number, you block the phone number.

Of course, what about people without a phone. But it could be a strategy one could enable with Kratos, depending on the user base they are targeting.

In the context of this issue I think that sending SMS would be counterproductive as it imposes a high cost on the provider. Sending SMS is one of the most expensive items for companies doing SMS verification, as every SMS can cost up to 15 cents depending on region. Here we would need even better protection against abuse :/

mitar commented 2 years ago

But of course, to prevent email flooding, it would be a viable alternative.

I think there is not much more information you gain from the fact that an e-mail arrived to one of your e-mail addresses saying "account does not exist" vs. not getting any e-mail to the address you entered in 15 minutes. Then you figure out that you must have used a different e-mail address and you try another one. So in my view here we have a trade off between how quickly can one go over different addresses they might be using vs. allowing email flooding. Once we formulate the issue this way, we can start asking ourselves "OK, are there other ways to speed up figuring out which e-mail address I used instead of allowing email flooding". And there is: allow user to specify multiple e-mail addresses when requesting e-mail recovery. So they can in one step try all of their e-mail addresses. And we say "if any of provided e-mail addresses matches an account, we will send recovery e-mail to it". And we do not send e-mails to addresses we do not know anything about.
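A short Go sketch of that flow, assuming a lookup function stands in for the identity store. `requestRecovery`, `uniformReply` and the `exists` callback are hypothetical names. The caller always receives the same reply, and only addresses that match an account are returned for delivery.

```go
package main

import "fmt"

// uniformReply is shown for every submission, whether or not anything matched.
const uniformReply = "If any of the provided e-mail addresses matches an account, we will send a recovery e-mail to it."

// requestRecovery returns the same reply for every input, plus the addresses
// that should actually receive mail. exists is a hypothetical stand-in for a
// lookup in the identity store.
func requestRecovery(addresses []string, exists func(string) bool) (reply string, recipients []string) {
	seen := make(map[string]bool)
	for _, addr := range addresses {
		if seen[addr] {
			continue // ignore duplicates within one submission
		}
		seen[addr] = true
		if exists(addr) {
			recipients = append(recipients, addr)
		}
	}
	return uniformReply, recipients
}

func main() {
	known := map[string]bool{"work@example.com": true}
	exists := func(addr string) bool { return known[addr] }

	reply, recipients := requestRecovery([]string{"home@example.com", "work@example.com"}, exists)
	fmt.Println(reply)
	fmt.Println("mail goes to:", recipients)
}
```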

Here we would need even better protection against abuse

Yea, but that is easy: not more than 3 recovery attempts per phone number per day and not more than 10 per week or something. So this works better than IP for rate limiting because it is less shared or not at all.
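A minimal Go sketch of such a per-phone-number limit, using the numbers from the comment above (3 per day, 10 per week). The `phoneLimiter` type is hypothetical, keeps attempts in memory, and is not safe for concurrent use. It only decides whether an attempt is allowed and does not address what the SMS provider charges.

```go
package main

import (
	"fmt"
	"time"
)

const day = 24 * time.Hour

// Demo value, used only for illustration.
var demoNow = time.Date(2021, 10, 14, 12, 0, 0, 0, time.UTC)

// limit allows at most max attempts within the given duration.
type limit struct {
	max    int
	within time.Duration
}

// phoneLimiter is a hypothetical sliding-log limiter keyed by phone number.
type phoneLimiter struct {
	limits   []limit // ordered shortest window first
	attempts map[string][]time.Time
}

func newPhoneLimiter() *phoneLimiter {
	return &phoneLimiter{
		limits: []limit{
			{max: 3, within: day},
			{max: 10, within: 7 * day},
		},
		attempts: make(map[string][]time.Time),
	}
}

// allow records and permits the attempt unless one of the limits is reached.
func (l *phoneLimiter) allow(phone string, now time.Time) bool {
	// Forget attempts older than the longest window.
	longest := l.limits[len(l.limits)-1].within
	kept := l.attempts[phone][:0]
	for _, at := range l.attempts[phone] {
		if now.Sub(at) < longest {
			kept = append(kept, at)
		}
	}
	l.attempts[phone] = kept

	for _, lim := range l.limits {
		n := 0
		for _, at := range kept {
			if now.Sub(at) < lim.within {
				n++
			}
		}
		if n >= lim.max {
			return false
		}
	}
	l.attempts[phone] = append(kept, now)
	return true
}

func main() {
	l := newPhoneLimiter()
	for i := 1; i <= 4; i++ {
		fmt.Printf("attempt %d today allowed: %v\n", i, l.allow("+15550100", demoNow))
	}
}
```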

as it imposes a high cost on the provider

Sure, but for some providers (users of Kratos) this might not be too high cost (to prevent abuse and in general misuse of their service they are providing).

So I am not saying this is the only solution which should be here, but it could be something which is available for providers to turn on if for them the cost/benefit ratio is reasonable.

aeneasr commented 2 years ago

I think there is not much more information you gain from the fact that an e-mail arrived to one of your e-mail addresses saying "account does not exist" vs. not getting any e-mail to the address you entered in 15 minutes.

That's true, I was just reiterating points made by one of the most prominent researchers in this area Troy Hunt, who drove home the point that this is how it should work (see links posted above to his blog).

Yea, but that is easy: not more than 3 recovery attempts per phone number per day and not more than 10 per week or something. So this works better than IP for rate limiting because it is less shared or not at all.

The problem is that providers will happily charge you for any attempt of sending an SMS. Regardless whether it's a legitimate destination or not.

Sure, but for some providers (users of Kratos) this might not be too high cost (to prevent abuse and in general misuse of their service they are providing).

Unfortunately I don't quite remember which platform it was (maybe it was Netflix or an app like Tinder) but SMS verification was actually the highest cost item in their total expenses. I can ask the person who told me that to verify that story :)

Sytten commented 2 years ago

I think an option to not send an email on unknown email is a must, but captcha is a good alternative.

ariep commented 1 year ago

That's true, I was just reiterating points made by one of the most prominent researchers in this area Troy Hunt, who drove home the point that this is how it should work (see links posted above to his blog).

I'd say -- also having watched the nice video by Troy Hunt linked to above -- that the essential thing is to not leak in the resulting web page whether the address is registered or not, while sending the email also if it's not registered is more of a convenience, so the client doesn't have to check their spam and/or wait another minute or two before trying their other email address. That last UX advantage you have to balance against the disadvantage of opening up a way to send emails from your IP to any user-provided address.

Troy Hunt proposes to reduce that last problem by adding a captcha, but for some sites I would gladly give up the slight UX convenience if I can then lose the captcha and/or be sure this does not cost me my email reputation.

mitar commented 1 year ago

I have been thinking more about this and I think there should be a per-user option for users to configure this for themselves. Default should probably be that no enumeration of their account should be possible, but users should be able to enable that they are OK with enumeration but want more user friendly error messages. Because not all users are at risk or worry about being identified to have an account, but many users do get confused in sign-in/up flows.

aran commented 7 months ago

Twilio has a set of premium products for addressing SMS Pumping, which is a related attack if sms verification is implemented: https://www.twilio.com/blog/sms-pumping-fraud-solutions. For example, they offer a product called Verify, which is a dedicated API that manages the contact interaction for all verifications.

It would be nice if Kratos could function as the IDP while supporting integration with other services like this Twilio API or future Ory offerings in this domain.

An important consideration is that abuse detection is sometimes "global" in the sense of spanning numerous accounts, email addresses, phone numbers or other contact points, rather than local to a single account. One approach Kratos could take would be to cleanly expose the overall stream of activity to anti-abuse systems, and receive input from those systems to inform its decisions.

For Kratos code, some ideas at different scopes:

  1. Enrich existing support for verifiable addresses. Examples: When constructing UI, include verified-or-not information on verifiable traits. Include UI and data model support for a history of an identity's traits, to support displaying states like when there's a pending verification to a new address. Also support APIs for audit of the overall history of an identity for anti-abuse automation and customer support tooling. Support verification requests in the data model for code send and re-send requests, including a place to store metadata like originating IP addresses and client locale settings. Support continuing directly to a verification flow on web after a settings update.
  2. Expand web hooks and APIs for verification. For example, a webhook dedicated to a verification request that might send an email or not, and that reports back to Kratos on throttling or other anti-abuse control info.
  3. Reduce coupling through courier. The courier and integrated template system is super convenient but there's an issue of assigning responsibility, where the courier bundles responsibility for reliable delivery, templating, translations, and discards semantic information like the purpose and anti-abuse context of messages. A clarified design conceptually would be to view the courier component as the reliable stream of outbound communication from kratos, including full anti-abuse context, and allowing the courier stream to invoke current behaviors like a templated email as convenient quick starts.
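To make ideas 2 and 3 a bit more concrete, here is a hedged Go sketch of what a verification request with anti-abuse context and an external decision hook could look like. Every type, field and function name below is hypothetical and does not exist in Kratos. In practice the `abuseCheck` implementation would sit behind a webhook, which is replaced here by a trivial in-process rule.

```go
package main

import "fmt"

// verificationRequest carries the semantic and anti-abuse context that the
// courier would otherwise discard. All fields are hypothetical.
type verificationRequest struct {
	IdentityID string
	Address    string
	Purpose    string // e.g. "verification", "recovery"
	ClientIP   string
	Locale     string
	Resend     bool
}

// decision is what the anti-abuse system reports back.
type decision struct {
	Send   bool
	Reason string
}

// abuseCheck would be implemented by an external anti-abuse system.
type abuseCheck interface {
	Evaluate(verificationRequest) decision
}

// dispatch consults check first and only calls deliver when sending is allowed.
func dispatch(req verificationRequest, check abuseCheck, deliver func(verificationRequest) error) (decision, error) {
	d := check.Evaluate(req)
	if !d.Send {
		return d, nil
	}
	return d, deliver(req)
}

// denyResends is a toy rule used for the demo: it throttles every re-send.
type denyResends struct{}

func (denyResends) Evaluate(r verificationRequest) decision {
	if r.Resend {
		return decision{Send: false, Reason: "resend throttled"}
	}
	return decision{Send: true}
}

func main() {
	deliver := func(r verificationRequest) error {
		fmt.Println("sending", r.Purpose, "email to", r.Address)
		return nil
	}
	first := verificationRequest{Address: "user@example.com", Purpose: "verification"}
	again := first
	again.Resend = true

	for _, req := range []verificationRequest{first, again} {
		d, err := dispatch(req, denyResends{}, deliver)
		fmt.Printf("send=%v reason=%q err=%v\n", d.Send, d.Reason, err)
	}
}
```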