ory / kratos

Next-gen identity server replacing your Auth0, Okta, Firebase with hardened security and PassKeys, SMS, OIDC, Social Sign In, MFA, FIDO, TOTP and OTP, WebAuthn, passwordless and much more. Golang, headless, API-first. Available as a worry-free SaaS with the fairest pricing on the market!
https://www.ory.sh/kratos/?utm_source=github&utm_medium=banner&utm_campaign=kratos
Apache License 2.0
10.84k stars 936 forks source link

Throttling repeated login requests & account lock #3037

Open atreya2011 opened 1 year ago

atreya2011 commented 1 year ago

Preflight checklist

Context and scope

Goals and non-goals

The design

Based on the discussion here: https://github.com/ory/kratos/issues/654

Example rate-limit config

rate_limit:
  duration: 300 # in seconds
  backoff: 2 # the backoff factor

Example lockout config

lockout:
  max_attempts: 5
  consecutive_login_interval: 5m

APIs

No response

Data storage

Create a new table to store the login history of a specific identity. The schema is as follows:

One identity shall have many login history records One login history record belongs to one identity

Code and pseudo-code

When an identity attempts to log in:

How to track consecutive login attempts:

Degree of constraint

No response

Alternatives considered

No response

aeneasr commented 1 year ago

Thank you for this design proposal! It is well written! I do have some points here:

If the number of login attempts exceeds a specified threshold within the time period, the user will be locked out of the account.

The problem with this is that you can effectively DoS accounts and lock them by trying to repeatedly log in. To avoid this, it is recommended to set a maximum wait time (e.g. 10s) which makes brute forcing impractical, but also prevents DoSing.

If the number of failed attempts exceeds a pre-defined threshold in the config, lock the user's account by updating the active field in the identities table to false.

Which process activates the identity and how does it know whether the rate limiting has disabled it, or whether it was disabled by an administrator? I would recommend to not use the active field for this but instead use the additional table.

If the credentials are valid, insert a new record into the login_history table with the identity_id, login_time, and login_status set to 'success'.

My recommendation would be to have a table with the following layout:

CREATE TABLE selfservice_rate_limits (
  id UUID PRIMARY KEY
  nid UUID
  identity_id UUID NOT NULL REFERENCES ...
  attempts int NOT NULL
  last_attempt TIMESTAMP NOT NULL
)

And then you would do UPDATE ... SET attempts=attempts+1, last_attempt=NOW(), ... depending on how long ago last_attempt was. This will make reading the table much less expensive because we do not need to make a range query SELECT ... FROM rate_limits WHERE identity_id = ... AND login_time LIMIT 15 but instead query a single row SELECT ... FROM selfservice_rate_limits WHERE identity_id=... LIMIT 1.

At scale, range queries are very expensive and it's always better to try and do single selects whenever possible!

atreya2011 commented 1 year ago

@aeneasr

Thank you for your feedback and suggestions!

The problem with this is that you can effectively DoS accounts and lock them by trying to repeatedly log in. To avoid this, it is recommended to set a maximum wait time (e.g. 10s) which makes brute forcing impractical, but also prevents DoSing.

What about a config like this?

rate_limit:
  max_duration: 30s
  backoff_factor: 2
  max_wait_time: 10s
  max_login_attempts: 5
  login_attempt_window: 1m

If we introduce a max_wait_time parameter, the logic for rate limiting would be something like the below, based on the sample configuration above. Let me know if I am going in the right direction πŸ™πŸΌ

Which process activates the identity and how does it know whether the rate limiting has disabled it, or whether it was disabled by an administrator? I would recommend to not use the active field for this but instead use the additional table.

We could use a table with a design as follows:

Option 1

CREATE TABLE lockouts (
    id UUID PRIMARY KEY
    identity_id UUID NOT NULL REFERENCES ...
    lock_time TIMESTAMP NOT NULL,
    reason ENUM('rate_limiting', 'admin_action') NOT NULL
);

it may be beneficial to consider creating a separate table if the relationship between the identities table and the lockouts table is intended to be a one-to-many relationship. An alternative solution could be to incorporate the following columns into the selfservice_rate_limits table, rather than creating a separate one.

Option 2

locked BOOLEAN
lock_time TIMESTAMP NOT NULL,
reason ENUM('rate_limiting', 'admin_action') NOT NULL,

As for the process for unlocking the identity after the lockout, we could add an admin-only endpoint such as identities/{identity_id}/unlock to unlock the account.

My recommendation would be to have a table with the following layout:

CREATE TABLE selfservice_rate_limits (
  id UUID PRIMARY KEY
  nid UUID
  identity_id UUID NOT NULL REFERENCES ...
  attempts int NOT NULL
  last_attempt TIMESTAMP NOT NULL
)

And then you would do UPDATE ... SET attempts=attempts+1, last_attempt=NOW(), ... depending on how long ago last_attempt was. This will make reading the table much less expensive because we do not need to make a range query SELECT ... FROM rate_limits WHERE identity_id = ... AND login_time LIMIT 15 but instead query a single row SELECT ... FROM selfservice_rate_limits WHERE identity_id=... LIMIT 1.

At scale, range queries are very expensive and it's always better to try and do single selects whenever possible!

Thank you for providing the table design and sample query πŸ™‡πŸΌβ€β™‚οΈ I was also considering the same concept.

Could you please provide some clarification regarding the following πŸ™πŸΌ

depending on how long agolast_attemptwas.

aeneasr commented 1 year ago

@aeneasr

Thank you for your feedback and suggestions!

The problem with this is that you can effectively DoS accounts and lock them by trying to repeatedly log in. To avoid this, it is recommended to set a maximum wait time (e.g. 10s) which makes brute forcing impractical, but also prevents DoSing.

What about a config like this?

rate_limit:
  max_duration: 30s
  backoff_factor: 2
  max_wait_time: 10s
  max_login_attempts: 5
  login_attempt_window: 1m

I think this is great! For the first step, I think it makes sense to just choose sensible defaults. Adding more configuration options is something we'll consider if enough people want to change this (for good reason). Maybe there's a blog article from a security expert that can help use choose correct values?

If we introduce a max_wait_time parameter, the logic for rate limiting would be something like the below, based on the sample configuration above. Let me know if I am going in the right direction πŸ™πŸΌ

  • The user will have to wait for 10 seconds before trying to login after each failed login attempt.
  • Assuming the user fails 5 login attempts set by max_login_attempts within a span of 50 seconds, the next failed login attempt within the 1 minute login_attempt_window will result in the user's account being locked out.

I would recommend an linear/exponential backoff with a fixed maxmimum - so something along the lines of:

  1. failed attempt user has to wait 100ms
  2. failed attempt user has to wait 1s
  3. failed attempt user has to wait 2s
  4. failed attempt user has to wait 4s
  5. failed attempt user has to wait 8s
  6. failed attempt user has to wait 16s
  7. failed attempt user has to wait 16s
  8. failed attempt user has to wait 16s
  9. failed attempt user has to wait 16s
  • The user is free to adjust the above parameters depending on the level of security that is desirable for them.

Which process activates the identity and how does it know whether the rate limiting has disabled it, or whether it was disabled by an administrator? I would recommend to not use the active field for this but instead use the additional table.

We could use a table with a design as follows:

Option 1

CREATE TABLE lockouts (
    id UUID PRIMARY KEY
    identity_id UUID NOT NULL REFERENCES ...
    lock_time TIMESTAMP NOT NULL,
    reason ENUM('rate_limiting', 'admin_action') NOT NULL
);

That sounds good! Keep in mind that not all databases support ENUMs.

it may be beneficial to consider creating a separate table if the relationship between the identities table and the lockouts table is intended to be a one-to-many relationship. An alternative solution could be to incorporate the following columns into the selfservice_rate_limits table, rather than creating a separate one.

I think it makes sense to keep the business logic around this in one table. This will help with efficiency!

Option 2

locked BOOLEAN
lock_time TIMESTAMP NOT NULL,
reason ENUM('rate_limiting', 'admin_action') NOT NULL,

Could you elaborate on this a bit more, I'm not sure in what context to put this?

As for the process for unlocking the identity after the lockout, we could add an admin-only endpoint such as identities/{identity_id}/unlock to unlock the account.

That's a great idea! It would be great to keep the naming consistent. So for example use "lock/unlock" everywhere, or "rate limit", or "throttling" or something else. Do you have a preference?

My recommendation would be to have a table with the following layout:

CREATE TABLE selfservice_rate_limits (
  id UUID PRIMARY KEY
  nid UUID
  identity_id UUID NOT NULL REFERENCES ...
  attempts int NOT NULL
  last_attempt TIMESTAMP NOT NULL
)

And then you would do UPDATE ... SET attempts=attempts+1, last_attempt=NOW(), ... depending on how long ago last_attempt was. This will make reading the table much less expensive because we do not need to make a range query SELECT ... FROM rate_limits WHERE identity_id = ... AND login_time LIMIT 15 but instead query a single row SELECT ... FROM selfservice_rate_limits WHERE identity_id=... LIMIT 1. At scale, range queries are very expensive and it's always better to try and do single selects whenever possible!

Thank you for providing the table design and sample query πŸ™‡πŸΌβ€β™‚οΈ I was also considering the same concept.

Could you please provide some clarification regarding the following πŸ™πŸΌ

  • The role of nid in the selfservice_rate_limits table?

This is an internal value which will help (in the future) migrate between ory self-hosted and the ory network.

  • Is it a one-to-one relationship between the identities table and the selfservice_rate_limits table?

I think that makes sense, I don't think we need a 1:n relation for it - 1:n relations also are range queries which are significantly slower than unique (secondary) key reads.

depending on how long agolast_attemptwas.

  • Can we use the max_wait_time in the config to check how long ago the last_attempt was?

I'd probably use fixed values for now, to not complicate the kratos config further

tacurran commented 1 year ago

@atreya2011 from Slack 20230128 "What are Ory's ways of challenging suspicious logins for example: a. logins from new locations b. too many failed attempts c. etc. note from tacurran - this could include blacklisted devices, new credentials, weak credentials

These features will require some advanced data analysis, perhaps ML, and should be included in this issue.

atreya2011 commented 1 year ago

@aeneasr

What about a config like this?

rate_limit:
  max_duration: 30s
  backoff_factor: 2
  max_wait_time: 10s
  max_login_attempts: 5
  login_attempt_window: 1m

I think this is great! For the first step, I think it makes sense to just choose sensible defaults. Adding more configuration options is something we'll consider if enough people want to change this (for good reason). Maybe there's a blog article from a security expert that can help use choose correct values?

https://developers.cloudflare.com/waf/rate-limiting-rules/best-practices/ https://developer.okta.com/docs/reference/rl-best-practices/

I found 2 articles that discuss best practices for rate limiting in Cloudflare WAF and Okta. They emphasize the importance of customizing rate-limiting settings to match the specific needs of a website or API and suggest using a combination of IP and user identity to establish unique rate limits. Both articles advise on how to monitor the effectiveness of rate limiting and make adjustments as needed to prevent DDoS attacks and other malicious traffic. Unfortunately, they don't give specific advice on default values. I will keep searching...

I would recommend an linear/exponential backoff with a fixed maxmimum - so something along the lines of:

  1. failed attempt user has to wait 100ms
  2. failed attempt user has to wait 1s
  3. failed attempt user has to wait 2s
  4. failed attempt user has to wait 4s
  5. failed attempt user has to wait 8s
  6. failed attempt user has to wait 16s
  7. failed attempt user has to wait 16s
  8. failed attempt user has to wait 16s
  9. failed attempt user has to wait 16s

I understand the logic you've presented. Additionally, I have come across two articles that suggest setting a lockout threshold of 10 attempts or more as best practice for account lockout in Active Directory. Based on the logic discussed, it would be advisable to set the threshold at or above 10 to prevent the risk of a malicious user causing a lockout. We could keep the default value of 0 to signify no lockout.

https://www.manageengine.com/products/active-directory-audit/kb/best-practices/account-lockout-best-practices.html#:~:text=The%20recommended%20threshold%20is%2015%20to%2050.

https://www.netwrix.com/account_lockout_best_practices.html

CREATE TABLE lockouts (
    id UUID PRIMARY KEY
    identity_id UUID NOT NULL REFERENCES ...
    lock_time TIMESTAMP NOT NULL,
    reason ENUM('rate_limiting', 'admin_action') NOT NULL
);

That sounds good! Keep in mind that not all databases support ENUMs.

Thank you! If I understand correctly, Kratos utilizes go-buffalo/pop as the underlying package for handling database-related operations. Therefore, it would be appropriate to design the table as a Go struct, correct?

I think it makes sense to keep the business logic around this in one table. This will help with efficiency!

Option 2

locked BOOLEAN
lock_time TIMESTAMP NOT NULL,
reason ENUM('rate_limiting', 'admin_action') NOT NULL,

Could you elaborate on this a bit more, I'm not sure in what context to put this?

I apologize for any confusion. I had suggested adding the lockout-related parameters to the selfservice_rate_limits table, so that all the information would be in one place. I would be grateful to learn your thoughts regarding the following database design.

CREATE TABLE selfservice_rate_limits (
   id UUID PRIMARY KEY
   nid UUID
   identity_id UUID NOT NULL REFERENCES ...
   attempts int NOT NULL
   last_attempt TIMESTAMP NOT NULL
   locked BOOLEAN
   lock_time TIMESTAMP NOT NULL,
   reason varchar NOT NULL,
)

As for the process for unlocking the identity after the lockout, we could add an admin-only endpoint such as identities/{identity_id}/unlock to unlock the account.

That's a great idea! It would be great to keep the naming consistent. So for example use "lock/unlock" everywhere, or "rate limit", or "throttling" or something else. Do you have a preference?

I don't have a particular preference for this. But having an unlock endpoint for an admin to unlock a user account does seem clear and straightforward. Alternatively, we could introduce an endpoint, such as /identities/{identity_id}/rate_limit, and send a PATCH request with the necessary parameters to unlock a user account.

  • The role of nid in the selfservice_rate_limits table?

This is an internal value which will help (in the future) migrate between ory self-hosted and the ory network.

Thank you for the clarification!

  • Can we use the max_wait_time in the config to check how long ago the last_attempt was?

I'd probably use fixed values for now, to not complicate the kratos config further

I understand, that seems perfectly reasonable.

atreya2011 commented 1 year ago

@atreya2011 from Slack 20230128 "What are Ory's ways of challenging suspicious logins for example: a. logins from new locations b. too many failed attempts c. etc. note from tacurran - this could include blacklisted devices, new credentials, weak credentials

These features will require some advanced data analysis, perhaps ML, and should be included in this issue.

@tacurran

Thank you for bringing these points to my attention πŸ™πŸΌ

Regarding point A, we could enhance security by combining Two-factor authentication (2FA), which is already available in Kratos, with monitoring login activity from new locations and setting up alerts / requiring a one-time code input for any unusual behaviour. However, I believe that addressing this point deserves a separate discussion (issue), as it does not directly relate to rate-limiting and account lockout discussed in this design proposal πŸ™‡πŸΌβ€β™‚οΈ

Regarding point B, I have submitted a design proposal in this issue to implement account lockout after a specified number of failed login attempts, as well as rate-limiting consecutive login attempts.

Would you be able to provide additional details regarding point C? Is it related to creating an endpoint to counteract point A, in order for an admin to blacklist devices/IPs and prevent malicious activities?

aeneasr commented 1 year ago

@aeneasr

What about a config like this?

rate_limit:
  max_duration: 30s
  backoff_factor: 2
  max_wait_time: 10s
  max_login_attempts: 5
  login_attempt_window: 1m

I think this is great! For the first step, I think it makes sense to just choose sensible defaults. Adding more configuration options is something we'll consider if enough people want to change this (for good reason). Maybe there's a blog article from a security expert that can help use choose correct values?

https://developers.cloudflare.com/waf/rate-limiting-rules/best-practices/ https://developer.okta.com/docs/reference/rl-best-practices/

I found 2 articles that discuss best practices for rate limiting in Cloudflare WAF and Okta. They emphasize the importance of customizing rate-limiting settings to match the specific needs of a website or API and suggest using a combination of IP and user identity to establish unique rate limits. Both articles advise on how to monitor the effectiveness of rate limiting and make adjustments as needed to prevent DDoS attacks and other malicious traffic. Unfortunately, they don't give specific advice on default values. I will keep searching...

I would recommend an linear/exponential backoff with a fixed maxmimum - so something along the lines of:

  1. failed attempt user has to wait 100ms
  2. failed attempt user has to wait 1s
  3. failed attempt user has to wait 2s
  4. failed attempt user has to wait 4s
  5. failed attempt user has to wait 8s
  6. failed attempt user has to wait 16s
  7. failed attempt user has to wait 16s
  8. failed attempt user has to wait 16s
  9. failed attempt user has to wait 16s

I understand the logic you've presented. Additionally, I have come across two articles that suggest setting a lockout threshold of 10 attempts or more as best practice for account lockout in Active Directory. Based on the logic discussed, it would be advisable to set the threshold at or above 10 to prevent the risk of a malicious user causing a lockout. We could keep the default value of 0 to signify no lockout.

That sounds good to me!

https://www.manageengine.com/products/active-directory-audit/kb/best-practices/account-lockout-best-practices.html#:~:text=The%20recommended%20threshold%20is%2015%20to%2050.

https://www.netwrix.com/account_lockout_best_practices.html

CREATE TABLE lockouts (
    id UUID PRIMARY KEY
    identity_id UUID NOT NULL REFERENCES ...
    lock_time TIMESTAMP NOT NULL,
    reason ENUM('rate_limiting', 'admin_action') NOT NULL
);

That sounds good! Keep in mind that not all databases support ENUMs.

Thank you! If I understand correctly, Kratos utilizes go-buffalo/pop as the underlying package for handling database-related operations. Therefore, it would be appropriate to design the table as a Go struct, correct?

That's correct! Please keep in mind that enum is not supported in all databases, so you'll probably need to implement the enum in Go code :)

I think it makes sense to keep the business logic around this in one table. This will help with efficiency!

Option 2

locked BOOLEAN
lock_time TIMESTAMP NOT NULL,
reason ENUM('rate_limiting', 'admin_action') NOT NULL,

Could you elaborate on this a bit more, I'm not sure in what context to put this?

I apologize for any confusion. I had suggested adding the lockout-related parameters to the selfservice_rate_limits table, so that all the information would be in one place. I would be grateful to learn your thoughts regarding the following database design.

CREATE TABLE selfservice_rate_limits (
   id UUID PRIMARY KEY
   nid UUID
   identity_id UUID NOT NULL REFERENCES ...
   attempts int NOT NULL
   last_attempt TIMESTAMP NOT NULL
   locked BOOLEAN
   lock_time TIMESTAMP NOT NULL,
   reason varchar NOT NULL,
)

That looks very good to me! Please don't forget to add an index over nid, identity_id to make look up fast :) I'm not sure if we need lock_time (as we have last_attempt which should be equal to lock_time) but generally this looks great!

As for the process for unlocking the identity after the lockout, we could add an admin-only endpoint such as identities/{identity_id}/unlock to unlock the account.

That's a great idea! It would be great to keep the naming consistent. So for example use "lock/unlock" everywhere, or "rate limit", or "throttling" or something else. Do you have a preference?

I don't have a particular preference for this. But having an unlock endpoint for an admin to unlock a user account does seem clear and straightforward. Alternatively, we could introduce an endpoint, such as /identities/{identity_id}/rate_limit, and send a PATCH request with the necessary parameters to unlock a user account.

Good question! I think we have several options. From a pure REST perspective, we try to modify resources in plural:

Alternatively we could use verbs like but that would not be fully "REST"y:

I think that "block" and "unblock" are misleading, because "blocking" a user typically involves administrative intervention (user is malicious -> block them) whereas here we throttle/limit user authentication. What do you think of the following:

PATCH /identities/<id>/rate-limits?flow=login

and a table of

CREATE TABLE selfservice_login_rate_limits () ...

or alternatively a table with the design above and an additional field to deal with rate limits in the future for recovery or verification flows:

CREATE TABLE selfservice_rate_limits (
   ...
   flow_type ENUM('login', 'recovery', ...)
   ...
)
supercairos commented 1 year ago

Super excited about this being implemented.

Let me know If you need any help implementing it. @jossbnd and I can help if needed βœ‹

atreya2011 commented 1 year ago

@aeneasr

Thank you for all the feedback! Let me summarize our discussion and, if everything looks good, I'll proceed with the implementation πŸ™‚

Good question! I think we have several options. From a pure REST perspective, we try to modify resources in plural:

  • DELETE /identities/<id>/locks
  • PATCH /identities/<id>/rate-limits
  • PATCH /identities/<id>/throttles

Alternatively we could use verbs like but that would not be fully "REST"y:

  • PATCH /identities/<id>/unlock
  • PATCH /identities/<id>/unblock

I think that "block" and "unblock" are misleading, because "blocking" a user typically involves administrative intervention (user is malicious -> block them) whereas here we throttle/limit user authentication. What do you think of the following:

PATCH /identities/<id>/rate-limits?flow=login

and a table of

CREATE TABLE selfservice_login_rate_limits () ...

or alternatively a table with the design above and an additional field to deal with rate limits in the future for recovery or verification flows:

CREATE TABLE selfservice_rate_limits (
   ...
   flow_type ENUM('login', 'recovery', ...)
   ...
)

I really like the idea of PATCH /identities/<id>/rate-limits?flow=login since, as you pointed out, we can use it as a base for rate-limiting the other flows as well. I have updated the table design based on your feedback, including the removal of the lock_time field and the addition of the flow_type field. Understood regarding implementation of enums in Go code :)

CREATE TABLE selfservice_rate_limits (
    id UUID PRIMARY KEY,
    nid UUID,
    identity_id UUID NOT NULL REFERENCES identities,
    attempts int NOT NULL,
    last_attempt TIMESTAMP NOT NULL,
    locked BOOLEAN,
    reason ENUM('rate_limiting', 'admin_action') NOT NULL,
    flow_type ENUM('login', 'recovery', ...)
);

CREATE INDEX idx_identity_id ON selfservice_rate_limits (identity_id);
CREATE INDEX idx_nid ON selfservice_rate_limits (nid);

I have also simplified the configuration related to rate limiting, based on your feedback. Here is the updated version:

rate_limit:
    max_duration: 30s (default)
    backoff_factor: 2 (default)
    max_attempts_before_lockout: 0 (default)
atreya2011 commented 1 year ago

Super excited about this being implemented.

Let me know If you need any help implementing it. @jossbnd and I can help if needed hand

@supercairos Thank you for reaching out! I am planning to start working on it once I get the final approval from @aeneasr. I would greatly appreciate it if you could provide feedback on my pull request once I submit it. Thank you πŸ™πŸΌ

aeneasr commented 1 year ago

Awesome, that's a great summary! A few more points (but we're close!)

I'm not quite sure if I understand what max_attempts_before_lockout governs, could you briefly explain why we need this configuration setting?

I thought a bit more about the table. What do you think about this:

CREATE TABLE selfservice_rate_limits (
    id UUID PRIMARY KEY,
    nid UUID,
    identity_id UUID NOT NULL REFERENCES identities,
    attempts int NOT NULL,
    next_possible_attempt TIMESTAMP NOT NULL
);

I don't think we need the enums right now, reason currently has only one value (too many login attempts) and flow_type is not yet relevant.

In business logic:

  1. If next_possible_attempt is in the past, then login is permitted
  2. If login fails, attempts is increased by one, and next_possible_attempt is updated to a future timestamp (e.g. in 30 seconds)
  3. If login succeeds, next_possible_attempt is set sufficiently to the past (to prevent clock skew issues) and attempts is set to 0. Alternativley the row could be deleted but that could be problematic with transaction contention and/or multi-region databases.

My recommendation would be the following configuration layout:

selfservice:
  flows:
    login:
      rate_limit:
        max_duration: 30s
        backoff_factor: 2
        # potentially max_attempts_before_lockout

This would make the semantics very clear - this rate limit config is for login :)

atreya2011 commented 1 year ago

@aeneasr

Thank you for the feedback!

While last_attempt is a clear and straightforward option, using next_possible_attempt to determine if a user is allowed to log in works well too. I am flexible and open to either approach.

The purpose of max_attempts_before_lockout is to determine the maximum number of attempts allowed before an account lockout occurs, where login would be disallowed until manually reset by an admin.

We could prevent a malicious user from making too many failed login attempts this way.

The following articles suggest that a value of 10 or higher would be appropriate for max_attempts_before_lockout. We could set the default value to 0, signifying no lockout, and offer the flexibility for the user to adjust it as they see fit in the config :)

https://www.manageengine.com/products/active-directory-audit/kb/best-practices/account-lockout-best-practices.html#:~:text=The%20recommended%20threshold%20is%2015%20to%2050 https://www.netwrix.com/account_lockout_best_practices.html

Please let me clarify how max_attempts_before_lockout would work, by making a slight modification to your proposed business logic and taking into consideration your proposed method for calculating the backoff duration.

I would recommend an linear/exponential backoff with a fixed maxmimum - so something along the lines of:

failed attempt user has to wait 100ms failed attempt user has to wait 1s failed attempt user has to wait 2s failed attempt user has to wait 4s failed attempt user has to wait 8s failed attempt user has to wait 16s failed attempt user has to wait 30s failed attempt user has to wait 30s failed attempt user has to wait 30s

  1. If next_possible_attempt is in the past, then login is permitted.
  2. If login fails, attempts is increased by one, and next_possible_attempt is updated using an exponential backoff algorithm, with a cap at max_duration.
  3. If login succeeds, next_possible_attempt is set to a time sufficiently in the past, and in the case of Go, would be set to time.Time{} (to prevent clock skew issues) and attempts is set to 0.
  4. If attempts exceed max_attempts_before_lockout, the user will be locked out and unable to login again until the admin resets the max_attempts to 0.
  5. The admin could do this, for example, with a PATCH /identities/<id>/rate-limits and a payload of {"attempts": 0} to unlock the user.
aeneasr commented 1 year ago

Thank you very much for the explanation around max_attempts_before_lockout. I think that there are several things missing to make max_attempts_before_lockout work really well:

  1. Automated reporting for the administrator telling them that a user was locked out
  2. A way for the user to contest the lock out (why should I not be able to sign in when someone else nuked my account?)
  3. Convenient way to fetch all locked accounts easily

Since that involves a lot of work, my suggestion is to keep this feature out of scope for now and focus on the pure implementation of this feature. If an administrator wants to lock out a user, they can also set the time quite high (an hour or more) which effectively achieves the same thing. WDYT?

atreya2011 commented 1 year ago

@aeneasr

I understood points 1 and 3.

Automated reporting for the administrator telling them that a user was locked out

We can use the courier in kratos to automatically send an email to the administrator regarding the lockout.

Convenient way to fetch all locked accounts easily

If we add an additional locked BOOL column in the table, it will become easy to query and fetch all locked accounts.

A way for the user to contest the lock out (why should I not be able to sign in when someone else nuked my account?)

Can you confirm if my understanding regarding point 2 is correct? The idea behind point 2 is to provide a mechanism for the user to contest the lock-out if they believe it was unjustified. This could involve:

If an administrator wants to lock out a user, they can also set the time quite high (an hour or more) which effectively achieves the same thing. WDYT?

Yea this is also possible. While manual tracking by an administrator is possible, implementing a max_attempts_before_lockout feature would ensure automatic lockout, making it more convenient for the administrator as they wouldn't need to keep track of multiple failed login attempts from a user.

Since that involves a lot of work, my suggestion is to keep this feature out of scope for now and focus on the pure implementation of this feature.

Thank you for your input! Once I have completed the implementation of the rate-limiting feature, I will take your suggestions into consideration and prepare a revised proposal for implementing the account lockout feature, including all three points you mentioned (if that is okay with you πŸ™‚). Shall I go ahead with the implementation of the rate-limiting feature?

aeneasr commented 1 year ago

Epic, I think we're fully aligned. Thank you for your patience in this process - it definitely helps align on the vision for the feature and will make implementation much easier! Let's go! :)

atreya2011 commented 1 year ago

@aeneasr

Awesome! Likewise, I thoroughly enjoyed the brainstorming process and am grateful for your feedback and approval πŸ™πŸΌ I will proceed with implementing the discussed design πŸ™‡πŸΌ

DEVXIX commented 1 year ago

@aeneasr

Awesome! Likewise, I thoroughly enjoyed the brainstorming process and am grateful for your feedback and approval πŸ™πŸΌ I will proceed with implementing the discussed design πŸ™‡πŸΌ

Hi, I'm kinda interested in the feature, do you guys have any ETA for shipping it?

atreya2011 commented 1 year ago

@DParty I am in the same boat as well. My apologies for not being able to give you an exact estimate at this time. I understand that this may be disappointing, but please know that I am actively working on it. Thank you for your understanding πŸ™πŸΌ

credcore-dave commented 1 year ago

Awesome you are working on this @atreya2011 - I think it'll be a very valuable feature, and the org I work for also needs it. I'm not that experienced with go but if I can help with implementing or testing anything, let me know.

atreya2011 commented 1 year ago

@credcore-dave Thank you! If all goes well I should be able to get a PR up by end of this month :)

Robert-Bosse commented 1 year ago

Hello all. I was reading the conversation going on here and I think what you suggested and refined for throttling repeated requests is a very good approach if a malicious actor is trying to get access to a specific account. Do you also have an idea what to do about credential stuffing where an attacker is using (e.g. pastebin) lists of usernames/emails and passwords to try and check if he can find accounts to take over easily? Since he will only use every username/email once there is another approach needed for this attack vector.

dbhobbs commented 1 year ago

Hello all. I was reading the conversation going on here and I think what you suggested and refined for throttling repeated requests is a very good approach if a malicious actor is trying to get access to a specific account. Do you also have an idea what to do about credential stuffing where an attacker is using (e.g. pastebin) lists of usernames/emails and passwords to try and check if he can find accounts to take over easily? Since he will only use every username/email once there is another approach needed for this attack vector.

@Robert-Bosse I think this issue might be what you're looking for?

vinckr commented 1 year ago

@Robert-Bosse Ory Kratos uses the haveibeenpwned API to check against a db of leaked passwords, you can also host your own version of it. I guess this mostly helps against current/past leaks, not so much against future or secret leaks.
You can also enforce Multifactor AuthN with TOTP or WebAuthn. CAPTCHA or IP blocking only helps a bit against this attack, by far the most effective method is MFA. Ory Network also implements capabilities to fingerprint sessions to devices, that is also pretty effective. /I think the CSRF protection also helps, at least OWASP mentions it.

Happy to continue this in a dedicated discussion, to not derail this issue further

Oscmage commented 1 year ago

What is the status of the implementation for this? I can't find any PR or work in progress. Would love this feature and would happily contribute if assistance is needed.

atreya2011 commented 1 year ago

@Oscmage Work is progressing on this slowly. My apologies for the wait πŸ™‡β€β™‚οΈ

Oscmage commented 1 year ago

@atreya2011 No worries at all! Sorry if it came out like that. Happy to wait/assist :) Was just curious since this was one of the features that we were comparing our current Firebase/Google Identity Platform solution with.

SeanTasker commented 1 year ago

Similar situation here for us. Looking forward to this functionality. We've been trying to solve this problem for a couple of months but there hasn't been an elegant solution externally from Kratos. Is there anything that I could do to provide assistance, we've got a few developers that would be happy to contribute :)

atreya2011 commented 1 year ago

Similar situation here for us. Looking forward to this functionality. We've been trying to solve this problem for a couple of months but there hasn't been an elegant solution externally from Kratos. Is there anything that I could do to provide assistance, we've got a few developers that would be happy to contribute :)

@SeanTasker Personal commitments were slowing down my progress and I am hoping to make a PR this month sometime πŸ™πŸΌ Do you have a fork where your team has been trying to solve this?

SeanTasker commented 1 year ago

No problem at all! And apologies for my slow response. Our internal issue tracker was down and I wanted to check a couple of things as we originally discussed a number of options for our login page. In the end we settled on using this template https://github.com/timalanfarrow/kratos-selfservice-ui-vue3-typescript to build our own login and account creation pages. So originally we tried to solve this with a captcha, however it turned out to be insufficient as we needed to store some state server side.

We haven't forked Kratos (yet) - that's why I am here, to determine if forking will get us closer to a solution sooner. We were thinking about adding some additional hooks that would let us implement additional login restriction checks.

Let me know if we can help. We're not super familiar with the Kratos code base right now so would just need some direction to help us know where to focus our attention.

bradleyball commented 10 months ago

@atreya2011 any update on this functionality? Thanks!

frederikhors commented 8 months ago

I think should be prioritized. We had problems with this in the past days. This is so necessary. We are thinking to change Kratos for this reason. We are very scared.

atreya2011 commented 8 months ago

@bradleyball @frederikhors Progress is plodding on this due to many personal and professional commitments. I am looking at probably the end of this year or hopefully by January 2024 to get a PR up. However, it seems like a new approach is being discussed here? https://github.com/ory/kratos/issues/3580

frederikhors commented 8 months ago

@atreya2011 I think this should be done from Ory Kratos staff. Is not a nice-to-have feature. This is a security MUST.

atreya2011 commented 8 months ago

@frederikhors @aeneasr will carefully review the code before merging it to master so you don’t have to worry about who works on this issue πŸ™‚

frederikhors commented 8 months ago

No, I didn't explain myself correctly. I mean it's urgent to do. Because without it there is a BIG RISK trying a login many times until you find the correct password.

Robert-Bosse commented 8 months ago

@frederikhors Yes. I second that. It is something that is urgently needed.

kmherrmann commented 8 months ago

This is a top priority for the Ory team and we're prioritizing protections. I'll update this issue with any upcoming changes.

frederikhors commented 8 months ago

@kmherrmann any ETA? Even if not exact?

kirill-gerasimenko-da commented 7 months ago

@kmherrmann any ETA? Even if not exact?

It would be great to hear back on this.

Robert-Bosse commented 5 months ago

@kmherrmann any news for our top priority project here?

aeneasr commented 5 months ago

Hello, we have decided not to work on this as part of Ory Kratos. Rate limiting, credentials stuffing, IP rate limiting across multiple nodes, and DoS prevention are very difficult problems to solve (typically cat and mouse type problems) and it makes much more sense to solve them on an operational level with things like Gateway Ratelimiters, JA3 Fingerprinting, Anti Bot detection, API firewalls or services like Cloudflare or Akamai.

We have solved these things in Ory Network using a variety of tools, but that also implies that we are not implementing this in Ory Kratos itself.

Sorry for anyone who was waiting for this here! Thank you for your understanding.

atreya2011 commented 5 months ago

@aeneasr

Thank you for the clarification! Does this mean any PR related to this the feature won’t be accepted? Or does it mean that ORY won’t officially work on this feature? If ORY won’t officially work on this feature but is willing to accept a PR based on the design proposal above, I can pick up my implementation from where I left it off.

frederikhors commented 5 months ago

Hello, we have decided not to work on this as part of Ory Kratos. Rate limiting, credentials stuffing, IP rate limiting across multiple nodes, and DoS prevention are very difficult problems to solve (typically cat and mouse type problems) and it makes much more sense to solve them on an operational level with things like Gateway Ratelimiters, JA3 Fingerprinting, Anti Bot detection, API firewalls or services like Cloudflare or Akamai.

We have solved these things in Ory Network using a variety of tools, but that also implies that we are not implementing this in Ory Kratos itself.

Sorry for anyone who was waiting for this here! Thank you for your understanding.

Wow. This is huge.

I think we should at least add guides on docs for these serious problems.

If not, using Kratos or others is the same at this point.

aeneasr commented 5 months ago

We would certainly accept PRs if the code changes are not too huge and do not require significant maintenance on our end. They should also be scoped to tackle one problem only (eg account locking on repeatedly failed attempts). Anything that is IP related should be out of scope (such as credentials stuffing) in my view as that should be dealt with on a rate limiter service.

atreya2011 commented 5 months ago

@aeneasr

Thank you for the swift response.

That’s good to hear. However, the original design proposal was for the rate limiting feature based on identity ID (not IP address) and then implement account lock in another PR.

Is that still okay?

Robert-Bosse commented 5 months ago

@aeneasr How do we solve credential stuffing by a bot net where no IP is used more than a handful of times? We already had this kind of attack more than once.

frederikhors commented 5 months ago

@aeneasr How do we solve credential stuffing by a bot net where no IP is used more than a handful of times? We already had this kind of attack more than once.

What are you using right now? i mean, Cloudflare? What else?

Robert-Bosse commented 5 months ago

no cloudflare. custom login solution. right now we check ratio of good to bad logins for activating a captcha if the ratio gets too low.

Robert-Bosse commented 5 months ago

@aeneasr https://thehackernews.com/2024/01/microsofts-top-execs-emails-breached-in.html If you have something to check good to bad login ratio you have a chance to see this attack ;)