We were contemplating a strimzi-oauth feature that would cache negative authentication results on the oauth-over-plain path, so that a client credentials flow call to the authentication backend could be avoided if one had occurred recently with a negative result.
The cache would use an LRU policy (to avoid excessive growth) and entries would have a short TTL (say 1000ms), which would prevent stale authentication decisions from lingering for too long.
This would help prevent excessive traffic reaching deep into an organisation's identity infrastructure.
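To make that concrete, here is a rough sketch of the sort of cache I have in mind. Nothing below exists in strimzi-kafka-oauth today; the names and numbers are only illustrative of the LRU bound and short TTL described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: a bounded (LRU) cache of recent *negative*
// authentication results with a short TTL. None of these names exist in
// strimzi-kafka-oauth.
public class NegativeAuthCache {

    private static final int MAX_ENTRIES = 1000; // LRU bound to avoid excessive growth
    private static final long TTL_MS = 1000;     // short TTL so stale decisions expire quickly

    // Access-ordered LinkedHashMap gives simple LRU eviction semantics.
    private final Map<String, Long> failures = new LinkedHashMap<String, Long>(16, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
            return size() > MAX_ENTRIES;
        }
    };

    /** True if an attempt with this key failed within the last TTL_MS. */
    public synchronized boolean isRecentFailure(String cacheKey) {
        Long failedAt = failures.get(cacheKey);
        if (failedAt == null) {
            return false;
        }
        if (System.currentTimeMillis() - failedAt > TTL_MS) {
            failures.remove(cacheKey); // entry has expired, drop it
            return false;
        }
        return true;
    }

    /** Record a negative result so an immediate retry can be short-circuited. */
    public synchronized void recordFailure(String cacheKey) {
        failures.put(cacheKey, System.currentTimeMillis());
    }
}
```

The broker-side oauth-over-plain handler would consult isRecentFailure() before making the client credentials call, and recordFailure() after a definitive rejection.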
Q: why isn't https://cwiki.apache.org/confluence/display/KAFKA/KIP-306%3A+Configuration+for+Delaying+Response+to+Failed+Client+Authentication sufficient?
Answering my own question: connection.failed.authentication.delay.ms is left at its default of 100ms in our case and was working. However, the client had many instances of the application (all misconfigured), so even with the helpful effect of connection.failed.authentication.delay.ms, the combined effect of all the instances failing authentication was still sufficient to "trip" the rate limits I talked about above.
The advantage of the caching idea I discussed above is that it would benefit the case where the application is scaled out to many instances. I think it is worth considering.
If I understand correctly the idea is to implement flood protection at the level of an individual broker, rather than at the level of an individual client connection to the broker. All SASL/PLAIN clients from different IP addresses to the same OAuth over PLAIN enabled listener for the same clientId would fall into a single short-circuit cache entry. Any improperly configured client (from the above set) would affect all clients (from the above set) with the same clientId even if the rest are properly configured.
I would make the key to the cache clientId + secret. In the case where application 1 has a misconfigured secret for principal A and application 2 also uses principal A but has the correct secret, only application 1 is short-circuited to fail.
Ah yes, of course.
I would make the key to the cache clientId + secret
I guess the key would need to be some kind of hash, for security reasons.
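For example, the key could be a digest over the clientId and secret rather than the raw values; something along these lines (class and method names are made up):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Illustrative only: derive the cache key from a digest of clientId and secret
// so the raw secret is never stored in the cache.
class CacheKeys {
    static String cacheKeyFor(String clientId, String clientSecret) throws NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        digest.update(clientId.getBytes(StandardCharsets.UTF_8));
        digest.update((byte) 0); // separator so "ab"+"c" and "a"+"bc" hash differently
        digest.update(clientSecret.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(digest.digest());
    }
}
```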
In general, I find this problematic, because I expect there will be many other situations where this would be counterproductive. Some examples:
- There will be situations where the OAuth server has issues which result in errors. But you do not want these cached to block your clients when the OAuth server issues are fixed.
- OAuth servers with eventually consistent synchronization might have the credentials not yet ready.
So for practical reasons, you will probably be able to cache it only for a very short time. So I'm not convinced this complexity is worth it. It sounds to me like fine-tuning the Kafka delay and your connection limits is a much better solution to this problem.
It's a great point. If twenty clients with the same clientId and secret make auth requests that are proxied to the OAuth server, and one of them glitches, then all clients from the glitched one onwards are affected. It may be, for example, every 20th request that glitches on average, and that could result in practical unavailability of the service.
I guess the key would need to be some kind of hash, for security reasons.
Quite.
There will be situations where the OAuth server has issues which result in errors. But you do not want these cached to block your clients when the OAuth server issues are fixed.
I think you would discriminate by HTTP response code. You would not make a cache entry for 5xx or timeouts, so the next attempt with the same clientId/secret would hit the OAuth server anew.
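As a sketch of that discrimination (hypothetical helper, not existing code): only definite client errors would create a negative entry, and timeouts would skip the cache entirely because this check is never reached on the exception path.

```java
// Illustrative only: decide whether a failed token-endpoint response should create
// a negative cache entry. 4xx means the credentials were definitively rejected;
// 5xx is not cached, so the next attempt reaches the OAuth server again.
// Timeouts surface as exceptions and would bypass this check altogether.
class NegativeCachePolicy {
    static boolean shouldCacheNegativeResult(int httpStatus) {
        return httpStatus >= 400 && httpStatus < 500;
    }
}
```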
OAuth servers with eventually consistent synchronization might have the credentials not yet ready.
True. Users would configure the cache to have a short TTL, so stale values will soon be discarded.
It sounds to me like fine-tuning the Kafka delay
I think you mean connection.failed.authentication.delay.ms. That doesn't help much if application instances are scaled up to many, or many applications share the same service account.
and your connection limits is much better solution to this problem.
I think you mean max.connection.creation.rate. That's too coarse. It throttles all connections regardless of the type of authentication that they use. What I am suggesting here would just target the io.strimzi.kafka.oauth.common.OAuthAuthenticator#loginWithClientSecret() path.
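To illustrate, building on the NegativeAuthCache sketch earlier in this thread, the short-circuit could wrap just that login path. The wrapper below is hypothetical and deliberately avoids reproducing the real OAuthAuthenticator#loginWithClientSecret() signature; the Supplier simply stands in for "delegate to the existing client_credentials call".

```java
import java.util.function.Supplier;

// Hypothetical wrapper, not strimzi-kafka-oauth API. Reuses the NegativeAuthCache
// sketch from earlier in this thread.
class ShortCircuitingLogin {

    private final NegativeAuthCache recentFailures = new NegativeAuthCache();

    <T> T login(String cacheKey, Supplier<T> loginWithClientSecret) {
        if (recentFailures.isRecentFailure(cacheKey)) {
            // A recent attempt with the same clientId/secret already failed:
            // fail fast without another round trip to the token endpoint.
            throw new IllegalStateException("Authentication recently failed for this clientId/secret");
        }
        try {
            return loginWithClientSecret.get();
        } catch (RuntimeException e) {
            // Simplification: a real implementation would only record failures that
            // warrant caching (see the response-code discussion above).
            recentFailures.recordFailure(cacheKey);
            throw e;
        }
    }
}
```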
I understand what you are suggesting. But I'm not convinced the OAuth client is the right place to do this. Even if you care only for some return codes, this is IMHO still something that works for you but probably doesn't provide a general solution for everyone. And even where it works, it seems to solve one niche issue instead of the main issue: that you have clients spamming the Kafka broker with invalid connections. So sooner or later you will likely find some other problems which you need to patch for this.
Another way to address the issue might be for the rate-limiting layer (e.g. Akamai) to take custom headers into account, and we could set the Kafka client's IP address as a custom header on the token request to the OAuth server. The layer could then apply rate limits based on the Kafka client IP, rather than the broker IP. Is that something Akamai (or other providers) can handle?
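For example, something like this on the token request. The header name "X-Client-Ip", the endpoint and the helper class are placeholders, not anything strimzi-oauth does today:

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative only: forward the originating Kafka client's IP on the token request
// so an upstream rate-limiting layer can rate-limit per Kafka client rather than
// per broker IP.
class TokenRequestWithClientIp {
    static HttpResponse<String> requestToken(String tokenEndpoint, String form, String kafkaClientIp)
            throws IOException, InterruptedException {
        HttpClient http = HttpClient.newHttpClient();
        HttpRequest tokenRequest = HttpRequest.newBuilder()
                .uri(URI.create(tokenEndpoint))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .header("X-Client-Ip", kafkaClientIp) // IP of the Kafka client as seen by the broker
                .POST(HttpRequest.BodyPublishers.ofString(form))
                .build();
        return http.send(tokenRequest, HttpResponse.BodyHandlers.ofString());
    }
}
```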
I understand what you are suggesting. But I'm not convinced the OAuth client is the right place to do this. Even if you care only for some return codes, this is IMHO still something that works for you but probably doesn't provide a general solution for everyone.
Client applications spinning in failed re-authentication loops is a really common problem, especially in large organisations where responsibilities for the broker/apps are spread amongst many teams. So I would counter that other Strimzi users will likely see this same problem.
And even where it works, it seems to solve one niche issue instead of the main issue: that you have clients spamming the Kafka broker with invalid connections.
In our use case, we don't have control over the clients. Judging by the number of KIPs which are focused on hardening Kafka against abuse cases, I would say that use case is increasingly common. I see this suggestion as a hardening feature for Strimzi that helps protect the identity system behind it.
So sooner or later you will likely find some other problems which you need to patch for this.
I understand the fear of feature creep, but I think it is only the interaction with the token endpoint that has the potential for this behaviour.
Another way to address the issue might be for the rate-limiting layer (e.g. Akamai) to take custom headers into account, and we could set the Kafka client's IP address as a custom header on the token request to the OAuth server. The layer could then apply rate limits based on the Kafka client IP, rather than the broker IP. Is that something Akamai (or other providers) can handle?
That's an interesting idea. Let me enquire.
Client applications spinning in failed re-authentication loops is a really common problem, especially in large organisations where responsibilities for the broker/apps are spread amongst many teams. So I would counter that other Strimzi users will likely see this same problem.
I'm not saying this isn't an existing problem. I'm saying that this is the wrong place to solve it. You should prevent them from reaching the broker in the first place, because that fixes it for all issues and not just for the niche problem of OAuth over PLAIN. It is also not true that this is limited to this particular place. I expect that the same could happen when introspecting the tokens for example.
I expect that the same could happen when introspecting the tokens for example.
That's very true. If such a solution were implemented, it would only make sense to apply it to all the other places where authentication-related requests to the OAuth server are performed. Those places may not be applicable to your configuration, but others may have a different configuration in place (the use of introspection, the userinfo endpoint ...).
Yes, I agree with you both, any solution would need to be comprehensive and include paths such as introspection.
I am closing this issue whilst we look for resolutions elsewhere. Thanks for the engagement on this issue.
We saw a problem in our Strimzi-based Kafka service that occurred when a Kafka client application using SASL/PLAIN had become misconfigured with the wrong credentials. The application's instances were caught in a tight authentication loop, failing authentication and then quickly retrying. The service configures oauth and oauth-over-plain on the same listener.
The frequency of failed authentication attempts was sufficient to trip the authentication back-end's (Keycloak-based) rate limiting (Akamai), which led to a wider authentication outage (co-located Kafka clusters were affected too).
The service already utilises maxConnectionAttemptsPerSec (max.connection.creation.rate) to protect the Kafka brokers from excessive load. Further tuning maxConnectionAttemptsPerSec to protect the authentication back-end is not really an option, as we cannot predict the number of clients which will take the oauth or the oauth-over-plain route. Tuning to suit both use cases (only the latter needs a trip to the authentication server) is hard.