twmb / franz-go

franz-go contains a feature complete, pure Go library for interacting with Kafka from 0.8.0 through 3.7+. Producing, consuming, transacting, administrating, etc.
BSD 3-Clause "New" or "Revised" License
1.78k stars 182 forks source link

Need to "re-assume the IAM role when re-authenticating" to avoid "Session too short" errors on MSK #731

Closed saholman closed 4 months ago

saholman commented 4 months ago

We are regularly unable to produce to our MSK Kafka cluster because we are getting back "Session too short: SASL_AUTHENTICATION_FAILED: SASL Authentication failed" errors returned from the ProduceSync method. @twmb I know you clarified in https://github.com/twmb/franz-go/issues/249 that this was because AWS was returning the same session, even if it had <2s left. We asked AWS about this and got the following response:

"Regarding your question about if MSK will return a session that is expected to expire within 2 seconds, the internal team has confirmed that yes, if the same credentials are used to authenticate then the current session will be returned even if it is about to expire. If you are re-assuming the IAM role while authenticating this should not happen."

Can franz-go be updated to "re-assume the IAM role while authenticating" so that MSK clients don’t run into regular "Session too short" errors?

Thanks!

saholman commented 4 months ago

We are using franz-go v1.11.0. We are working on upgrading to and testing with franz-go v1.16.1.

twmb commented 4 months ago

For context, this is the chunk of code: https://github.com/twmb/franz-go/blob/master/pkg/kgo/broker.go#L938-L981 It hasn't changed in 2yr (minus a few lines in there that are unrelated). The version you're using already includes the changes -- there is no longer a 2s lower bound, it's more closer to 1s, and the client doesn't auto-reject on timeout, it always allows at least one request to go through (that was this commit) The error you're posting looks like it is coming from Amazon directly, though. It looks like Amazon itself is rejecting the franz-go request because the session -- although it is not expired -- does not have enough lifetime left? I don't get that.

For comparison, the Kafka code for determining lifetime is here: https://github.com/apache/kafka/blob/bef83ce89bb6a869cf884b37d5b5f86d8f243488/clients/src/main/java/org/apache/kafka/common/security/authenticator/SaslClientAuthenticator.java#L683-L700

Kafka's code blindly chooses between 85% and 95% of the returned lifetime, meaning if there is 1s left to the lifetime, Kafka will reauthenticate after ~900ms. If there are 10s left, Kafka reauthenticates after ~9s.

franz-go reauths based on the lifetime minus some "pessimism". For large lifetimes, between 2% and 5% of the lifetime is used -- meaning if the lifetime is 10s, franz-go will reauth after ~9.65s. For small lifetimes, a minimum pessimism of 1s is used (or 1.1x the request latency, if it is larger) -- meaning if the lifetime is 1s, franz-go reauthenticates immediately after one request goes through. In practice, the behavior should be roughly the same. I don't know what Amazon is doing on their side, but the error you're seeing is new behavior by Amazon considering the code path in franz-go that is running into this hasn't changed in 2yr.

saholman commented 4 months ago

I'm not convinced that something necessarily changed on the AWS side in the last two years because we have been seeing these "Session too short" errors for well over a year. We just didn't realize they were impactful and causing failures because our fallback mechanism was kicking in.

If I understand correctly, AWS does reject sessions even before they are expired. You can see the details here: https://github.com/twmb/franz-go/issues/249#issuecomment-1357853449. I'm not sure if that "100ms" value is still used or if they have increased that number.

I will have them review your response here and either post on their behalf or have them respond if they are willing.

I admit this seems more like a bug or inconsistency on the AWS side, but AWS has made it clear that "This behavior is not considered to be a bug by the MSK internal team, it is expected behavior. To avoid the session too short error it is suggested to re-assume the IAM role when re-authenticating."

twmb commented 4 months ago

If I'm not mistaken, re-assuming the IAM role is done automatically if you follow the example here: https://github.com/twmb/franz-go/blob/2bc29e65e9e037b1269013be8918d3b7724e4691/examples/sasl/aws_msk_iam/main.go#L40-L51 Credentials are never cached, so any time the client internally goes through the SASL flow, whatever is returned from the Authenticate function (which is the credential loading: https://github.com/twmb/franz-go/blob/2bc29e65e9e037b1269013be8918d3b7724e4691/pkg/sasl/aws/aws.go#L82-L99) is what is used at that moment to authenticate.

Basically what it sounds like is that AWS is returning that SASL credentials can be used for maybe 1s, but there's enough delay in going through and issuing a request that when the request is finally issued and received by the broker, the SASL lifetime is <100ms left and the request is rejected.

If the above^^ is the case, then there really is nothing that can be done by the client. My original implementation was to refuse SASL credentials with a <5s lifetime specifically because we can't trust network RTT during latency spikes, but I removed that due to #249. It's unreasonable for Amazon to a 1s lifetime and then expect a request to make it to the broker within that 1s. There's also literally no downside for Amazon to just return a new session whenever the old one has less than, say, 5s left.

twmb commented 4 months ago

I'm going to close this for now, but let me know if AWS gets back or if you have an idea of what to improve within the client. FWIW it behaves pretty closely to Kafka now, so I'd expect these same errors on the Java client?

saholman commented 3 months ago

Thanks. We are still waiting to hear back from AWS on this. I will update here once I hear from them.

saholman commented 2 months ago

AWS provided the following direction. "We recommend you to ... use different session name during re-authentication when using your go library. Please be cautious that you are not are not creating too many sessions as that may impact their CPU usage."

We updated the closure passed to aws.ManagedStreamingIAM to cycle through 5 different session names as it authenticates. This has completely resolved the "Session too short" issue as far as we can tell.

Thanks for your patience and quick responses! They were a huge help in getting to the right solution.

AnastasiyaDidych commented 2 months ago

Hey @saholman, my team is facing the same issue Can you please post a code example of the solution please

Thanks in advance!