This PR extends the token task to also recreate delegation tokens before they expire.
It exposes a new config option, delegation_token_lifetime_seconds. In the docs we give an example of
delegation_token_lifetime_seconds: 86400 # 1 day
which is the recommended lifetime and the one we will be using internally.
This config option was driven largely by my wanting a way to integration test this functionality without leaving a test running for 24 hours.
But I think it's also going to be valuable for end users to tweak this value.
The logic for queueing tokens that need to be recreated is isolated in a RecreateTokenQueue type.
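To make the idea concrete, here is a minimal std-only sketch of a deadline-ordered queue. It is not the RecreateTokenQueue from this PR: the method names, the use of a username as the key, and the choice to recreate at half the configured lifetime are all assumptions made for illustration.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the PR's RecreateTokenQueue: a min-heap of
/// (recreation deadline, username) pairs, earliest deadline first.
pub struct RecreateTokenQueue {
    heap: BinaryHeap<Reverse<(Instant, String)>>,
    /// How long after creation a token is scheduled for recreation.
    recreate_after: Duration,
}

impl RecreateTokenQueue {
    /// `lifetime` mirrors delegation_token_lifetime_seconds. Recreating at
    /// half the lifetime is an assumption of this sketch, not the PR's policy.
    pub fn new(lifetime: Duration) -> Self {
        Self {
            heap: BinaryHeap::new(),
            recreate_after: lifetime / 2,
        }
    }

    /// Schedule recreation for a token that was created at `created_at`.
    pub fn push(&mut self, username: String, created_at: Instant) {
        self.heap
            .push(Reverse((created_at + self.recreate_after, username)));
    }

    /// Deadline of the next token to recreate, if any is queued.
    pub fn next_deadline(&self) -> Option<Instant> {
        self.heap.peek().map(|Reverse((at, _))| *at)
    }

    /// Pop the next username whose deadline is at or before `now`.
    pub fn pop_due(&mut self, now: Instant) -> Option<String> {
        if matches!(self.next_deadline(), Some(at) if at <= now) {
            self.heap.pop().map(|Reverse((_, username))| username)
        } else {
            None
        }
    }
}
```

With a lifetime of 86400 seconds, a token pushed with creation time t becomes due at t + 43200 seconds under this sketch's assumed policy. Keeping the ordering in one type means the task only has to ask for the next deadline.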
We then use a tokio select to await both the recreation queue and transform token requests.
Because both futures are awaited inside a select, both must be cancellation safe: select drops the future of whichever branch does not complete first. Both are cancellation safe, so nothing is lost when the other branch wins.
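The shape of that loop can be sketched without tokio. The following is a blocking, std-only analogue, not the async code in this PR: it waits for whichever comes first, an incoming request or the next recreation deadline. The Event enum and next_event function are hypothetical names. It also shows the property that cancellation safety protects in the async version: a request that is not received stays in the channel, not lost.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Instant;

/// What the task should handle next (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum Event<R> {
    /// A token request arrived before the next deadline.
    Request(R),
    /// The next recreation deadline passed with no request pending.
    RecreateDue,
    /// All request senders were dropped.
    Shutdown,
}

/// Block until a request arrives or `next_deadline` passes.
/// With no deadline queued, wait for requests only.
pub fn next_event<R>(requests: &Receiver<R>, next_deadline: Option<Instant>) -> Event<R> {
    match next_deadline {
        None => match requests.recv() {
            Ok(request) => Event::Request(request),
            Err(_) => Event::Shutdown,
        },
        Some(deadline) => {
            // A deadline already in the past yields a zero timeout.
            let timeout = deadline.saturating_duration_since(Instant::now());
            match requests.recv_timeout(timeout) {
                Ok(request) => Event::Request(request),
                Err(RecvTimeoutError::Timeout) => Event::RecreateDue,
                Err(RecvTimeoutError::Disconnected) => Event::Shutdown,
            }
        }
    }
}
```

In the async version the two waits are separate futures and select drops the one that loses, which is why each must tolerate being dropped mid-wait.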
A new metric is added:
shotover_kafka_delegation_token_creation_seconds
I believe it's important to include the time taken as a metric: if token creation time starts growing in production it can cause connection creation timeouts, and this metric will help diagnose such cases.