Open timothyb89 opened 2 years ago
This behaviour also occurs if the test call to Ping() on the newly created admin client fails.
I think network flakiness is much more likely to lead to this situation, given that the new client is tested before the certificates are persisted.
I see two options:
I agree that adding a rollback mechanism for the generation counter is likely to be painful and risks compromising the purpose of the generation counter.
Option 1 seems like the easiest thing to implement to recover from minor network flakiness. Realistically, we could just retry here for up to a minute or so; exiting the application early just leaves everything in an unrecoverable state.
Option 2 seems more interesting down the road, and helps guard against outages longer than we can retry for within the application (essentially up to the TTL of the certificates themselves).
Expected behavior:
When the bot tries to renew the primary certificate but fails to write the new certificate data to disk, it should fail gracefully and allow the user to fix the underlying problem.
Current behavior:
Failure to write certificates means the bot crashes instantly without writing data. When it restarts, it'll attempt to fetch new certs and immediately trigger a generation mismatch, locking the bot. This UX is not great.
We do test destinations to ensure they're writable, but this isn't fully atomic (e.g. #13227). Maybe we could add some sort of rollback mechanism to decrement the generation counter? (This seems scary.)
Bug details: