gravitational / teleport


Machine ID's primary certificate renewal is not atomic enough #13228

Open timothyb89 opened 2 years ago

timothyb89 commented 2 years ago

Expected behavior:

When the bot tries to renew the primary certificate but fails to write the new certificate data to disk, it should fail gracefully and allow the user to fix the underlying problem.

Current behavior:

Failure to write the certificates means the bot crashes immediately without persisting the new data. When it restarts, it attempts to fetch new certs and immediately triggers a generation mismatch, locking the bot. This UX is not great.

We do test destinations to ensure they're writable, but this isn't fully atomic (e.g. #13227). Maybe we could add some sort of rollback mechanism to decrement the generation counter? (This seems scary.)
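
To make the failure window concrete, here is a rough sketch of the renewal sequence as described above. All of the names (Identity, AuthClient, Destination, generateIdentity, etc.) are hypothetical stand-ins, not tbot's actual API; the point is that the generation counter is incremented at the first step, while nothing is persisted until the last one.

```go
package tbotsketch

import "context"

// Hypothetical stand-ins for tbot's real types; illustrative only.
type Identity struct{ TLSCert, Key []byte }

type AuthClient interface {
	Ping(ctx context.Context) error
}

type Destination interface {
	Write(ctx context.Context, id *Identity) error
}

// renewPrimary sketches the sequence under discussion. Everything after
// generateIdentity runs with the server-side generation counter already
// incremented, so any failure before Write leaves nothing on disk and the
// next start trips a generation mismatch.
func renewPrimary(
	ctx context.Context,
	generateIdentity func(context.Context) (*Identity, error),
	newClient func(context.Context, *Identity) (AuthClient, error),
	dest Destination,
) error {
	id, err := generateIdentity(ctx)
	if err != nil {
		return err // nothing renewed yet; safe to retry later
	}
	client, err := newClient(ctx, id)
	if err != nil {
		return err // counter already incremented, nothing persisted
	}
	if err := client.Ping(ctx); err != nil {
		return err // same window: a transient network error strands the bot
	}
	// Only now is the renewed identity written to disk; a failure here also
	// strands the bot (the scenario in the issue description).
	return dest.Write(ctx, id)
}
```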

Bug details:

strideynet commented 2 years ago

This behaviour is also produced if the test call to Ping() on the newly created admin client fails.

strideynet commented 1 year ago

I think a flaky network connection is the most likely way to end up in this situation, since the new client is tested before the certificates are persisted.

I see two options:

  1. Introduce some retries into the process of building the new admin client. This should give us some resilience against minor network connection issues.
  2. Write the new certificates to disk immediately, with some kind of label marking them as secondary. If creating the new auth client succeeds, write them to disk as primary. When loading tbot, if authentication fails with the primary certificates, fall back to the secondary certificates; if those succeed, promote them to primary.

I agree that adding a rollback mechanism for the generation counter is likely to be painful and risks compromising its purpose.

Option 1 seems like the thing we could most easily implement to recover from minor network flaking. Realistically, we could just retry here for up to a minute or so; exiting the application early just leaves everything in an irrecoverable state.
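
As a rough illustration of option 1, reusing the hypothetical types from the sketch above, the client construction and Ping check could be wrapped in a bounded retry loop. The attempt count, backoff, and helper names here are assumptions, not tbot's actual implementation.

```go
import (
	"context"
	"fmt"
	"time"
)

// buildClientWithRetry retries building and pinging the new client for a
// bounded window before giving up, so a brief network flake doesn't strand
// the bot mid-renewal.
func buildClientWithRetry(
	ctx context.Context,
	id *Identity,
	newClient func(context.Context, *Identity) (AuthClient, error),
) (AuthClient, error) {
	const (
		attempts = 6
		backoff  = 10 * time.Second // roughly a one-minute window in total
	)
	var lastErr error
	for i := 0; i < attempts; i++ {
		client, err := newClient(ctx, id)
		if err == nil {
			if err = client.Ping(ctx); err == nil {
				return client, nil
			}
		}
		lastErr = err
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(backoff):
		}
	}
	return nil, fmt.Errorf("building renewed client failed after %d attempts: %w", attempts, lastErr)
}
```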

Option 2 seems more interesting down the road, and would guard against outages longer than we can retry for within the application (essentially up to the TTL of the certificates themselves).
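
A minimal sketch of how option 2's primary/secondary fallback could look at startup, again using the hypothetical types above; readPrimary, readSecondary, and promote are assumed helpers, not existing tbot functions.

```go
import (
	"context"
	"fmt"
)

// loadIdentity tries the primary certificates first and falls back to the
// secondary copy left over from an interrupted renewal. If the secondary
// copy authenticates, it is promoted to primary.
func loadIdentity(
	ctx context.Context,
	readPrimary, readSecondary func(context.Context) (*Identity, error),
	promote func(context.Context, *Identity) error, // rewrite secondary as primary
	newClient func(context.Context, *Identity) (AuthClient, error),
) (AuthClient, error) {
	if id, err := readPrimary(ctx); err == nil {
		if client, err := newClient(ctx, id); err == nil && client.Ping(ctx) == nil {
			return client, nil
		}
	}
	// Primary failed: try the secondary certificates.
	id, err := readSecondary(ctx)
	if err != nil {
		return nil, fmt.Errorf("no usable identity on disk: %w", err)
	}
	client, err := newClient(ctx, id)
	if err != nil {
		return nil, err
	}
	if err := client.Ping(ctx); err != nil {
		return nil, err
	}
	// The secondary certificates work, so make them the primary going forward.
	if err := promote(ctx, id); err != nil {
		return nil, err
	}
	return client, nil
}
```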