superfly / flyctl

Command line tools for fly.io services
https://fly.io
Apache License 2.0

Better recovery #3709

Closed billyb2 closed 1 week ago

billyb2 commented 1 month ago

Change Summary

What and Why: This PR changes how machine updates work. Instead of failing as soon as one machine isn't able to update, we try again, taking care not to redo any work that has already completed. We do this by keeping track of 'app state', which is basically a list of all the machines in the app. This works well because it includes machine states, machine configs, and mounts.
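To illustrate the idea of not redoing completed work, here is a minimal sketch of comparing a current app state against a target one. It is not the code in plan.go: the `Machine`, `AppState`, and `pendingUpdates` names are hypothetical, and `Image` stands in for the full machine config.

```go
package main

// Machine is a hypothetical, simplified view of one machine in the app.
type Machine struct {
	ID     string
	State  string   // e.g. "started" or "stopped"
	Image  string   // stands in for the full machine config
	Mounts []string // names of attached volumes
}

// AppState is the list of all machines in the app at one point in time.
type AppState struct {
	Machines []Machine
}

// equal reports whether two machines have the same state, config, and mounts.
func (m Machine) equal(o Machine) bool {
	if m.ID != o.ID || m.State != o.State || m.Image != o.Image || len(m.Mounts) != len(o.Mounts) {
		return false
	}
	for i := range m.Mounts {
		if m.Mounts[i] != o.Mounts[i] {
			return false
		}
	}
	return true
}

// pendingUpdates returns the machines in target that current does not match yet.
// Machines that already match are skipped, so a retry does not redo finished work.
func pendingUpdates(current, target AppState) []Machine {
	byID := make(map[string]Machine, len(current.Machines))
	for _, m := range current.Machines {
		byID[m.ID] = m
	}
	var pending []Machine
	for _, want := range target.Machines {
		have, ok := byID[want.ID]
		if !ok || !have.equal(want) {
			pending = append(pending, want)
		}
	}
	return pending
}
```

Under these assumptions, a retry would re-read the current state and call `pendingUpdates` again, so a machine that was updated before the failure no longer shows up in the result.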

Some errors aren't worth retrying. For example, if we can't acquire a lease on your machine because it's already held, then retrying most likely won't do any good. In that case, we return an 'unrecoverable error' and fail the deploy completely. In certain cases we could try rolling back to the previous state (for example, if health checks fail we could roll back), though I'm saving that implementation for a future PR.

How: We keep trying to finish a deploy until one of the following happens:
a. the deploy completes successfully, meaning we transitioned from the old app state to the new app state
b. we've exhausted a certain number of attempts to complete a deploy without success
c. we encountered an unrecoverable error, meaning that flyctl doesn't think it's worth attempting a retry
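The three exit conditions can be sketched as a bounded retry loop. This is an illustration under assumptions, not the implementation in plan.go: `deployWithRetries`, `errUnrecoverable`, and `errTransient` are hypothetical names, and the attempt limit is a parameter because the PR does not state the actual number.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnrecoverable (hypothetical) marks failures that are not worth retrying.
var errUnrecoverable = errors.New("unrecoverable error")

// errTransient (hypothetical) stands in for an intermittent platform error.
var errTransient = errors.New("transient platform error")

// deployWithRetries calls attempt with the attempt number, starting at 1,
// until one of the three exit conditions (a), (b), or (c) is reached.
func deployWithRetries(maxAttempts int, attempt func(n int) error) error {
	if maxAttempts < 1 {
		return errors.New("maxAttempts must be at least 1")
	}
	var lastErr error
	for n := 1; n <= maxAttempts; n++ {
		err := attempt(n)
		if err == nil {
			return nil // (a) the deploy completed successfully
		}
		if errors.Is(err, errUnrecoverable) {
			// (c) retrying would not help, so fail the deploy now
			return fmt.Errorf("attempt %d: %w", n, err)
		}
		lastErr = err // retryable failure: go around again
	}
	// (b) every attempt failed with a retryable error
	return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, lastErr)
}
```

Passing the attempt as a function keeps the loop independent of how a single deploy pass is carried out; each pass would be expected to pick up only the work that is still outstanding.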

In the first case, flyctl is much more resilient to intermittent platform errors, which can happen pretty often. In the second case, it's likely that something is wrong with either the user's environment or the platform itself. If it's the user's environment, then @rugwirobaker's work to move orchestration logic into a new fly machine will help with those cases in the future. If it's the platform, we're working on setting up alerting so that we learn about these cases sooner. In the third case, the goal is to eventually add suggestions that help the user recover from these sorts of issues.

The bulk of the code is in internal/command/deploy/plan.go

Related to:
https://flyio.discourse.team/t/deployments-roadmap-discussion/6326
https://flyio.discourse.team/t/deployments-roadmap-redux/6451
https://flyio.discourse.team/t/deployment-recoverability/6441


billyb2 commented 2 weeks ago

^ btw I'm working on adding unit tests to plan.go now. I'll make sure to have them up and reviewed before I merge.