What and Why:
This PR changes how machine updates work. Instead of failing the deploy whenever a single machine can't be updated, we now retry, taking care not to redo any work that has already completed. We do this by tracking 'app state': a snapshot of every machine in the app, including each machine's state, config, and mounts.
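To make the idea concrete, here is a minimal sketch of what an 'app state' and a "remaining work" computation could look like. The type and field names are illustrative assumptions, not flyctl's actual types from internal/command/deploy/plan.go:

```go
package main

import "fmt"

// machineState is a hypothetical stand-in for one machine's tracked state.
type machineState struct {
	ID     string
	State  string            // e.g. "started", "stopped"
	Config map[string]string // stand-in for the machine config
	Mounts []string          // stand-in for attached volumes
}

// appState is the list of all machines in the app, as described above.
type appState struct {
	Machines []machineState
}

// pending returns the IDs of machines whose current state differs from the
// desired one, i.e. the work left to do after a partial deploy. Comparing
// states this way is what lets a retry skip already-completed machines.
func pending(current, desired appState) []string {
	currentByID := map[string]machineState{}
	for _, m := range current.Machines {
		currentByID[m.ID] = m
	}
	var remaining []string
	for _, want := range desired.Machines {
		got, ok := currentByID[want.ID]
		if !ok || got.State != want.State {
			remaining = append(remaining, want.ID)
		}
	}
	return remaining
}

func main() {
	current := appState{Machines: []machineState{
		{ID: "m1", State: "started"},
		{ID: "m2", State: "stopped"},
	}}
	desired := appState{Machines: []machineState{
		{ID: "m1", State: "started"},
		{ID: "m2", State: "started"},
	}}
	// m1 already matches the desired state, so only m2 remains.
	fmt.Println(pending(current, desired))
}
```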
Some errors aren't worth retrying, though. For example, if we can't acquire a lease on a machine because something else already holds it, retrying is unlikely to help. In that case, we return an 'unrecoverable error' and fail the deploy outright. In some of these cases we could roll back to the previous app state (for example, when health checks fail), but I'm saving that implementation for a future PR.
How:
We keep retrying the deploy until one of the following happens:
a. the deploy completes successfully, meaning we transitioned from the old app state to the new app state
b. we've exhausted a certain number of attempts to complete a deploy without success
c. we encountered an unrecoverable error, meaning that flyctl doesn't think it's worth attempting a retry
In the first case, flyctl becomes much more resilient to intermittent platform errors, which happen fairly often. In the second case, something is likely wrong with either the user's environment or the platform itself. If it's the former, @rugwirobaker's work to move orchestration logic into a new fly machine will help with those cases in the future. If it's the latter, we're setting up alerting so we learn about these cases sooner. In the third case, the goal is to eventually suggest steps the user can take to recover from these sorts of issues.
The bulk of the code is in internal/command/deploy/plan.go
Related to:
https://flyio.discourse.team/t/deployments-roadmap-discussion/6326
https://flyio.discourse.team/t/deployments-roadmap-redux/6451
https://flyio.discourse.team/t/deployment-recoverability/6441