Closed ragesoss closed 3 years ago
Hey there @ragesoss - sorry to hear about the upgrade failure 😞
It looks like you aren't completely out of luck. At the beginning of the upgrade Waypoint takes a snapshot of your server. I think you should be able to recover your server by the following:
nomad instead)
Please give this a shot and let us know how it worked out! Thank you ❤️ 🙏🏻
There are a couple of enhancements we can make as a team here that would make this better:
It looks like the actual server upgrade was successful; the issue was that we failed to connect to the runner. If I remember right, the server upgrade command might not save its server context for the CLI until the end of the command. We should save the server context immediately after the server is successfully upgraded. If we had done that, I think you would have been able to run things like waypoint version or install a new runner, etc.
Next, we should add more docs to the server upgrade page to include information about runners: https://www.waypointproject.io/docs/upgrading
Thanks @briancain! I couldn't figure out how to uninstall the existing server, but I was able to run the install again (basically, by manually executing the original provisioner local-exec command that was used to install it in the first place, with the same environment variables). This put the system back in a working state, with both server and runner jobs showing up in Nomad. After that, I was able to run waypoint server restore waypoint-server-snapshot-1629760461 successfully to restore the snapshot.
I've now run the upgrade command again, but it failed in the same way ("connection reset by peer" while trying to retrieve new auth token for runner).
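The recovery described above (reinstall the server, then restore the snapshot taken at the start of the upgrade) depends on knowing which snapshot file to restore. The following is only a minimal sketch of that lookup, not an official procedure. It assumes the snapshot files sit in one directory and follow the waypoint-server-snapshot-<timestamp> naming seen in this thread. The helper names latest_snapshot and restore_command are hypothetical. The sketch only prints the waypoint server restore command rather than running it.

```shell
# Sketch only: locate the newest upgrade snapshot and print the restore command.
# Assumption: snapshot files are named waypoint-server-snapshot-<unix timestamp>,
# as in the thread above, and live in the directory passed as $1 (default: ".").

latest_snapshot() {
  dir="${1:-.}"
  best=""
  best_ts=0
  for f in "$dir"/waypoint-server-snapshot-*; do
    [ -e "$f" ] || continue          # glob matched nothing
    ts="${f##*-}"                    # text after the last dash
    case "$ts" in
      ''|*[!0-9]*) continue ;;       # skip names without a numeric suffix
    esac
    if [ "$ts" -gt "$best_ts" ]; then
      best="$f"
      best_ts="$ts"
    fi
  done
  if [ -n "$best" ]; then
    printf '%s\n' "$best"
  fi
}

restore_command() {
  snap="$(latest_snapshot "${1:-.}")"
  if [ -z "$snap" ]; then
    echo "no snapshot found in ${1:-.}" >&2
    return 1
  fi
  # Printed, not executed, so it can be reviewed before running.
  echo "waypoint server restore $snap"
}
```

Running restore_command in the directory where the upgrade was started prints the command to review. As noted above, the restore only succeeded once the server had been reinstalled and was reachable again.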
Hey @ragesoss - I was able to reproduce this when upgrading the Waypoint server and runners in Nomad. My runner didn't fail during the upgrade, but it fails immediately after it gets scheduled in Nomad, with a similar error.
I'm glad the restore worked, though! I'm not sure why the context is invalid; after looking at the server upgrade CLI, it saves the context at the spot I expected. Anyway, we will take a look at this issue. Thanks for reporting it and providing all of this debug information!
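Two different symptoms appear in this thread: no waypoint-runner job listed in Nomad at all, and a runner job that is scheduled and then fails. As a rough way to tell them apart from the command line, the sketch below reads the table printed by nomad job status and reports the status of one job. It assumes the columns are ID, Type, Priority, Status, Submit Date. The helper name job_state is hypothetical. A status of "running" does not by itself prove the runner is healthy.

```shell
# Sketch only: report the Status column for one job from `nomad job status`
# output on stdin, or "missing" if the job is not listed.
# Assumption: the table columns are ID, Type, Priority, Status, Submit Date.
job_state() {
  job="${1:-waypoint-runner}"
  awk -v job="$job" '
    $1 == job { print $4; found = 1 }
    END { if (!found) print "missing" }
  '
}
```

A possible use is nomad job status | job_state waypoint-runner, which prints "missing" in the situation described in the original report.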
I have a hunch this is related to https://github.com/hashicorp/waypoint/issues/1348, i.e. when Nomad schedules a new server job, it loses the volume from the previous server. 🤔
We will use this issue to update our docs, and issue 1348 to address the underlying bug.
Closing this issue because the documentation has been updated, and the original problem is resolved. :)
Describe the bug
The automatic server upgrade process can fail and leave the system in an unusable state, but the error message and upgrade documentation don't provide enough info on how to recover from the situation.
In particular, if the automatic upgrade process (waypoint server upgrade -platform=nomad -auto-approve) fails after upgrading the server but before setting up the new Waypoint runner job (at the "Retrieving new auth token for runner" step), the upgrading documentation does not provide enough guidance on how to uninstall the newly installed server and restore the snapshot (or how to retry placing the runner job, or some other way to recover).
Steps to Reproduce
Run waypoint server upgrade -platform=nomad -auto-approve and make it fail at the "Retrieving new auth token for runner" step.
Expected behavior
Documentation of the upgrade process and/or error messages provide explicit instructions for recovering from the problem.
Waypoint Platform Versions
Additional version and platform information to help triage the issue if applicable:
Additional context
I spun up a test instance of my Waypoint/Nomad cluster (repo) to try out the upgrade process. With this fresh setup, it had server version 0.5.1 (the latest release as of now) so I expected running the upgrade procedure to basically leave the system unchanged. (I have this in production with a server that's still on 0.4.0.)
I ran the upgrade command, and this happened:
After that, I can't connect to the waypoint server:
Also, at this point, there is no longer a waypoint-runner job listed on the Nomad UI.
Attempting to restore the snapshot created during the upgrade fails similarly: