hashicorp / waypoint

A tool to build, deploy, and release any application on any platform.
https://waypointproject.io

Procedure for recovering from failed `waypoint upgrade` is not documented #2113

Closed ragesoss closed 3 years ago

ragesoss commented 3 years ago

Describe the bug

The automatic server upgrade process can fail and leave the system in an unusable state, but the error message and upgrade documentation don't provide enough info on how to recover from the situation.

In particular, if the automatic upgrade process (`waypoint server upgrade -platform=nomad -auto-approve`) fails after upgrading the server but before setting up the new Waypoint runner job (at the "Retrieving new auth token for runner" step), the upgrading documentation does not provide enough guidance on how to uninstall the newly installed server and restore the snapshot (or how to retry placing the runner job, or some other way to recover).

Steps to Reproduce

Expected behavior

Documentation of upgrade process and/or error messages provide explicit instructions for recovering from the problem.

Waypoint Platform Versions

Additional version and platform information to help triage the issue if applicable:

Additional context

I spun up a test instance of my Waypoint/Nomad cluster (repo) to try out the upgrade process. With this fresh setup, it had server version 0.5.1 (the latest release as of now) so I expected running the upgrade procedure to basically leave the system unchanged. (I have this in production with a server that's still on 0.4.0.)

I ran the upgrade command, and this happened:

$ waypoint server upgrade -platform=nomad -auto-approve  # Mon 23 Aug 2021 04:14:12 PM PDT
✓ Context "wikiedu-testing" validated and connected successfully.
✓ Snapshot of server written to: 'waypoint-server-snapshot-1629760461'

» Upgrading...
  Waypoint server will now upgrade from version "v0.5.1"
✓ Detected existing Waypoint server
✓ Upgrade of Waypoint server on Nomad complete!

» Verifying upgrade...
✓ Server connection verified!

» Upgrading runner if required...
✓ Previous runner uninstalled
✓ Waypoint runner job and allocations purged
❌ Retrieving new auth token for runner...
! Error retrieving auth token for runner: error reading from server: read tcp 192.168.0.5:47758->74.207.248.187:20134:
  read: connection reset by peer

  The Waypoint runner failed to install. This error occurred after the
  Waypoint server was successfully installed. Your CLI is configured to
  use the installed server. If you want to retry, you must uninstall the
  server first.

After that, I can't connect to the waypoint server:

$ waypoint -v  # Tue 24 Aug 2021 08:43:58 AM PDT
CLI: v0.5.1 (76d7e17f)
Error connecting to server to read server version: context deadline exceeded

Also, at this point, there is no longer a waypoint-runner job listed on the Nomad UI.
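
The same check can be made from a terminal instead of the Nomad UI. The following is only a minimal sketch: it assumes the `nomad` CLI is installed and pointed at the cluster, and that the jobs are named `waypoint-server` and `waypoint-runner` (the runner name appears in this report; the server name is an assumption). The `check_waypoint_jobs` function and the `NOMAD_BIN` override are hypothetical helpers, not part of Waypoint or Nomad.

```shell
# Report which of the expected Waypoint jobs Nomad currently knows about.
# Assumptions: job names below match your install; NOMAD_BIN is a
# hypothetical override that lets you point at a different nomad binary.
check_waypoint_jobs() {
  for job in waypoint-server waypoint-runner; do
    # "nomad job status <name>" exits non-zero when no matching job is found.
    if "${NOMAD_BIN:-nomad}" job status "$job" >/dev/null 2>&1; then
      echo "$job: present"
    else
      echo "$job: missing"
    fi
  done
}
```

Calling `check_waypoint_jobs` prints one line per job. Note that `nomad job status` matches by prefix and also succeeds for jobs that are registered but dead, so "present" here only means Nomad still lists the job.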

Attempting to restore the snapshot created during the upgrade fails similarly:

$ waypoint server restore waypoint-server-snapshot-1629760461  # Tue 24 Aug 2021 09:28:55 AM PDT
! failed to create client: context deadline exceeded

briancain commented 3 years ago

Hey there @ragesoss - sorry to hear about the upgrade failure 😞

It looks like you aren't completely out of luck. At the beginning of the upgrade Waypoint takes a snapshot of your server. I think you should be able to recover your server by the following:

Please give this a shot and let us know how it worked out! Thank you ❤️ 🙏🏻

There are a couple of enhancements we can make as a team here that would make this better:

It looks like the actual server upgrade was successful; the issue was that we failed to connect to the runner. If I remember right, the server upgrade command might not save its server context for the CLI until the end of the command. We should save the server context immediately after the server is successfully upgraded. If we had done that, I think you would have been able to run things like `waypoint version` or install a new runner.

Next, we should add more docs to the server upgrade page to include information about runners: https://www.waypointproject.io/docs/upgrading

ragesoss commented 3 years ago

Thanks @briancain! I couldn't figure out how to uninstall the existing server, but I was able to run the install again (basically, by manually executing the original provisioner local-exec command that was used to install it in the first place, with the same environment variables). This put the system back in a working state, with both server and runner jobs showing up in Nomad. After that, I was able to run waypoint server restore waypoint-server-snapshot-1629760461 successfully to restore the snapshot.
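
The order that worked here (reinstall first, then restore) can be sketched as a small script. This is an illustration under stated assumptions, not an official procedure: the install flags shown (`-platform=nomad -accept-tos`) are an assumption and should be replaced with whatever flags and environment variables the original install used, and the `run` and `recover_waypoint` helpers are hypothetical names introduced for this example.

```shell
# Sketch of the recovery order described above: reinstall, then restore.
# Set DRY_RUN=1 to print the commands instead of executing them.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Usage: recover_waypoint <snapshot-file>
recover_waypoint() {
  # Step 1: re-run the original install so the server and runner jobs
  # exist again. These flags are an assumption; reuse your own.
  run waypoint install -platform=nomad -accept-tos || return 1
  # Step 2: restore the snapshot written at the start of the failed upgrade.
  run waypoint server restore "$1" || return 1
}
```

Running `DRY_RUN=1 recover_waypoint waypoint-server-snapshot-1629760461` only prints the two commands, which is a cheap way to review them before running against a real cluster.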

I've now run the upgrade command again, but it failed in the same way ("connection reset by peer" while trying to retrieve new auth token for runner).

briancain commented 3 years ago

Hey @ragesoss - I was able to reproduce this when upgrading Waypoint server and runners in Nomad. My runner didn't fail during the upgrade, but it fails immediately after it gets scheduled in Nomad with a similar error.

I'm glad the restore worked, however! I'm not sure why the context is invalid; after looking at the server upgrade CLI, it's saving the context at the spot I expected. Anyway, we will take a look at this issue. Thanks for reporting it and providing all of this debug information!

briancain commented 3 years ago

I have a hunch this is related to https://github.com/hashicorp/waypoint/issues/1348, i.e. when Nomad schedules a new server job, it loses the volume from the previous server. 🤔

krantzinator commented 3 years ago

We will use this issue to update our docs, and issue 1348 to address the underlying bug.

xiaolin-ninja commented 3 years ago

Closing this issue because the documentation has been updated, and the original problem is resolved. :)