hashicorp / waypoint

A tool to build, deploy, and release any application on any platform.
https://waypointproject.io

Waypoint commands hang when waypoint runner hits `invalid authentication token` error #2748

Open ragesoss opened 2 years ago

ragesoss commented 2 years ago

Describe the bug

I ran into an error today: `waypoint.runner.config_recv: error receiving configuration, exiting: err="rpc error: code = unknown desc = unknown key: invalid authentication token"`

The situation was as follows:

My production app (dashboard.wikiedu.org) runs on a nomad cluster with waypoint. It had been running smoothly, with no recent changes or deployments. For reasons I haven't yet identified, all the jobs running the rails app started failing, and nomad could not re-place them. I attempted to use waypoint to start a new deployment, but all waypoint commands (`waypoint up`, `waypoint deploy`) would just hang. I found the above error when running waypoint commands with the `-vvv` flag.

Despite that error, nomad showed the waypoint-server and waypoint-runner jobs running and healthy. I was able to restore the system to working condition by stopping and removing the waypoint-server and waypoint-runner jobs from the nomad cluster, then re-installing waypoint and doing a fresh waypoint up.

I'm not sure where to begin for figuring out what went wrong.

Steps to Reproduce ?

Expected behavior

When waypoint is misconfigured such that the server can't communicate with the runner (or whatever the broken state I ran into was), it should display relevant error information (without verbose flags).

Waypoint Platform Versions

evanphx commented 2 years ago

Hi @ragesoss,

First off, congrats on the production app with Waypoint!

So, the nomad waypoint install process has gone through a lot of changes in the last few months to make it more robust. One of the biggest was the ability to make sure that the waypoint server's data.db file is stored on a persistent volume. It's possible you don't have that set up and the allocation was killed and recreated, which would have caused the data.db file to be lost, resulting in your error.

Can you use nomad to pull out the information about the allocation and the job specification for us to look at? From there, we should be able to figure it out.
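For anyone gathering the same information, here is a minimal sketch of one way to do it. It assumes the `nomad` CLI is on the PATH and that the jobs use the names mentioned earlier in this thread (`waypoint-server` and `waypoint-runner`); the helper names `nomad_commands` and `collect` are made up for illustration.

```python
import shutil
import subprocess

# Job names as mentioned in this thread; adjust if your install differs.
JOBS = ["waypoint-server", "waypoint-runner"]


def nomad_commands(job):
    """Return the nomad CLI invocations that dump a job's spec and allocations."""
    return [
        ["nomad", "job", "inspect", job],             # job specification as JSON
        ["nomad", "job", "status", "-verbose", job],  # allocations with full IDs
    ]


def collect(jobs=JOBS, run=subprocess.run):
    """Run each command and return {command string: stdout + stderr}."""
    report = {}
    for job in jobs:
        for argv in nomad_commands(job):
            result = run(argv, capture_output=True, text=True)
            report[" ".join(argv)] = result.stdout + result.stderr
    return report


if __name__ == "__main__":
    if shutil.which("nomad") is None:
        print("nomad CLI not found on PATH")
    else:
        for cmd, output in collect().items():
            print(f"$ {cmd}\n{output}")
```

Each allocation ID listed by `nomad job status -verbose` can then be passed to `nomad alloc status <alloc-id>` for per-allocation detail.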

ragesoss commented 2 years ago

@evanphx thanks!

The lack of a persistent volume for the waypoint server was a problem I ran into about two weeks before I hit this issue; it was then that I removed and re-installed the waypoint server to upgrade to version 0.6.2, and had to add the persistent volume to get that to work. So at the time I hit the issue, it had been running 0.6.2 with a persistent volume for more than a week.

My workaround was to remove and re-install the waypoint server, so the allocation and job that broke are gone; the nomad UI shows both the current server and runner as version 0. If there's a way to get allocation / job specification data for the now-removed broken allocation, I don't know where to look for it. (If data from the current, working allocations/jobs would be useful, I can provide that.)

All that story was intended mainly as context, though, for what I think is a small, concrete bug: that waypoint was hanging and did not output any error details by default when it hit the token error.
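Until the CLI reports this by default, a wrapper along these lines could serve as a stopgap. This is only a sketch, not how waypoint itself works: it runs a waypoint command with `-vvv` under a timeout and scans the captured output for the token error. The 60-second timeout and the function names are assumptions chosen for illustration.

```python
import shutil
import subprocess

TOKEN_ERROR = "invalid authentication token"


def as_text(data):
    """Normalize captured output, which may be None, bytes, or str."""
    if data is None:
        return ""
    if isinstance(data, bytes):
        return data.decode(errors="replace")
    return data


def find_token_error(log_text):
    """Return the first log line mentioning the token error, or None."""
    for line in log_text.splitlines():
        if TOKEN_ERROR in line:
            return line.strip()
    return None


def run_with_timeout(argv, timeout_s=60):
    """Run a command; return (timed_out, combined stdout and stderr)."""
    try:
        result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
        return False, as_text(result.stdout) + as_text(result.stderr)
    except subprocess.TimeoutExpired as exc:
        # Output captured before the timeout is attached to the exception.
        return True, as_text(exc.stdout) + as_text(exc.stderr)


def diagnose(argv, timeout_s=60):
    """Summarize whether the command hit the token error, hung, or finished."""
    timed_out, output = run_with_timeout(argv, timeout_s)
    hit = find_token_error(output)
    if hit:
        return f"token error: {hit}"
    if timed_out:
        return f"no response after {timeout_s}s and no token error in the log"
    return "command finished without the token error"


if __name__ == "__main__":
    if shutil.which("waypoint") is None:
        print("waypoint CLI not found on PATH")
    else:
        print(diagnose(["waypoint", "deploy", "-vvv"]))
```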

(Related to this story and the waypoint install process... I've been creating and destroying test clusters for the same app in recent weeks, and one thing that's been happening very consistently that did not happen the same way earlier is that every time the provisioner for installing waypoint runs on a new cluster, it installs the server but then goes through 12(?) tries without installing the runner. That leaves the provisioner in a 'dirty' state, and I can then immediately do another terraform apply and it tries again, finds the existing server, and succeeds in installing the runner.)