consul connect is extremely unreliable at startup

kneufeld commented 4 years ago

I've just had multiple multi-hour outages in the past two days because consul connect proxies would not start.

envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: exit status 1

That's it, no logs because it never actually started. Granted this is a complicated job with 5 groups and proxies between them but it used to work. It seems as soon as I had multiple jobs (built from a template) the proxies started to act up. Proxy names are all unique based on job name and group.

All I can do it just repeatedly hit restart until the gods smile on me and the proxy finally starts.

Nomad version

Started at 0.11.6 and desperately tried to upgrade to 0.12.7 to no avail.

Nomad v0.12.7 (6147cb578794cb2d0c35d68fe1791728a09bb081)

Operating system and Environment details

Ubuntu 18.04 and Ubuntu 20.04

Reproduction steps

Make a job with multiple groups and multiple connect sidecar between them (but not circular). The first instance of that job will probably work. Make 10 more.

After investing a lot of time and energy in nomad I'm afraid I'll have drop it for I have no idea what I'm going to do...

shoenig commented 4 years ago

Hi @kneufeld sorry you're having trouble with this.

Seeing the exit status 1 error message for creating bootstrap configuration indicates something went wrong when Nomad execs consul to retrieve the envoy bootstrap configuration from Consul. Did you happen to capture Consul logs while this is happening, which may shed some light into what's going on? What version of Consul are you using?

kneufeld commented 4 years ago

Nov 10 08:54:21 clients-nomad1.example.com consul[6571]:     2020-11-10T08:54:21.287-0800 [ERROR] agent.proxycfg: watch error: service_id=_nomad-task-bb39ddd4-951b-52d2-12a7-8bb290adebef-group-listener-     <redacted>-proxy-redis--sidecar-proxy id=service-http-checks:_nomad-task-bb39ddd4-951b-52d2-12a7-8bb290adebef-group-listener-<redacted>-proxy-redis- error="error filling agent cache: Internal cache failure: service '_nomad-task-bb39ddd4-951b-52d2-12a7-8bb290adebef-group-listener-<redacted>-proxy-redis-' not in agent state"

This is the only thing I can find that looks interesting.

# consul version
Consul v1.8.5
Revision 1e03567d3
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

I'd say there is a race condition between groups (I'm actually pretty positive there is) but this error happens with my redis task which is at the "bottom" on which everything else has a proxy connect to. ie: redis is the "server" proxy and does not have any "client" proxy

shoenig commented 4 years ago

Can you take a look at the Nomad agent logs when this happens? There should be a log line with the stderr output from exec-ing consul binary with relevent error message. For example something like,

2020-11-10T14:40:33.171-0600 [ERROR] client.alloc_runner.task_runner.task_hook.envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: alloc_id=8df94e24-253c-5fed-acde-63b42af168a8 task=connect-proxy-count-api error="exit status 1" stderr="This error message goes to stderr!"

kneufeld commented 4 years ago

Found these lines

Nov 10 08:56:01 clients-nomad1.west.example.com nomad[23271]:     2020-11-10T08:56:01.763-0800 [ERROR] client.alloc_runner.task_runner.task_hook.envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: alloc_id=ad06b594-d5b0-331c-6c3f-4843b2070eb8 task=connect-proxy-<redacted>-centre-proxy-usher error="exit status 1" stderr="==> No sidecar proxy registered for _nomad-task-ad06b594-d5b0-331c-6c3f-4843b2070eb8-group-housekeeper-<redacted>-centre-proxy-usher-
Nov 10 08:56:01 clients-nomad1.west.example.com nomad[23271]: "
Nov 10 08:56:02 clients-nomad1.west.example.com nomad[23271]:     2020-11-10T08:56:02.161-0800 [ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=ad06b594-d5b0-331c-6c3f-4843b2070eb8 task=connect-proxy-<redacted>-centre-proxy-redis error="prestart hook "envoy_bootstrap" failed: error creating bootstrap configuration for Connect proxy sidecar: exit status 1"
Nov 10 08:56:02 clients-nomad1.west.example.com nomad[23271]:     2020-11-10T08:56:02.161-0800 [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=ad06b594-d5b0-331c-6c3f-4843b2070eb8 task=connect-proxy-<redacted>-centre-proxy-redis reason="Restart within policy" delay=11.306497887s
Nov 10 08:56:02 clients-nomad1.west.example.com nomad[23271]:     2020-11-10T08:56:02.584-0800 [ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=ad06b594-d5b0-331c-6c3f-4843b2070eb8 task=connect-proxy-<redacted>-centre-proxy-usher error="prestart hook "envoy_bootstrap" failed: error creating bootstrap configuration for Connect proxy sidecar: exit status 1"
Nov 10 08:56:02 clients-nomad1.west.example.com nomad[23271]:     2020-11-10T08:56:02.585-0800 [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=ad06b594-d5b0-331c-6c3f-4843b2070eb8 task=connect-proxy-<redacted>-centre-proxy-usher reason="Restart within policy" delay=11.306497887s

shoenig commented 4 years ago

Hi @kneufeld , the -beta3 release contains #9330 which may help alleviate the issue, could you give that a try? Basically Nomad since 0.12 has been re-registering services with connect sidecars any time it needed to reconcile with Consul. On a busy cluster, this could mean Nomad is doing so continuously, possibly causing enough delays in registration of new services to cause the symptoms you are seeing.

kneufeld commented 4 years ago

that's v1.0-beta3? And I assume servers and clients both need to be upgraded?

Is there a way to restart nomad on a client without restarting the docker containers? I don't want to restart nomad if that will cause the containers to migrate because that will cause an outage on any machine not running v1. And that's assuming the problem is fixed.

Thanks for the quick turn around time and follow up.

kneufeld commented 4 years ago

I migrated my clients off of nomad and into their own vms and started with docker-compose.

This allowed me to upgrade my west datacenter to to v1.0-beta3. My east dc still has clients on 0.12 but the servers over there have been upgraded.

First tests looked good but then I started getting the same bootstrap errors. I may have found an other piece of the puzzle

Evaluations area insta failing.

nomad plan job.nomad
nomad job run -check-index 0 job.nomad

Then in hashi-ui I see the following instantly:

status: failed
statusdesc: maximum attempts reached (5)
triggeredby: job-registered

followed by a never ending stream of:

status: failed
statusdesc: maximum attempts reached (5)
triggeredby: max-plan-attempts

and

Status:    blocked
StatusDescription:     created due to placement conflicts
TriggeredBy:    max-plan-attempts

but I kinda call BS on placement conflicts since the original nomad plan passed and I have more nodes in my west dc than tasks and no constraints in my job file.

Is it possible that my cluster is just corrupt somehow? If so then how can nomad be trusted?

My final hope is to migrate the rest of my clients off nomad in the east dc and re-bootstrap the entire cluster. I have zero doubts that will work at first but long term?

ps. I've nomad stop -purge several times

kneufeld commented 3 years ago

@shoenig I got the rest of my jobs off the cluster and rebuilt it. My quick test works, tasks scale etc.

I really don't know what to think...

shoenig commented 3 years ago

Hi @kneufeld sorry this has caused so much trouble. The Nomad v0.12 release was centered around major underlying network architecture changes, and we've been squashing followup bugs in the point releases since. Though we hope for a smooth transition, the changes were made before Nomad v1.0 specifically so we could provide the stability guarantees of a 1.0 release going forward.

It sounds like the rebuilt cluster is operating okay?

I suspect the evaluation failures may have been caused by the problem described in https://github.com/hashicorp/nomad/pull/9356. The fix for that isn't present in the -beta3, but should be coming in a release soon.

kneufeld commented 3 years ago

@shoenig I'm only running a single job but I never saw any issues until I tried to run many jobs. I do think the cluster was somehow corrupted which is it's own very troubling thought.

I'm opening another bug that I saw pre-rebuild and post-rebuild with only a single job running.

kneufeld commented 3 years ago

@shoenig I just saw the same insta-blocked deployments/evaulations with only a single job in the cluster. I shutdown the leader to force an election and once a new leader was chosen I could successfully deploy a job.

This was after stopping all servers and clients, deleting /var/lib/nomad. Aka, I rebuilt a brand new cluster. Could there be some lingering schmoo in consul that's breaking everything?

kneufeld commented 3 years ago

@shoenig I just received the same

envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: exit status 1

error after starting 3 jobs in my cluster. This bug is not remotely fixed. I emailed my job file in issue #9431

Legogris commented 3 years ago

I'm getting the above behavior when attempting to use Connect on an existing cluster (Consul 1.9.1, Nomad 1.0.2, ACLs and TLS enabled). No errors in log about bootstrap error for Connect though, so maybe it's a similar manifestation of something else.

FWIW I can't get Connenct to work at all (using the Counter Dashboard example) even when making sure node is warm before submitting a single job. Nothing stands out in logs for consul or nomad, or the envoy containers, apart from Using deprecated option ... warnings in Envoy (tried with 1.13,1.14,1.15).

holtwilkins commented 3 years ago

I've seen this error in particular alongside the generated Nomad ACL tokens. Namely, what seems to happen is that as part of creating a new allocation, Nomad spins-up a service Consul token for the sidecar, and then immediately tries to use it. This token hasn't propagated quite yet into Consul, so I see in the Nomad logs some 403 ACL token not found errors, then eventually when the alloc keeps retrying, as it keeps using the same service token, it eventually becomes valid and the allocation succeeds.

So, it seems like the sidecar needs to wait a little bit more or have some retries built-in at an earlier phase rather than failing the whole allocation immediately and relying on the general allocation restart behaviour to sort it out.

Legogris commented 3 years ago

Sounds different from what I am experiencing here (though I think I've absolutely seen what you mention as well before).

All checks passing, services considered healthy, no errors in either consul or nomad

shevagit commented 3 years ago

We're getting the same error, without using ACL tokens, connect fails sometimes causing the tasks to fail.

"envoy_bootstrap" failed: error creating bootstrap configuration for Connect proxy sidecar: exit status 1"

betterthanbreakfast commented 3 years ago

Hello, small follow-up: We've got nomad 1.0.4 with TLS, but without ACL's atm. We've run into same issue, job fails during startup with "envoy_bootstrap" failed: error creating bootstrap configuration for Connect proxy sidecar: exit status 1"

In our case, reason was in service name, name contains UPPERCASE characters. After transform em to lowercase, problem was solved.

Legogris commented 3 years ago

Got hit by this again now when updating an existing job (the only thing changed: no-op change inside a generated template)

Stopping the job and starting it again:

[ERROR] client.alloc_runner.task_runner.task_hook.envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: alloc_id=foobar task=connect-proxy-deadbeef error="exit status 1" stderr="==> No proxy ID specified. One of -proxy-id, -sidecar-for, or -gateway is required
[ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=foobar task=connect-proxy-deadbeef error="prestart hook "envoy_bootstrap" failed: error creating bootstrap configuration for Connect proxy sidecar: exit status 1"

Tried the following, does not resolve it:

Stopping the job
Draining the client
Stopping nomad on the client
Doing a full reboot of the client instance
doing a GC (making sure the stopped job is forgotten)
Starting nomad on the client again
Scheduling a new instance of the job

Same sidecar proxy (there are 4 in total on the group) keeps failing with the same errors.

Consul logs around the same time:

[ERROR] agent.client: RPC failed to server: method=ConnectCA.Sign server=192.168.1.35:8300 error="rpc error making call: rpc error making call: Permission denied"
[ERROR] agent.client: RPC failed to server: method=ConnectCA.Sign server=192.168.1.34:8300 error="rpc error making call: Permission denied"
[ERROR] agent.client: RPC failed to server: method=ConnectCA.Sign server=192.168.1.33:8300 error="rpc error making call: rpc error making call: Permission denied"
[ERROR] agent.client: RPC failed to server: method=ConnectCA.Sign server=192.168.1.35:8300 error="rpc error making call: rpc error making call: Permission denied"
[ERROR] agent.proxycfg: Failed to handle update from watch: service_id=_nomad-task-foobar-group-deadbeef-1234-sidecar-proxy id=service-http-checks:_nomad-task-foobar-group-deadbeef-1234 error="error filling agent cache: Internal cache failure: service '_nomad-task-foobar-group-deadbeef-1234' not in agent state"
[ERROR] agent.proxycfg: Failed to handle update from watch: service_id=_nomad-task-foobar-group-deadbeef-1234-sidecar-proxy id=service-http-checks:_nomad-task-foobar-group-deadbeef-1234 error="error filling agent cache: Internal cache failure: service '_nomad-task-foobar-group-deadbeef-1234' not in agent state"

Legogris commented 3 years ago

Stopped all other jobs related to the failing sidecar container and performed a rolling restart of all consul servers and nomad servers and rescheduled - still same.

Observation: When draining the node, the 3 other sidecar tasks for the group with the failing one are stuck in running despite the main tasks being successfully killed. First time I see a drain taking more than a few seconds. Only doing a manual stop successfully stops the job fully.

In the "zombie" sidecar tasks I see in stderr after the drain (apart from all the deprecation warnings):

[info][upstream] [source/common/upstream/cds_api_impl.cc:64] cds: add 1 cluster(s), remove 1 cluster(s)

Legogris commented 3 years ago

The error seems to trigger here: https://github.com/hashicorp/consul/blob/8f3223a98456d3fb816897bdf5eb4fbf1c9f4109/command/connect/envoy/envoy.go#L289

Legogris commented 3 years ago

Arbitrarily changing the service name (amending intentions accordingly) made it come back up again. So seems to be the exact same thing @flytocolors was having (the casing for them is most likely coincidental though as that was not a factor here)

shoenig commented 3 years ago

@Legogris are you using Ingress Gateways, and on which version of Nomad? We fixed a bug that would trigger that error in Nomad v1.0.4. If we're still running into that error with just normal sidecars we can investigate further. In particular this log line (with DEBUG enabled) should shed some light on what's happening.

shoenig commented 3 years ago

Thanks for pointing out the uppercase service name problem, @flytocolors. I've filed https://github.com/hashicorp/nomad/issues/10450 to get some validation on the Nomad side to help prevent that from happening.

shoenig commented 3 years ago

@holtwilkins I think you've described what most folks here would be intersted in, which is that there's a step during the envoy boostrap process where Nomad and Consul don't have a good way of coordinating when it's safe to actually execute the bootstrapping command.

I've created https://github.com/hashicorp/nomad/issues/10451 to track thoughts and at least a hacky "fix" to slip into v1.1.

shoenig commented 3 years ago

I don't think this would apply to anyone using Nomad, but we can also do https://github.com/hashicorp/nomad/issues/10452 which would trigger the aforementioned error if more than one sidecar is being registered per parent service. (You'd have to be doing something interesting like an out-of-band registration.)

Legogris commented 3 years ago

@shoenig Do you have some other reason to believe that uppercase service names are indeed a problem? I had the exact same behavior as @flytocolors, except no uppercase and in my case the problem was resolved by adding a 2 at the end of the service name.

We're running v1.0.4, just "normal" sidecars (no ingress gateways)

shoenig commented 3 years ago

@Legogris Sorry for the confusion, but no - the case sensitivity thing is just one of I think at least 4 causes identified so far for connect startup problems,

parent service definition propagation delay https://github.com/hashicorp/nomad/issues/10451
minted service identity token propagation delay https://github.com/hashicorp/nomad/issues/10451
upper case characters not supported when using -sidecar-for https://github.com/hashicorp/nomad/issues/10450
bug preventing more than 1 ingress gateway per group https://github.com/hashicorp/nomad/pull/9849

Those last two have definite fixes, and the first two we're trying https://github.com/hashicorp/nomad/pull/10453 - which increases the allowable wait time on Consul from 6 to 60 seconds. If that still doesn't work then we might need some help from Consul on creating a better way to coordinate on propagation.

The behavior you're seeing could also involve one of those first two points, but this log line suggests there's an insufficient Consul agent ACL at play, eventually leading to envoy bootstrapping failing.

[ERROR] agent.client: RPC failed to server: method=ConnectCA.Sign server=192.168.1.35:8300 error="rpc error making call: rpc error making call: Permission denied"

leanst-daniel commented 3 years ago

I've deployed a small environment in my lab to test our brand new deployments. The configuration only contains the very basics and no ACLs but I also run into this issue:


Jun 04 10:48:46 nomad1 consul[420]:     2021-06-04T10:48:46.827Z [INFO]  agent: Synced check: check=service:_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002-sidecar-proxy:2
Jun 04 10:49:01 nomad1 consul[420]:     2021-06-04T10:49:01.247Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=service-http-checks error="Internal cache failure: service '_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002' not in agent state" index=0
Jun 04 10:49:01 nomad1 consul[420]:     2021-06-04T10:49:01.247Z [ERROR] agent.proxycfg: Failed to handle update from watch: service_id=_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002-sidecar-proxy id=service-http-checks:_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002 error="error filling agent cache: Internal cache failure: service '_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002' not in agent state"
Jun 04 10:49:01 nomad1 consul[420]:     2021-06-04T10:49:01.254Z [INFO]  agent: Deregistered service: service=_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002
Jun 04 10:49:01 nomad1 consul[420]:     2021-06-04T10:49:01.254Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=service-http-checks error="Internal cache failure: service '_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002' not in agent state" index=0
Jun 04 10:49:01 nomad1 consul[420]:     2021-06-04T10:49:01.254Z [ERROR] agent.proxycfg: Failed to handle update from watch: service_id=_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002-sidecar-proxy id=service-http-checks:_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002 error="error filling agent cache: Internal cache failure: service '_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002' not in agent state"
Jun 04 10:49:01 nomad1 consul[420]:     2021-06-04T10:49:01.254Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=service-http-checks error="Internal cache failure: service '_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002' not in agent state" index=0
Jun 04 10:49:01 nomad1 consul[420]:     2021-06-04T10:49:01.254Z [ERROR] agent.proxycfg: Failed to handle update from watch: service_id=_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002-sidecar-proxy id=service-http-checks:_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002 error="error filling agent cache: Internal cache failure: service '_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002' not in agent state"
Jun 04 10:49:01 nomad1 consul[420]:     2021-06-04T10:49:01.263Z [INFO]  agent: Deregistered service: service=_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002-sidecar-proxy
Jun 04 10:49:01 nomad1 consul[420]:     2021-06-04T10:49:01.264Z [WARN]  agent: Failed to deregister service: service=_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002-sidecar-proxy error="Service "_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002-sidecar-proxy" does not exist"
Jun 04 10:49:07 nomad1 consul[420]:     2021-06-04T10:49:07.159Z [INFO]  agent: Synced service: service=_nomad-task-28cb7ab1-6781-2d16-1469-1c1045c8510a-group-api-count-api-9001
Jun 04 10:49:07 nomad1 consul[420]:     2021-06-04T10:49:07.174Z [INFO]  agent: Synced service: service=_nomad-task-28cb7ab1-6781-2d16-1469-1c1045c8510a-group-api-count-api-9001-sidecar-proxy
Jun 04 10:49:08 nomad1 consul[420]:     2021-06-04T10:49:08.809Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=service-http-checks error="Internal cache failure: service '_nomad-task-28cb7ab1-6781-2d16-1469-1c1045c8510a-group-api-count-api-9001' not in agent state" index=1
Jun 04 10:49:08 nomad1 consul[420]:     2021-06-04T10:49:08.811Z [INFO]  agent: Deregistered service: service=_nomad-task-28cb7ab1-6781-2d16-1469-1c1045c8510a-group-api-count-api-9001
Jun 04 10:49:08 nomad1 consul[420]:     2021-06-04T10:49:08.811Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=service-http-checks error="Internal cache failure: service '_nomad-task-28cb7ab1-6781-2d16-1469-1c1045c8510a-group-api-count-api-9001' not in agent state" index=1
Jun 04 10:49:08 nomad1 consul[420]:     2021-06-04T10:49:08.811Z [WARN]  agent.cache: handling error in Cache.Notify: cache-type=service-http-checks error="Internal cache failure: service '_nomad-task-28cb7ab1-6781-2d16-1469-1c1045c8510a-group-api-count-api-9001' not in agent state" index=1
Jun 04 10:49:08 nomad1 consul[420]:     2021-06-04T10:49:08.814Z [INFO]  agent: Deregistered service: service=_nomad-task-28cb7ab1-6781-2d16-1469-1c1045c8510a-group-api-count-api-9001-sidecar-proxy
Jun 04 10:49:08 nomad1 consul[420]:     2021-06-04T10:49:08.815Z [WARN]  agent: Failed to deregister service: service=_nomad-task-28cb7ab1-6781-2d16-1469-1c1045c8510a-group-api-count-api-9001-sidecar-proxy error="Service "_nomad-task-28cb7ab1-6781-2d16-1469-1c1045c8510a-group-api-count-api-9001-sidecar-proxy" does not exist"
Jun 04 10:49:08 nomad1 consul[420]:     2021-06-04T10:49:08.822Z [INFO]  agent: Synced service: service=_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002
Jun 04 10:49:08 nomad1 consul[420]:     2021-06-04T10:49:08.839Z [INFO]  agent: Synced service: service=_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002-sidecar-proxy
Jun 04 10:49:10 nomad1 consul[420]:     2021-06-04T10:49:10.300Z [INFO]  agent: Synced check: check=service:_nomad-task-1195ec6e-e31c-0a5e-a877-f21bea58e782-group-dashboard-countdashboard-9002-sidecar-pro

Running on Nomad 1.1.0 and Consul 1.9.5

Thanks!

jhitt25 commented 2 years ago

Not to "me too" an issue, but im getting this sporadically as well. We're on consul 1.11.2 and nomad 1.2.4, using ACLs for both. Our cluster is running just fine and i can even stop/start/restart existing jobs, but I have a new job that is refusing to start right now. This job has been running off and on for the past week as i tweak things for our environment, but as of this morning it is consistently failing with this exact scenario.

We're testing a 2 node keycloak cluster in Nomad/Consul Connect with a Treafik frontend. The goal is to have 3 ports - service, health check, and jgroups (for the JBoss clustering). The jgroups port is outside connect, and the health check is a manual 'expose' because it runs on a different port than the service. This has worked fine for going on 2 weeks, but this morning started in with the same issue described here. I don't know if some facet of what we're doing adds any new information, but i figured it couldn't hurt.

I've attached our job file along with the excerpts of the consul and nomad logs as well if it can help.

nomad-job.txt consul-log.txt nomad-job.txt

jhitt25 commented 2 years ago

As an update to my addition, i was able to get past the issue by moving my non-connect'ed service (for jgroups) from the group level into the task level.

schemmerling commented 2 years ago

Well, I will intentionally leave a "me too" here - Note the following error message: Mar 19 16:51:28 TEST sh[17233]: 2022-03-19T16:51:28.860-0600 [ERROR] agent.proxycfg: Failed to handle update from watch: kind=connect-proxy proxy=global service_id=global id=service-http-checks:hello error="error filling agent cache: Internal cache failure: service 'hello' not in agent state" Consul v1.11.4

tgross commented 6 months ago

Leaving a note here based on a conversation I had with @schmichael and some folks from Consul recently. We may be running into the issue fixed for the Consul K8s plugin here: https://github.com/hashicorp/consul-k8s/pull/887 Effectively we're creating the token and then the agent may have a stale view of the world when we use that token. Unfortunately this PR isn't something we can just drop into the Nomad code base because the architecture is a good bit different, but that's a promising lead for when this issue gets revisited next.

tgross commented 5 months ago

I've got an early prototype branch b-consul-token-read-self-on-login that covers the approach described above. I'll be working on getting that completed to cover the legacy workflow next, and then figuring out a way that we can validate that the fix is useful.

Internal ref: https://hashicorp.atlassian.net/browse/NET-10051

tgross commented 4 months ago

I've just merged https://github.com/hashicorp/nomad/pull/23381 which hopefully should close out this issue. That's planned for release in Nomad 1.8.2 (with backports to Nomad Enterprise 1.7.x and 1.6.x). Once you've deployed that, if you're still running into issues with Envoy bootstrap, please report a new issue after having gone through the troubleshooting guides in Nomad's Service Mesh troubleshooting and Resolving Common Errors in Envoy Proxy. That'll help us separate out additional problems above-and-beyond the ones we've identified here.

hashicorp / nomad