brian-athinkingape opened 1 year ago
Hi @brian-athinkingape! I was able to reproduce the behavior you're seeing exactly. Thank you so much for providing a solid minimal example, it really helps a lot! The tl;dr is that you've hit a known design issue between Consul and Nomad around gateways, which is described by my colleague @shoenig in https://github.com/hashicorp/nomad/issues/8647#issuecomment-691290660
There's a workaround roughly described in https://github.com/hashicorp/consul/issues/10308#issuecomment-849713211. I'm going to show that workaround first and then get into the nitty-gritty of why this is happening below.
Read the current kind=ingress-gateway config to a file, and then remove the listener:
$ consul config read -kind ingress-gateway -name test-ingress > ./ingress.json
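For reference, the read output will include a Listeners block tied to the http protocol; that's the part to strip. The exact content depends on your setup — the shape below is only an illustration, with an assumed port and with the service name borrowed from the error message later in this thread:

```json
{
  "Kind": "ingress-gateway",
  "Name": "test-ingress",
  "TLS": {
    "Enabled": false
  },
  "Listeners": [
    {
      "Port": 8080,
      "Protocol": "http",
      "Services": [
        { "Name": "test-upstream" }
      ]
    }
  ]
}
```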
Transform this into:
{
"Kind": "ingress-gateway",
"Name": "test-ingress",
"TLS": {
"Enabled": false
}
}
Write the new config and delete the kind=proxy-defaults config:
$ consul config write ./ingress.json
Config entry written: ingress-gateway/test-ingress
$ consul config delete -kind proxy-defaults -name global
Config entry deleted: proxy-defaults/global
Now the second job works:
$ nomad job run ./job2.nomad
==> 2022-10-05T11:13:04-04:00: Monitoring evaluation "7ecf8803"
2022-10-05T11:13:04-04:00: Evaluation triggered by job "job2"
2022-10-05T11:13:04-04:00: Allocation "fdef5d8f" created: node "35be55c7", group "group2"
2022-10-05T11:13:04-04:00: Evaluation status changed: "pending" -> "complete"
==> 2022-10-05T11:13:04-04:00: Evaluation "7ecf8803" finished with status "complete"
Now for the nitty-gritty of why this happens. Without the workaround in place, running job2 hits the error you reported:
$ nomad job run ./job2.nomad
Error submitting job: Unexpected response code: 500 (Unexpected response code: 500 (service "test-upstream" has protocol "http", which does not match defined listener protocol "tcp"))
A clue to what's going on is that job2 isn't registered at all, which means the failure happens during the initial job submission and not as part of allocation setup after we've scheduled the workload. That narrows the behavior down to this block: job_endpoint.go#L249-L272 in the Job.Register RPC, which writes a configuration entry to Consul. (I'm also seeing that Sentinel policy enforcement happens after we've done that, which seems backwards, but I'll address that elsewhere.)
I was a little confused about why we'd do this in the job-register code path at all rather than on the client node after an allocation is placed, but some digging turned up this comment https://github.com/hashicorp/nomad/issues/8647#issuecomment-691290660 from my colleague @shoenig, which discusses the "multi-writer" problem we have: Consul owns the configuration entry, the entry is global, and multiple Nomad clusters could all be writing to it.
One way to imagine the problem is to consider what would happen if you ran both job1 and job2 at the same time: each submission writes the whole config entry, so there would be no way for Nomad to update Consul correctly.
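A crude way to see the last-writer-wins hazard is the sketch below, where a temp file stands in for the single global Consul config entry and the cluster/service names are made up:

```shell
# Simulate two Nomad clusters each writing the ONE global ingress-gateway
# config entry. Each writer replaces the whole entry, so whichever write
# lands last silently discards the other's listeners.
entry=$(mktemp)

echo '{"Listeners":[{"Protocol":"http","Service":"job1"}]}' > "$entry"  # cluster A
echo '{"Listeners":[{"Protocol":"tcp","Service":"job2"}]}'  > "$entry"  # cluster B

cat "$entry"   # only cluster B's listeners remain; job1's are gone
rm -f "$entry"
```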
So ultimately this issue is a duplicate of #8647 and something we need to fix, which I realize isn't very satisfying in the short term.
A challenging part of figuring out what to do as an operator is that the Consul CLI and UI aren't very clear about where this data lives; the ingress gateway isn't exposed in the consul catalog CLI at all. So it took me a little while to find https://github.com/hashicorp/consul/issues/10308 and develop the workaround described above.
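For anyone else hunting for this data: gateways live in Consul's config entry store rather than the service catalog, so they can be listed directly. These commands need a running Consul agent, so treat this as a sketch:

```shell
# Gateways don't appear in `consul catalog services`; query config entries instead:
consul config list -kind ingress-gateway
consul config read -kind ingress-gateway -name test-ingress
consul config list -kind proxy-defaults
```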
Although this is technically a duplicate, there could be unique bits to it. I'm going to keep this open, mark it for roadmapping, and crosslink to it from #8647.
Thanks! We used the workaround to resolve the issue on our production system for now. Looking forward to when this can be fixed properly!
Nomad version
Nomad v1.3.5 (1359c2580fed080295840fb888e28f0855e42d50)
Operating system and Environment details
Ubuntu 22.04 on AWS (on a fresh EC2 instance), amd64
Consul v1.13.2
Revision 0e046bbb
Build Date 2022-09-20T20:30:07Z
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
Docker version 20.10.18, build b40c2f6
Issue
If I run an ingress container with the http protocol, I'm unable to edit it to use tcp, even after I stop the job. Even if I run nomad system gc and nomad system reconcile summaries, it still doesn't work. I'm also unable to edit the Consul config to use tcp. If I swap all instances of http and tcp, I get the same errors.
Reproduction steps
1. Start nomad/consul in dev mode
2. Set up Consul to use http as the default protocol (using the proxy-defaults.hcl file below)
3. Run the first job file
4. After the job has started, stop the job
5. When the job stops successfully, run the second job file
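Assuming default dev-mode flags and the file names from the attachments below, the steps might look like:

```shell
# 1. Dev-mode agents (flags assumed; adjust to your environment)
consul agent -dev &
nomad agent -dev-connect &

# 2. Make http the default protocol via proxy-defaults
consul config write ./proxy-defaults.hcl

# 3-5. Run job1, stop it, then try job2
nomad job run ./job1.nomad
nomad job stop -purge job1
nomad job run ./job2.nomad   # fails with the listener-protocol error
```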
Expected Result
I should be able to run job2 as normal.
Actual Result
Job file (if appropriate)
proxy-defaults.hcl
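The file content isn't preserved in this extract; a proxy-defaults entry matching the description (http as the default protocol) would typically look like:

```hcl
Kind = "proxy-defaults"
Name = "global"
Config {
  protocol = "http"
}
```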
service-defaults.hcl
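Likewise, the service-defaults content is not preserved here; one consistent with the error message (service name assumed from that message) might be:

```hcl
Kind     = "service-defaults"
Name     = "test-upstream"
Protocol = "http"
```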
job1.nomad:
job2.nomad: