brian-athinkingape opened this issue 2 months ago
Hi @brian-athinkingape! I've run the scenario you've shown above on the most recent versions of Nomad and Consul, and I'm getting the same error when making a request to the ingress allocation's address. My slightly modified jobspecs are below:
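For reference, here's a minimal sketch of the pair of jobspecs along these lines. This is reconstructed from the outputs below, not the exact files used; the image name is a placeholder and the datacenter is an assumption:

```hcl
job "flask" {
  datacenters = ["dc1"]

  group "flask" {
    network {
      mode = "bridge"
      port "fivethousand" {
        to = 5000 # dynamic host port mapped to container port 5000
      }
    }

    service {
      name = "flask"
      port = "fivethousand"

      connect {
        sidecar_service {}
      }
    }

    task "flask" {
      driver = "docker"
      config {
        image = "example/flask-hello:latest" # placeholder image
      }
    }
  }
}

job "flask-ingress" {
  datacenters = ["dc1"]

  group "flask-ingress" {
    network {
      mode = "bridge"
      port "inbound" {
        static = 5555
        to     = 5555
      }
    }

    service {
      name = "flask-ingress"
      port = "5555"

      connect {
        gateway {
          ingress {
            listener {
              port     = 5555
              protocol = "http"
              service {
                name  = "flask"
                hosts = ["*"]
              }
            }
          }
        }
      }
    }
  }
}
```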
Looking up the addresses for the two allocations:
$ nomad alloc status fc6eb399
ID = fc6eb399-b703-c241-2d71-cf860f3da995
Eval ID = d161006e
Name = flask.flask[0]
...
Allocation Addresses (mode = "bridge"):
Label Dynamic Address
*fivethousand yes 10.37.105.17:24757 -> 5000
*connect-proxy-flask yes 10.37.105.17:26380 -> 26380
$ nomad alloc status aaa87a6f
ID = aaa87a6f-1456-6dbd-fdc5-1f9d51b4daf4
Eval ID = eb714850
Name = flask-ingress.flask-ingress[0]
...
Allocation Addresses (mode = "bridge"):
Label Dynamic Address
*default yes 10.37.105.17:5555
...
Making the requests, we see that Flask is up and responding to requests, but the ingress proxy isn't wired up.
$ curl 10.37.105.17:24757
Hello, World!%
$ curl 10.37.105.17:5555
upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: delayed connect error: 111%
The error we're both seeing is coming from the Envoy proxy, and it's because Envoy is getting an ECONNREFUSED from the upstream service. Generally, you'll want to look at the Envoy Proxy Troubleshooting Guide for more details. It might also help to look at the Envoy bootstrap configuration files or task log files as described in service mesh troubleshooting.
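One convenient way to poke at a sidecar is to query its Envoy admin interface from inside the task. Here's a hypothetical helper (not from the guides above); the function name is mine, it assumes curl is available in the task image, and the default admin address comes from the -admin-bind flag you'll see in the bootstrap commands:

```shell
# Query an Envoy admin endpoint from inside a sidecar task.
# Usage: envoy_admin <task> <alloc-id> [path] [admin-addr]
envoy_admin() {
  local task="$1" alloc="$2" path="${3:-listeners}" admin="${4:-127.0.0.2:19000}"
  # /listeners, /clusters, and /config_dump are standard Envoy admin endpoints
  nomad alloc exec -task "$task" "$alloc" curl -s "http://${admin}/${path}"
}
```

For example, `envoy_admin connect-ingress-flask-ingress aaa8 clusters` would show whether the gateway's upstream cluster has any healthy endpoints.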
In this case the "resolving common errors" troubleshooting guide doesn't have much for us, as the Envoy proxy sidecars are healthy. So I turned to the Nomad troubleshooting guide, which says to check the logs first. I looked at the ingress proxy's logs with nomad alloc logs -task connect-ingress-flask-ingress -stderr aaa8 and see the listener getting configured as I'd expect:
[2024-06-25 17:56:25.390][1][info][upstream] [source/extensions/listener_managers/listener_manager/lds_api.cc:99] lds: add/update listener 'http:0.0.0.0:5555'
The Envoy bootstrap command looks ok for the gateway as well, and the bootstrap logs are empty:
$ nomad alloc exec -task connect-ingress-flask-ingress aaa8 cat secrets/.envoy_bootstrap.cmd
connect envoy -grpc-addr unix://alloc/tmp/consul_grpc.sock -http-addr 127.0.0.1:8500 -admin-bind 127.0.0.2:19000 -address 127.0.0.1:19100 -proxy-id _nomad-task-aaa81c27-b0a4-8f0e-a12d-756a13ecef95-group-flask-ingress-flask-ingress-5555 -bootstrap -gateway ingress -token 1b9f4c91-6770-cb45-a445-601a0d2181c6
$ nomad alloc fs aaa8 alloc/logs/envoy_bootstrap.stderr.0
So then I move over to the Flask app's proxy logs with nomad alloc logs -task connect-proxy-flask -stderr fc6eb399:
[2024-06-25 17:34:48.831][1][info][upstream] [source/extensions/listener_managers/listener_manager/lds_api.cc:99] lds: add/update listener 'public_listener:0.0.0.0:26380'
$ nomad alloc exec -task connect-proxy-flask fc6eb399 cat secrets/.envoy_bootstrap.cmd
connect envoy -grpc-addr unix://alloc/tmp/consul_grpc.sock -http-addr 127.0.0.1:8500 -admin-bind 127.0.0.2:19001 -address 127.0.0.1:19101 -proxy-id _nomad-task-fc6eb399-b703-c241-2d71-cf860f3da995-group-flask-flask-fivethousand-sidecar-proxy -bootstrap -token 69ad5818-841d-2ebc-f060-16c4ad930e6a
$ nomad alloc fs fc6eb399 alloc/logs/envoy_bootstrap.stderr.0
All the logs look fine. Next, let's check the ingress gateway config that's been written to Consul:
$ consul config read -kind ingress-gateway -name flask-ingress
{
"Kind": "ingress-gateway",
"Name": "flask-ingress",
"Partition": "default",
"Namespace": "default",
"TLS": {
"Enabled": false
},
"Listeners": [
{
"Port": 5555,
"Protocol": "http",
"Services": [
{
"Name": "flask",
"Hosts": [
"*"
],
"Namespace": "default",
"Partition": "default",
"TLS": {},
"RequestHeaders": {},
"ResponseHeaders": {}
}
]
}
],
"CreateIndex": 114,
"ModifyIndex": 114
}
Everything there looks ok. I then tried starting a local Connect proxy (with a Consul management token in my environment) and got the same error:
$ curl localhost:5555
upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: delayed connect error: 111%
But I can see in the proxy logs that I am connecting:
$ consul connect proxy -log-level trace -service flask-ingress -upstream flask:5555
==> Consul Connect proxy starting...
Configuration mode: Flags
Service: flask-ingress
Upstream: flask => :5555
Public listener: Disabled
==> Log data will now stream in as it occurs:
2024-06-25T15:01:02.780-0400 [DEBUG] proxy: got new config
2024-06-25T15:01:02.780-0400 [INFO] proxy: Starting listener: listener=127.0.0.1:5555->service:default/default/flask bind_addr=127.0.0.1:5555
2024-06-25T15:01:02.781-0400 [INFO] proxy: Proxy loaded config and ready to serve
2024-06-25T15:01:02.781-0400 [INFO] proxy: Parsed TLS identity: uri=spiffe://58fa1480-b3d1-4ac8-6324-91e30ccf4099.consul/ns/default/dc/dc1/svc/flask-ingress
2024-06-25T15:01:04.937-0400 [DEBUG] proxy.connect: resolved service instance: service=flask-ingress address=10.37.105.17:26380 identity=spiffe:///ns/default/dc/dc1/svc/flask
2024-06-25T15:01:04.940-0400 [DEBUG] proxy.connect: successfully connected to service instance: service=flask-ingress address=10.37.105.17:26380 identity=spiffe:///ns/default/dc/dc1/svc/flask
That's puzzling, so I went over to the client and checked that we could see the Envoy process listening if we enter the network namespace of the allocation.
$ docker-net-nsenter c625946ec76a /bin/bash
root@nomad0:/home/ubuntu# netstat -antp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.2:19001 0.0.0.0:* LISTEN 4191/envoy
tcp 0 0 0.0.0.0:5000 0.0.0.0:* LISTEN 4312/python
tcp 0 0 0.0.0.0:26380 0.0.0.0:* LISTEN 4191/envoy
tcp 0 0 172.26.64.206:5000 10.37.105.17:50088 TIME_WAIT -
tcp 0 0 172.26.64.206:5000 10.37.105.17:37200 TIME_WAIT -
tcp 0 0 172.26.64.206:5000 10.37.105.17:50558 TIME_WAIT -
tcp 0 0 172.26.64.206:5000 10.37.105.17:49502 TIME_WAIT -
tcp 0 0 172.26.64.206:26380 172.26.64.1:48280 ESTABLISHED 4191/envoy
tcp 0 0 172.26.64.206:5000 10.37.105.17:41804 TIME_WAIT -
tcp 0 0 172.26.64.206:5000 10.37.105.17:36892 TIME_WAIT -
The little helper I'm using to attach to the pause container:
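A minimal sketch of such a helper (hypothetical, since the author's actual script isn't shown here): it looks up the container's PID with docker inspect and uses nsenter to join just its network namespace. Assumes docker and nsenter(1) are on the PATH and that it runs as root:

```shell
# Enter the network namespace of a Docker container and run a command there.
# Usage: docker_net_nsenter <container-id> <command> [args...]
docker_net_nsenter() {
  local container="$1"; shift
  local pid
  # The container's init PID anchors its network namespace on the host
  pid="$(docker inspect --format '{{.State.Pid}}' "$container")" || return 1
  # --net: join only the network namespace of the target PID
  nsenter --target "$pid" --net -- "$@"
}
```

With that in place, `docker_net_nsenter c625946ec76a /bin/bash` drops a shell into the pause container's namespace, where tools like netstat and tcpdump see the allocation's sockets.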
If I then run tcpdump -A inside the Flask application's network namespace and make a request to the ingress, I see not only the inbound request but the response from the Flask application:
14:35:21.210642 IP 10.37.105.1.41222 > 172.26.64.209.5555: Flags [S], seq 190714019, win 64240, options [mss 1460,sackOK,TS val 2466159760 ecr 0,nop,wscale 7], length 0
...
14:35:21.958615 IP 172.26.64.206.5000 > 10.37.105.17.57436: Flags [P.], seq 18:155, ack 153, win 508, options [nop,nop,TS val 3802047657 ecr 2982470331], length 137
E...."@.@.C...@.
%i....\..h._.>.....`......
........Content-Type: text/html; charset=utf-8
Content-Length: 13
Server: Werkzeug/0.16.1 Python/3.8.1
Date: Tue, 25 Jun 2024 18:35:21 GMT
So the request is coming through and the application is sending a response, but somehow it's not making it all the way back out!
@brian-athinkingape, at this point I'm quite stumped... I'm going to try to wrangle some help from the Consul folks to see if they have thoughts on where we might be going wrong.
As an aside, I strongly recommend that you move off of the deprecated ingress gateway to the API gateway. From the Consul docs:
Ingress gateway is deprecated and will not be enhanced beyond its current capabilities. Ingress gateway is fully supported in this version but will be removed in a future release of Consul.
We've got a tutorial on deploying API Gateway on Nomad with the new Workload Identity workflow here: https://developer.hashicorp.com/nomad/tutorials/integrate-consul/deploy-api-gateway-on-nomad
Nomad version
Nomad v1.5.10
BuildDate 2023-10-30T13:26:22Z
Revision 3d7f65f481c5b263d6c82f03862c27447cf1794b
Consul version
Consul v1.14.11
Revision c0c5688c
Build Date 2023-10-31T13:58:53Z
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
Docker version
Docker version 26.1.1, build 4cf5afa
Operating system and Environment details
Ubuntu 22.04.4 LTS (fresh install using AWS image) AWS c6a.xlarge
Issue
I'm running a setup with Nomad + Consul Connect (I'm providing a simplified test case of the problems we're encountering in our actual systems). I'm trying to set up a service running in Docker, listening on some port (let's say that we can't customize it for some reason; in this example it's Flask listening at port 5000). I also want to set up an ingress gateway to forward requests to that service (and have it listen on port 5555).
If I set up the Flask container with a port where to = 5000, then my ingress gateway fails even if the container is running. If I also set static = 5000 on the Flask port, then everything works fine. However, I can't set that in production, since there will be multiple copies of the container running on a server.
Reproduction steps
Run the two job files specified below. My server is running at 10.16.0.151. When I run curl http://10.16.0.151:<dynamic port allocated by Nomad to the Flask container> I get a 200 response with a body of Hello, World! as expected. However, running curl against the static ingress port does not give me the correct behaviour.
Expected Result
When I run curl http://10.16.0.151:5555 I should also get a 200 response with a body of Hello, World!.
Actual Result
When I run curl http://10.16.0.151:5555 I get a 503 response with a body of upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111.
However, if I uncomment the # static = 5000 line in the Flask file, then the host port and container port match (both = 5000), and curl-ing the ingress container returns the expected 200 response.
Job file (if appropriate)
Nomad config
Consul config
Flask job
Ingress job