Closed maxadamo closed 4 years ago
see issue => https://github.com/haproxy/haproxy/issues/709
Hi @maxadamo! Sorry to hear you're running into trouble! I'm not much of a HAProxy expert, but with respect to the two different error messages:
_tcp tag or whether there's an additional field in the HAProxy config you need to make that work.
Server nomad/nomad1 is DOWN, reason: Socket error
Are you seeing any network connection being made to Nomad? Or is this another case of the DNS error manifesting differently in the HAProxy logs? It'd be worth checking the Nomad server logs and/or running tcpdump on the interface you're connecting to, in order to diagnose what network traffic is actually flowing.
@tgross good idea. I'm leaving tcpdump port 4646 running on Nomad and I see nothing coming in.
Meanwhile, in HAProxy I see:
Server nomad/nomad3 is DOWN, reason: Socket error, check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
In addition to the above, I see nothing going out from HAProxy. Please note, I excluded ports 8007, 8008 and 8009 because I am already exposing other services there, BUT I see a random port on Nomad, which is consistent. By consistent I mean it's always 23185 (I guess this port changes if I restart Nomad). IMO it's looking like Nomad is sending this port number to haproxy (in fact this port is in the range of ports used by Nomad). Then HAProxy runs a check against that port, which fails miserably.
# tcpdump '(host <MY-NET-IP-HERE>.234 or host <MY-NET-IP-HERE>.235 or host <MY-NET-IP-HERE>.236) && (not port 8007 && not port 8008 && not port 8009)'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
10:26:21.701719 IP prod-haproxy02.MYDOMAIN.org.47484 > prod-nomad03.MYDOMAIN.org.23185: Flags [S], seq 1996053247, win 32767, options [mss 1460,sackOK,TS val 4111717997 ecr 0,nop,wscale 11], length 0
10:26:21.702063 IP prod-nomad03.MYDOMAIN.org.23185 > prod-haproxy02.MYDOMAIN.org.47484: Flags [S.], seq 76913565, ack 1996053248, win 32767, options [mss 1460,sackOK,TS val 2077757155 ecr 4111717997,nop,wscale 11], length 0
10:26:21.702095 IP prod-haproxy02.MYDOMAIN.org.47484 > prod-nomad03.MYDOMAIN.org.23185: Flags [.], ack 1, win 16, options [nop,nop,TS val 4111717998 ecr 2077757155], length 0
10:26:21.702268 IP prod-haproxy02.MYDOMAIN.org.47484 > prod-nomad03.MYDOMAIN.org.23185: Flags [P.], seq 1:38, ack 1, win 16, options [nop,nop,TS val 4111717998 ecr 2077757155], length 37
10:26:21.702602 IP prod-nomad03.MYDOMAIN.org.23185 > prod-haproxy02.MYDOMAIN.org.47484: Flags [.], ack 38, win 16, options [nop,nop,TS val 2077757155 ecr 4111717998], length 0
10:26:21.702840 IP prod-nomad03.MYDOMAIN.org.23185 > prod-haproxy02.MYDOMAIN.org.47484: Flags [P.], seq 1:1382, ack 38, win 16, options [nop,nop,TS val 2077757155 ecr 4111717998], length 1381
10:26:21.702863 IP prod-haproxy02.MYDOMAIN.org.47484 > prod-nomad03.MYDOMAIN.org.23185: Flags [.], ack 1382, win 18, options [nop,nop,TS val 4111717999 ecr 2077757155], length 0
10:26:21.702941 IP prod-haproxy02.MYDOMAIN.org.47484 > prod-nomad03.MYDOMAIN.org.23185: Flags [F.], seq 38, ack 1382, win 18, options [nop,nop,TS val 4111717999 ecr 2077757155], length 0
10:26:21.702966 IP prod-nomad03.MYDOMAIN.org.23185 > prod-haproxy02.MYDOMAIN.org.47484: Flags [F.], seq 1382, ack 38, win 16, options [nop,nop,TS val 2077757155 ecr 4111717998], length 0
10:26:21.702976 IP prod-haproxy02.MYDOMAIN.org.47484 > prod-nomad03.MYDOMAIN.org.23185: Flags [.], ack 1383, win 18, options [nop,nop,TS val 4111717999 ecr 2077757155], length 0
10:26:21.703090 IP prod-nomad03.MYDOMAIN.org.23185 > prod-haproxy02.MYDOMAIN.org.47484: Flags [.], ack 39, win 16, options [nop,nop,TS val 2077757156 ecr 4111717999], length 0
10:26:21.703111 IP prod-haproxy02.MYDOMAIN.org.47484 > prod-nomad03.MYDOMAIN.org.23185: Flags [R], seq 1996053286, win 0, length 0
IMO it's looking like Nomad is sending this port number to haproxy (in fact this port is in the range of ports used by Nomad).
The tcpdump is showing that HAProxy is sending a SYN to that high port and then Nomad is ACK'ing on that port. But that doesn't explain why HAProxy has that port to begin with. You said the SRV query was working as expected though...
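To make the "which port is HAProxy actually dialing" question easier to answer from a longer capture, here is a minimal Python sketch that pulls the destination of every pure SYN out of tcpdump text output. The function names are hypothetical, the regex only covers the IPv4 line shape seen in the capture above, and the two sample lines are abbreviated from that capture (the options field is dropped).

```python
import re

# Matches the "IP src.port > dst.port: Flags [X]" part of a tcpdump text line.
LINE_RE = re.compile(
    r"IP (?P<src>\S+)\.(?P<sport>\d+) > (?P<dst>\S+)\.(?P<dport>\d+): "
    r"Flags \[(?P<flags>[^\]]+)\]"
)


def parse_line(line):
    """Return (src, sport, dst, dport, flags), or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return (m["src"], int(m["sport"]), m["dst"], int(m["dport"]), m["flags"])


def syn_targets(lines):
    """Collect (dst, dport) for every pure SYN, i.e. every connection attempt."""
    targets = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None and parsed[4] == "S":
            targets.append((parsed[2], parsed[3]))
    return targets


# Two lines abbreviated from the capture above (options field removed).
capture = [
    "10:26:21.701719 IP prod-haproxy02.MYDOMAIN.org.47484 > "
    "prod-nomad03.MYDOMAIN.org.23185: Flags [S], seq 1996053247, win 32767, length 0",
    "10:26:21.702063 IP prod-nomad03.MYDOMAIN.org.23185 > "
    "prod-haproxy02.MYDOMAIN.org.47484: Flags [S.], seq 76913565, ack 1996053248, "
    "win 32767, length 0",
]

print(syn_targets(capture))
```

Only the first line counts as a connection attempt; the SYN-ACK coming back from Nomad carries the flags "S." and is skipped.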
SRV record in Consul and the response?
1st question (SRV query):
the query is not created by me.
HAProxy backends are normally configured through the server directive.
But for SRV queries HAProxy uses server-template, and haproxy gets back the values in the SRV record. As written in the first message, this is the way it's configured:
server-template nomad 3 _nomad._http.service.<MY-DOMAIN> check inter 10s resolvers consul
directive | meaning |
---|---|
server-template | the HAProxy directive for SRV (please refer to the HAProxy documentation) |
nomad | just an identifier, shown in logs and dashboard |
3 | the expected number of backends (it is also appended to the identifier above) |
_nomad._http.service.<MY-DOMAIN> | the SRV record |
check inter 10s | the check interval against the backends |
resolvers consul | normally HAProxy resolves DNS records only when it starts, but it's possible to point it at a resolvers stanza with a DNS TTL that fits your Consul configuration |
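As a rough illustration of how the fields in that line fit together, here is a small Python sketch that splits the directive into its parts and derives the server names that show up in the logs (nomad1, nomad3, ...). The function name is hypothetical and it only handles the plain integer count used here, not every form the HAProxy manual allows.

```python
def parse_server_template(line):
    """Split a server-template line into prefix-derived names, FQDN and options.

    Assumption: the count is a plain integer, as in the line quoted above.
    """
    tokens = line.split()
    if len(tokens) < 4 or tokens[0] != "server-template":
        raise ValueError("not a server-template line: %r" % line)
    prefix, count, fqdn = tokens[1], int(tokens[2]), tokens[3]
    return {
        # One server slot per expected backend: prefix + running number.
        "names": [f"{prefix}{i}" for i in range(1, count + 1)],
        "fqdn": fqdn,
        "options": tokens[4:],
    }


directive = (
    "server-template nomad 3 _nomad._http.service.<MY-DOMAIN> "
    "check inter 10s resolvers consul"
)
parsed = parse_server_template(directive)
print(parsed["names"])
print(parsed["fqdn"])
print(parsed["options"])
```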
2nd question (Consul catalog for the service):
[
{
"ID": "a4e0481e-c48c-403f-070f-856bc43f336e",
"Node": "prod-nomad01",
"Address": "xxx.xxx.xxx.234",
"Datacenter": "geant",
"TaggedAddresses": {
"lan": "xxx.xxx.xxx.234",
"lan_ipv4": "xxx.xxx.xxx.234",
"wan": "xxx.xxx.xxx.234",
"wan_ipv4": "xxx.xxx.xxx.234"
},
"NodeMeta": {
"consul-network-segment": ""
},
"ServiceKind": "",
"ServiceID": "_nomad-server-gzzjqsba6x3e23j7s4kxsqnagfyhr3sv",
"ServiceName": "nomad",
"ServiceTags": [
"rpc"
],
"ServiceAddress": "prod-nomad01.domain.org",
"ServiceWeights": {
"Passing": 1,
"Warning": 1
},
"ServiceMeta": {
"external-source": "nomad"
},
"ServicePort": 4647,
"ServiceEnableTagOverride": false,
"ServiceProxy": {
"MeshGateway": {},
"Expose": {}
},
"ServiceConnect": {},
"CreateIndex": 111686825,
"ModifyIndex": 111686825
},
{
"ID": "a4e0481e-c48c-403f-070f-856bc43f336e",
"Node": "prod-nomad01",
"Address": "xxx.xxx.xxx.234",
"Datacenter": "geant",
"TaggedAddresses": {
"lan": "xxx.xxx.xxx.234",
"lan_ipv4": "xxx.xxx.xxx.234",
"wan": "xxx.xxx.xxx.234",
"wan_ipv4": "xxx.xxx.xxx.234"
},
"NodeMeta": {
"consul-network-segment": ""
},
"ServiceKind": "",
"ServiceID": "_nomad-server-k5oae5thrtxnlmhbxj3zgbe23d5vl4gr",
"ServiceName": "nomad",
"ServiceTags": [
"http"
],
"ServiceAddress": "prod-nomad01.domain.org",
"ServiceWeights": {
"Passing": 1,
"Warning": 1
},
"ServiceMeta": {
"external-source": "nomad"
},
"ServicePort": 4646,
"ServiceEnableTagOverride": false,
"ServiceProxy": {
"MeshGateway": {},
"Expose": {}
},
"ServiceConnect": {},
"CreateIndex": 111686830,
"ModifyIndex": 111686830
},
{
"ID": "a4e0481e-c48c-403f-070f-856bc43f336e",
"Node": "prod-nomad01",
"Address": "xxx.xxx.xxx.234",
"Datacenter": "geant",
"TaggedAddresses": {
"lan": "xxx.xxx.xxx.234",
"lan_ipv4": "xxx.xxx.xxx.234",
"wan": "xxx.xxx.xxx.234",
"wan_ipv4": "xxx.xxx.xxx.234"
},
"NodeMeta": {
"consul-network-segment": ""
},
"ServiceKind": "",
"ServiceID": "_nomad-server-vi7c7s3qnipohi5nxgfrcbo6hasrjdrk",
"ServiceName": "nomad",
"ServiceTags": [
"serf"
],
"ServiceAddress": "prod-nomad01.domain.org",
"ServiceWeights": {
"Passing": 1,
"Warning": 1
},
"ServiceMeta": {
"external-source": "nomad"
},
"ServicePort": 4648,
"ServiceEnableTagOverride": false,
"ServiceProxy": {
"MeshGateway": {},
"Expose": {}
},
"ServiceConnect": {},
"CreateIndex": 111686829,
"ModifyIndex": 111686829
},
{
"ID": "5c4b69ed-cbea-8e91-00a5-ec8d94f966d2",
"Node": "prod-nomad02",
"Address": "xxx.xxx.xxx.235",
"Datacenter": "geant",
"TaggedAddresses": {
"lan": "xxx.xxx.xxx.235",
"lan_ipv4": "xxx.xxx.xxx.235",
"wan": "xxx.xxx.xxx.235",
"wan_ipv4": "xxx.xxx.xxx.235"
},
"NodeMeta": {
"consul-network-segment": ""
},
"ServiceKind": "",
"ServiceID": "_nomad-server-clmdce2fmwd777naa7easl7vlwjax3pe",
"ServiceName": "nomad",
"ServiceTags": [
"rpc"
],
"ServiceAddress": "prod-nomad02.domain.org",
"ServiceWeights": {
"Passing": 1,
"Warning": 1
},
"ServiceMeta": {
"external-source": "nomad"
},
"ServicePort": 4647,
"ServiceEnableTagOverride": false,
"ServiceProxy": {
"MeshGateway": {},
"Expose": {}
},
"ServiceConnect": {},
"CreateIndex": 111686947,
"ModifyIndex": 111686947
},
{
"ID": "5c4b69ed-cbea-8e91-00a5-ec8d94f966d2",
"Node": "prod-nomad02",
"Address": "xxx.xxx.xxx.235",
"Datacenter": "geant",
"TaggedAddresses": {
"lan": "xxx.xxx.xxx.235",
"lan_ipv4": "xxx.xxx.xxx.235",
"wan": "xxx.xxx.xxx.235",
"wan_ipv4": "xxx.xxx.xxx.235"
},
"NodeMeta": {
"consul-network-segment": ""
},
"ServiceKind": "",
"ServiceID": "_nomad-server-nklz5xzqvznzkux72vemwbtfsvu47iba",
"ServiceName": "nomad",
"ServiceTags": [
"serf"
],
"ServiceAddress": "prod-nomad02.domain.org",
"ServiceWeights": {
"Passing": 1,
"Warning": 1
},
"ServiceMeta": {
"external-source": "nomad"
},
"ServicePort": 4648,
"ServiceEnableTagOverride": false,
"ServiceProxy": {
"MeshGateway": {},
"Expose": {}
},
"ServiceConnect": {},
"CreateIndex": 111686945,
"ModifyIndex": 111686945
},
{
"ID": "5c4b69ed-cbea-8e91-00a5-ec8d94f966d2",
"Node": "prod-nomad02",
"Address": "xxx.xxx.xxx.235",
"Datacenter": "geant",
"TaggedAddresses": {
"lan": "xxx.xxx.xxx.235",
"lan_ipv4": "xxx.xxx.xxx.235",
"wan": "xxx.xxx.xxx.235",
"wan_ipv4": "xxx.xxx.xxx.235"
},
"NodeMeta": {
"consul-network-segment": ""
},
"ServiceKind": "",
"ServiceID": "_nomad-server-ohm6ozvkvh5idfgjzbpqtcxav2yk275h",
"ServiceName": "nomad",
"ServiceTags": [
"http"
],
"ServiceAddress": "prod-nomad02.domain.org",
"ServiceWeights": {
"Passing": 1,
"Warning": 1
},
"ServiceMeta": {
"external-source": "nomad"
},
"ServicePort": 4646,
"ServiceEnableTagOverride": false,
"ServiceProxy": {
"MeshGateway": {},
"Expose": {}
},
"ServiceConnect": {},
"CreateIndex": 111686946,
"ModifyIndex": 111686946
},
{
"ID": "2be32247-cf4f-4aec-8359-91b2de09360f",
"Node": "prod-nomad03",
"Address": "xxx.xxx.xxx.236",
"Datacenter": "geant",
"TaggedAddresses": {
"lan": "xxx.xxx.xxx.236",
"lan_ipv4": "xxx.xxx.xxx.236",
"wan": "xxx.xxx.xxx.236",
"wan_ipv4": "xxx.xxx.xxx.236"
},
"NodeMeta": {
"consul-network-segment": ""
},
"ServiceKind": "",
"ServiceID": "_nomad-server-3ub5lpsauwpvxf3shmk7vndcfwg567yp",
"ServiceName": "nomad",
"ServiceTags": [
"rpc"
],
"ServiceAddress": "prod-nomad03.domain.org",
"ServiceWeights": {
"Passing": 1,
"Warning": 1
},
"ServiceMeta": {
"external-source": "nomad"
},
"ServicePort": 4647,
"ServiceEnableTagOverride": false,
"ServiceProxy": {
"MeshGateway": {},
"Expose": {}
},
"ServiceConnect": {},
"CreateIndex": 111687164,
"ModifyIndex": 111687164
},
{
"ID": "2be32247-cf4f-4aec-8359-91b2de09360f",
"Node": "prod-nomad03",
"Address": "xxx.xxx.xxx.236",
"Datacenter": "geant",
"TaggedAddresses": {
"lan": "xxx.xxx.xxx.236",
"lan_ipv4": "xxx.xxx.xxx.236",
"wan": "xxx.xxx.xxx.236",
"wan_ipv4": "xxx.xxx.xxx.236"
},
"NodeMeta": {
"consul-network-segment": ""
},
"ServiceKind": "",
"ServiceID": "_nomad-server-4767cuvysodw3xsoiqv4jxp5tyv6dg7w",
"ServiceName": "nomad",
"ServiceTags": [
"serf"
],
"ServiceAddress": "prod-nomad03.domain.org",
"ServiceWeights": {
"Passing": 1,
"Warning": 1
},
"ServiceMeta": {
"external-source": "nomad"
},
"ServicePort": 4648,
"ServiceEnableTagOverride": false,
"ServiceProxy": {
"MeshGateway": {},
"Expose": {}
},
"ServiceConnect": {},
"CreateIndex": 111687163,
"ModifyIndex": 111687163
},
{
"ID": "2be32247-cf4f-4aec-8359-91b2de09360f",
"Node": "prod-nomad03",
"Address": "xxx.xxx.xxx.236",
"Datacenter": "geant",
"TaggedAddresses": {
"lan": "xxx.xxx.xxx.236",
"lan_ipv4": "xxx.xxx.xxx.236",
"wan": "xxx.xxx.xxx.236",
"wan_ipv4": "xxx.xxx.xxx.236"
},
"NodeMeta": {
"consul-network-segment": ""
},
"ServiceKind": "",
"ServiceID": "_nomad-server-socoweynoegow24o2rhl72tt3qqam47z",
"ServiceName": "nomad",
"ServiceTags": [
"http"
],
"ServiceAddress": "prod-nomad03.domain.org",
"ServiceWeights": {
"Passing": 1,
"Warning": 1
},
"ServiceMeta": {
"external-source": "nomad"
},
"ServicePort": 4646,
"ServiceEnableTagOverride": false,
"ServiceProxy": {
"MeshGateway": {},
"Expose": {}
},
"ServiceConnect": {},
"CreateIndex": 111687165,
"ModifyIndex": 111687165
}
]
3rd question (Nomad configuration):
data_dir = "/var/nomad"
datacenter = "geant"
bind_addr = "0.0.0.0" # the default
advertise {
# Defaults to the first private IP address.
http = "prod-nomad01.domain.org"
rpc = "prod-nomad01.domain.org"
#serf = "prod-nomad01.domain.org:5648" # non-default ports may be specified
serf = "prod-nomad01.domain.org" # non-default ports may be specified
}
server {
enabled = true
bootstrap_expect = 3
}
client {
enabled = true
network_speed = 1000
reserved {
disk = 4096
}
options {
"driver.raw_exec.enable" = "1"
}
}
consul {
address = "127.0.0.1:8500"
# Enabling the server and client to bootstrap using Consul.
server_auto_join = true
client_auto_join = true
token = "xxxxxxxx-xxxxx-xxxx-xxxx-xxxxxxxxxxxxxxx"
}
the query is not created by me.
Right, HAProxy makes that query. But you said "I tried to lookup for the SRV record: _nomad._http.service.<my-domain> and it works as expected" so I was trying to ask what query you were using and what the result was for that. When you make that SRV query to check, are you sure you're using the same nameserver configuration as HAProxy?
Consul catalog for the service
Ok, that all looks right to me. All the http ports are set to 4646 as I'd expect, and not 23185.
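For anyone who wants to repeat that check on their own catalog output, a minimal Python sketch follows. The function name is hypothetical, and the sample entries are trimmed to four fields copied from the prod-nomad01 entries above; a real catalog response carries many more fields.

```python
import json

# Trimmed sample: only the fields needed for the check, values from the output above.
CATALOG_JSON = """
[
  {"Node": "prod-nomad01", "ServiceName": "nomad", "ServiceTags": ["rpc"],  "ServicePort": 4647},
  {"Node": "prod-nomad01", "ServiceName": "nomad", "ServiceTags": ["http"], "ServicePort": 4646},
  {"Node": "prod-nomad01", "ServiceName": "nomad", "ServiceTags": ["serf"], "ServicePort": 4648}
]
"""


def ports_for_tag(entries, tag):
    """Return sorted (node, port) pairs for all entries carrying the given tag."""
    return sorted(
        (entry["Node"], entry["ServicePort"])
        for entry in entries
        if tag in entry.get("ServiceTags", [])
    )


entries = json.loads(CATALOG_JSON)
http_ports = ports_for_tag(entries, "http")
print(http_ports)

# Flag any http-tagged entry that is not registered on the expected port.
unexpected = [(node, port) for node, port in http_ports if port != 4646]
print("unexpected ports:", unexpected)
```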
advertise {
# Defaults to the first private IP address.
http = "prod-nomad01.domain.org"
rpc = "prod-nomad01.domain.org"
#serf = "prod-nomad01.domain.org:5648" # non-default ports may be specified
serf = "prod-nomad01.domain.org" # non-default ports may be specified
}
The advertise block in your config is a little unusual to me. Typically I'd expect to see a sockaddr template if you want to advertise the first private IP address. Something like: "{{ GetPrivateIP }}". Using a DNS name complicates things a bit because you need to make sure you've got DNS registered somewhere for the Nomad server to look up before it's registered itself in Consul. Given that the Consul records look right to me, it looks like this is working out for you but it might be an extra complication you don't need.
Understood. Every server in the datacenter uses this resolver (with HAProxy I only need to tweak the TTL). I have Bind in front of Consul and this is what the query returns (it includes the port 4646):
_nomad._http.service.<my-domain>. 5 IN SRV 1 1 4646 prod-nomad01.<my-domain>.
_nomad._http.service.<my-domain>. 5 IN SRV 1 1 4646 prod-nomad02.<my-domain>.
_nomad._http.service.<my-domain>. 5 IN SRV 1 1 4646 prod-nomad03.<my-domain>.
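To compare an answer like this against what HAProxy ends up dialing, the records can be split into their fields. Below is a minimal Python sketch, assuming the one-record-per-line text layout shown above (name, TTL, class, type, priority, weight, port, target); the function name is hypothetical.

```python
def parse_srv_line(line):
    """Parse one SRV answer line in the text layout shown above."""
    fields = line.split()
    if len(fields) != 8 or fields[3] != "SRV":
        raise ValueError("not an SRV answer line: %r" % line)
    name, ttl, rr_class, _rr_type, priority, weight, port, target = fields
    return {
        "name": name,
        "ttl": int(ttl),
        "class": rr_class,
        "priority": int(priority),
        "weight": int(weight),
        "port": int(port),
        "target": target,
    }


answer = [
    "_nomad._http.service.<my-domain>. 5 IN SRV 1 1 4646 prod-nomad01.<my-domain>.",
    "_nomad._http.service.<my-domain>. 5 IN SRV 1 1 4646 prod-nomad02.<my-domain>.",
    "_nomad._http.service.<my-domain>. 5 IN SRV 1 1 4646 prod-nomad03.<my-domain>.",
]

records = [parse_srv_line(line) for line in answer]
for record in records:
    print(record["target"], record["port"])
```

Every record here carries port 4646, so the 23185 seen in the capture does not come from the answer section of this query.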
Are you saying that I could use "{{ GetPrivateIP }}" instead of "prod-nomad01.domain.org"? That's pretty cool and I avoid templating with puppet. I'll try that one.
p.s.: just to clarify, that the use of an IP or host there, won't solve this issue, because the name that I'm using in Nomad configuration, does not come from service discovery, but from normal DNS resolution.
Are you saying that I could use "{{ GetPrivateIP }}" instead of "prod-nomad01.domain.org"? That's pretty cool and I avoid templating with puppet. I'll try that one.
Yup!
p.s.: just to clarify, that the use of an IP or host there, won't solve this issue, because the name that I'm using in Nomad configuration, does not come from service discovery, but from normal DNS resolution.
Yeah, understood on that. At this point I'm really not sure what the issue can be, as the SRV records you're getting look correct as far as I can tell. Have you had any luck with the HAProxy folks?
no, they still need to triage the ticket.
I'm doing some issue cleanup and it doesn't seem like there's much else we can do here other than wait for the HAProxy folks. Going to close out this issue but please feel free to re-open if you hear back from them.
I confirm the bug was in HAProxy and a patch has been submitted to the project today, with backport instructions for HAProxy 2.2 and above, where this problem occurs (it has happened since we use the additional records provided by the DNS server).
Nomad version
0.11.3
Operating system and Environment details
CentOS 7
Issue
I could not configure the Nomad web UI using the SRV record provided by Nomad.
Reproduction steps
I am trying to configure haproxy to serve the web UI of nomad, through its SRV record. I found out that there are several tags available, one of which is http, listening on port 4646. I tried to lookup for the SRV record: _nomad._http.service.<my-domain> and it works as expected. Then I add this to the backend:
server-template nomad 3 _nomad._http.service.<MY-DOMAIN> check inter 10s resolvers consul
This same stanza works for containers and other types of Nomad jobs: this is the only case where it's not working. In the HAProxy logs I have seen either this:
Server nomad/nomad1 is DOWN, reason: Socket error, check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
or this:
Server nomad/nomad1 was DOWN and now enters maintenance (unspecified DNS error).
Is this supposed to work? (Meanwhile I can also check if haproxy is limited to the _tcp tag only, not even _udp, because it's not supported by haproxy.)