hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

could not configure HAProxy to connect to _nomad._http.service.consul #8250

Closed maxadamo closed 4 years ago

maxadamo commented 4 years ago

Nomad version

0.11.3

Operating system and Environment details

CentOS 7

Issue

I could not configure HAProxy to serve the Nomad Web UI using the SRV record provided by Nomad.

Reproduction steps

I am trying to configure HAProxy to serve the Nomad web UI through its SRV record. I found out that there are several tags available, one of which is http, listening on port 4646. I tried to look up the SRV record: _nomad._http.service.<my-domain> and it works as expected. Then I added this to the backend:

backend nomad
  mode http
  option httpchk GET /ui/servers
  http-check expect status 307
  timeout connect 10s
  timeout server 1m
  balance source
  server-template nomad 3 _nomad._http.service.<MY-DOMAIN> check inter 10s resolvers consul

This same stanza works for containers and other types of Nomad jobs; this is the only case where it's not working. In the HAProxy logs I see either this:

Server nomad/nomad1 is DOWN, reason: Socket error, check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

or this:

Server nomad/nomad1 was DOWN and now enters maintenance (unspecified DNS error).

Is this supposed to work? Meanwhile, I can also check whether HAProxy is limited to the _tcp tag only (not even _udp, because HAProxy does not support it).
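
For anyone reproducing this, the health check in the backend above (GET /ui/servers, expecting a 307) can be run by hand against a single Nomad server to rule out the check itself. This is a minimal sketch using only the Python standard library; the function name `check_backend` and its defaults are made up for illustration, and it only approximates what HAProxy does.

```python
import http.client


def check_backend(host: str, port: int, path: str = "/ui/servers",
                  expected_status: int = 307, timeout: float = 10.0) -> bool:
    """Return True if GET <path> on host:port answers with expected_status.

    http.client does not follow redirects, so a 307 is reported as-is,
    which is what an "expect status 307" style check needs.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status == expected_status
    except (OSError, http.client.HTTPException):
        # Refused connection, timeout, reset or malformed reply: treat as DOWN.
        return False
    finally:
        conn.close()
```

Calling `check_backend("prod-nomad01.domain.org", 4646)` from the HAProxy host would show whether the check passes when it is pointed at the port published in the SRV record.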

maxadamo commented 4 years ago

see issue => https://github.com/haproxy/haproxy/issues/709

tgross commented 4 years ago

Hi @maxadamo! Sorry to hear you're running into trouble! I'm not much of a HAProxy expert, but with respect to the two different error messages:

maxadamo commented 4 years ago

@tgross good idea. I left tcpdump port 4646 running on Nomad and I see nothing coming in.
Meanwhile, in HAProxy I see:

Server nomad/nomad3 is DOWN, reason: Socket error, check duration: 0ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.

maxadamo commented 4 years ago

In addition to the above situation, I see nothing going out from HAProxy to port 4646 either. Please note that I excluded ports 8007, 8008 and 8009, because I am already exposing other services there, BUT I see traffic to a random port on Nomad, which is consistent. By consistent I mean that it's always 23185 (I guess this port changes if I restart Nomad). IMO it's looking like Nomad is sending this port number to haproxy (in fact this port is in the range of ports used by Nomad). Then HAProxy runs a check against that port, which fails miserably.

# tcpdump '(host <MY-NET-IP-HERE>.234 or host <MY-NET-IP-HERE>.235 or host <MY-NET-IP-HERE>.236) && (not port 8007 && not port 8008 && not port 8009)'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
10:26:21.701719 IP prod-haproxy02.MYDOMAIN.org.47484 > prod-nomad03.MYDOMAIN.org.23185: Flags [S], seq 1996053247, win 32767, options [mss 1460,sackOK,TS val 4111717997 ecr 0,nop,wscale 11], length 0
10:26:21.702063 IP prod-nomad03.MYDOMAIN.org.23185 > prod-haproxy02.MYDOMAIN.org.47484: Flags [S.], seq 76913565, ack 1996053248, win 32767, options [mss 1460,sackOK,TS val 2077757155 ecr 4111717997,nop,wscale 11], length 0
10:26:21.702095 IP prod-haproxy02.MYDOMAIN.org.47484 > prod-nomad03.MYDOMAIN.org.23185: Flags [.], ack 1, win 16, options [nop,nop,TS val 4111717998 ecr 2077757155], length 0
10:26:21.702268 IP prod-haproxy02.MYDOMAIN.org.47484 > prod-nomad03.MYDOMAIN.org.23185: Flags [P.], seq 1:38, ack 1, win 16, options [nop,nop,TS val 4111717998 ecr 2077757155], length 37
10:26:21.702602 IP prod-nomad03.MYDOMAIN.org.23185 > prod-haproxy02.MYDOMAIN.org.47484: Flags [.], ack 38, win 16, options [nop,nop,TS val 2077757155 ecr 4111717998], length 0
10:26:21.702840 IP prod-nomad03.MYDOMAIN.org.23185 > prod-haproxy02.MYDOMAIN.org.47484: Flags [P.], seq 1:1382, ack 38, win 16, options [nop,nop,TS val 2077757155 ecr 4111717998], length 1381
10:26:21.702863 IP prod-haproxy02.MYDOMAIN.org.47484 > prod-nomad03.MYDOMAIN.org.23185: Flags [.], ack 1382, win 18, options [nop,nop,TS val 4111717999 ecr 2077757155], length 0
10:26:21.702941 IP prod-haproxy02.MYDOMAIN.org.47484 > prod-nomad03.MYDOMAIN.org.23185: Flags [F.], seq 38, ack 1382, win 18, options [nop,nop,TS val 4111717999 ecr 2077757155], length 0
10:26:21.702966 IP prod-nomad03.MYDOMAIN.org.23185 > prod-haproxy02.MYDOMAIN.org.47484: Flags [F.], seq 1382, ack 38, win 16, options [nop,nop,TS val 2077757155 ecr 4111717998], length 0
10:26:21.702976 IP prod-haproxy02.MYDOMAIN.org.47484 > prod-nomad03.MYDOMAIN.org.23185: Flags [.], ack 1383, win 18, options [nop,nop,TS val 4111717999 ecr 2077757155], length 0
10:26:21.703090 IP prod-nomad03.MYDOMAIN.org.23185 > prod-haproxy02.MYDOMAIN.org.47484: Flags [.], ack 39, win 16, options [nop,nop,TS val 2077757156 ecr 4111717999], length 0
10:26:21.703111 IP prod-haproxy02.MYDOMAIN.org.47484 > prod-nomad03.MYDOMAIN.org.23185: Flags [R], seq 1996053286, win 0, length 0

tgross commented 4 years ago

IMO it's looking like Nomad is sending this port number to haproxy (in fact this port is in the range of ports used by Nomad).

The tcpdump is showing that HAProxy is sending a SYN to that high port and then Nomad is ACK'ing on that port. But that doesn't explain why HAProxy has that port to begin with. You said the SRV query was working as expected though...
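
To check which ports a capture like the one above actually targets, the tcpdump text can be summarized by counting the destinations of the initial SYN packets. This is only a rough sketch: the helper name `syn_targets` is invented here, and the regular expression assumes the default one-line tcpdump output format shown in this thread.

```python
import re
from collections import Counter

# Matches the "IP src.port > dst.port:" part of a one-line tcpdump record.
_ENDPOINTS = re.compile(r"IP (\S+)\.(\d+) > (\S+)\.(\d+):")


def syn_targets(capture: str) -> Counter:
    """Count (destination host, destination port) pairs that received a SYN.

    Only packets flagged exactly [S] are counted, so SYN-ACK replies ([S.])
    and the rest of each connection are ignored.
    """
    targets = Counter()
    for line in capture.splitlines():
        match = _ENDPOINTS.search(line)
        if match and "Flags [S]," in line:
            targets[(match.group(3), int(match.group(4)))] += 1
    return targets
```

Run over the capture in the previous comment, this reports a single SYN, sent to prod-nomad03 on port 23185, and none to port 4646.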

maxadamo commented 4 years ago

1st question (SRV query): the query is not created by me. HAProxy backends are normally configured through the server directive, but for SRV records HAProxy uses the server-template directive and fills in the values it gets back in the SRV record. As written in the first message, this is the way it's configured:

server-template nomad 3 _nomad._http.service.<MY-DOMAIN> check inter 10s resolvers consul

| directive | meaning |
| --- | --- |
| server-template | the HAProxy directive for SRV records (please refer to the HAProxy documentation) |
| nomad | just an identifier, shown in the logs and dashboard |
| 3 | the expected number of backends (the server's index is also appended to the identifier above, e.g. nomad1) |
| _nomad._http.service.<MY-DOMAIN> | the SRV record |
| check inter 10s | the check interval against the backends |
| resolvers consul | normally HAProxy looks up DNS records only when it boots, but it's possible to specify a different resolvers stanza with a DNS TTL that fits your Consul configuration |
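
As an illustration of the breakdown above, the line can be split into its fields and the resulting server names listed, which matches the nomad1 and nomad3 names seen in the logs. This is a sketch only: `ServerTemplate`, `parse_server_template` and `server_names` are hypothetical helpers, not part of HAProxy, and only a plain count (not a range) is handled.

```python
from typing import List, NamedTuple


class ServerTemplate(NamedTuple):
    prefix: str         # identifier shown in logs and dashboard
    count: int          # number of server slots to create
    fqdn: str           # SRV record to resolve
    options: List[str]  # remaining per-server options


def parse_server_template(line: str) -> ServerTemplate:
    """Split a 'server-template <prefix> <count> <fqdn> [options]' line."""
    keyword, prefix, count, fqdn, *options = line.split()
    if keyword != "server-template":
        raise ValueError(f"not a server-template line: {line!r}")
    return ServerTemplate(prefix, int(count), fqdn, options)


def server_names(template: ServerTemplate) -> List[str]:
    """Names of the slots: the prefix followed by 1..count."""
    return [f"{template.prefix}{i}" for i in range(1, template.count + 1)]
```

For the line above this yields the names nomad1, nomad2 and nomad3, with `check inter 10s resolvers consul` left over as per-server options.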

maxadamo commented 4 years ago

second question (Consul catalog for the service):

[
  {
    "ID": "a4e0481e-c48c-403f-070f-856bc43f336e",
    "Node": "prod-nomad01",
    "Address": "xxx.xxx.xxx.234",
    "Datacenter": "geant",
    "TaggedAddresses": {
      "lan": "xxx.xxx.xxx.234",
      "lan_ipv4": "xxx.xxx.xxx.234",
      "wan": "xxx.xxx.xxx.234",
      "wan_ipv4": "xxx.xxx.xxx.234"
    },
    "NodeMeta": {
      "consul-network-segment": ""
    },
    "ServiceKind": "",
    "ServiceID": "_nomad-server-gzzjqsba6x3e23j7s4kxsqnagfyhr3sv",
    "ServiceName": "nomad",
    "ServiceTags": [
      "rpc"
    ],
    "ServiceAddress": "prod-nomad01.domain.org",
    "ServiceWeights": {
      "Passing": 1,
      "Warning": 1
    },
    "ServiceMeta": {
      "external-source": "nomad"
    },
    "ServicePort": 4647,
    "ServiceEnableTagOverride": false,
    "ServiceProxy": {
      "MeshGateway": {},
      "Expose": {}
    },
    "ServiceConnect": {},
    "CreateIndex": 111686825,
    "ModifyIndex": 111686825
  },
  {
    "ID": "a4e0481e-c48c-403f-070f-856bc43f336e",
    "Node": "prod-nomad01",
    "Address": "xxx.xxx.xxx.234",
    "Datacenter": "geant",
    "TaggedAddresses": {
      "lan": "xxx.xxx.xxx.234",
      "lan_ipv4": "xxx.xxx.xxx.234",
      "wan": "xxx.xxx.xxx.234",
      "wan_ipv4": "xxx.xxx.xxx.234"
    },
    "NodeMeta": {
      "consul-network-segment": ""
    },
    "ServiceKind": "",
    "ServiceID": "_nomad-server-k5oae5thrtxnlmhbxj3zgbe23d5vl4gr",
    "ServiceName": "nomad",
    "ServiceTags": [
      "http"
    ],
    "ServiceAddress": "prod-nomad01.domain.org",
    "ServiceWeights": {
      "Passing": 1,
      "Warning": 1
    },
    "ServiceMeta": {
      "external-source": "nomad"
    },
    "ServicePort": 4646,
    "ServiceEnableTagOverride": false,
    "ServiceProxy": {
      "MeshGateway": {},
      "Expose": {}
    },
    "ServiceConnect": {},
    "CreateIndex": 111686830,
    "ModifyIndex": 111686830
  },
  {
    "ID": "a4e0481e-c48c-403f-070f-856bc43f336e",
    "Node": "prod-nomad01",
    "Address": "xxx.xxx.xxx.234",
    "Datacenter": "geant",
    "TaggedAddresses": {
      "lan": "xxx.xxx.xxx.234",
      "lan_ipv4": "xxx.xxx.xxx.234",
      "wan": "xxx.xxx.xxx.234",
      "wan_ipv4": "xxx.xxx.xxx.234"
    },
    "NodeMeta": {
      "consul-network-segment": ""
    },
    "ServiceKind": "",
    "ServiceID": "_nomad-server-vi7c7s3qnipohi5nxgfrcbo6hasrjdrk",
    "ServiceName": "nomad",
    "ServiceTags": [
      "serf"
    ],
    "ServiceAddress": "prod-nomad01.domain.org",
    "ServiceWeights": {
      "Passing": 1,
      "Warning": 1
    },
    "ServiceMeta": {
      "external-source": "nomad"
    },
    "ServicePort": 4648,
    "ServiceEnableTagOverride": false,
    "ServiceProxy": {
      "MeshGateway": {},
      "Expose": {}
    },
    "ServiceConnect": {},
    "CreateIndex": 111686829,
    "ModifyIndex": 111686829
  },
  {
    "ID": "5c4b69ed-cbea-8e91-00a5-ec8d94f966d2",
    "Node": "prod-nomad02",
    "Address": "xxx.xxx.xxx.235",
    "Datacenter": "geant",
    "TaggedAddresses": {
      "lan": "xxx.xxx.xxx.235",
      "lan_ipv4": "xxx.xxx.xxx.235",
      "wan": "xxx.xxx.xxx.235",
      "wan_ipv4": "xxx.xxx.xxx.235"
    },
    "NodeMeta": {
      "consul-network-segment": ""
    },
    "ServiceKind": "",
    "ServiceID": "_nomad-server-clmdce2fmwd777naa7easl7vlwjax3pe",
    "ServiceName": "nomad",
    "ServiceTags": [
      "rpc"
    ],
    "ServiceAddress": "prod-nomad02.domain.org",
    "ServiceWeights": {
      "Passing": 1,
      "Warning": 1
    },
    "ServiceMeta": {
      "external-source": "nomad"
    },
    "ServicePort": 4647,
    "ServiceEnableTagOverride": false,
    "ServiceProxy": {
      "MeshGateway": {},
      "Expose": {}
    },
    "ServiceConnect": {},
    "CreateIndex": 111686947,
    "ModifyIndex": 111686947
  },
  {
    "ID": "5c4b69ed-cbea-8e91-00a5-ec8d94f966d2",
    "Node": "prod-nomad02",
    "Address": "xxx.xxx.xxx.235",
    "Datacenter": "geant",
    "TaggedAddresses": {
      "lan": "xxx.xxx.xxx.235",
      "lan_ipv4": "xxx.xxx.xxx.235",
      "wan": "xxx.xxx.xxx.235",
      "wan_ipv4": "xxx.xxx.xxx.235"
    },
    "NodeMeta": {
      "consul-network-segment": ""
    },
    "ServiceKind": "",
    "ServiceID": "_nomad-server-nklz5xzqvznzkux72vemwbtfsvu47iba",
    "ServiceName": "nomad",
    "ServiceTags": [
      "serf"
    ],
    "ServiceAddress": "prod-nomad02.domain.org",
    "ServiceWeights": {
      "Passing": 1,
      "Warning": 1
    },
    "ServiceMeta": {
      "external-source": "nomad"
    },
    "ServicePort": 4648,
    "ServiceEnableTagOverride": false,
    "ServiceProxy": {
      "MeshGateway": {},
      "Expose": {}
    },
    "ServiceConnect": {},
    "CreateIndex": 111686945,
    "ModifyIndex": 111686945
  },
  {
    "ID": "5c4b69ed-cbea-8e91-00a5-ec8d94f966d2",
    "Node": "prod-nomad02",
    "Address": "xxx.xxx.xxx.235",
    "Datacenter": "geant",
    "TaggedAddresses": {
      "lan": "xxx.xxx.xxx.235",
      "lan_ipv4": "xxx.xxx.xxx.235",
      "wan": "xxx.xxx.xxx.235",
      "wan_ipv4": "xxx.xxx.xxx.235"
    },
    "NodeMeta": {
      "consul-network-segment": ""
    },
    "ServiceKind": "",
    "ServiceID": "_nomad-server-ohm6ozvkvh5idfgjzbpqtcxav2yk275h",
    "ServiceName": "nomad",
    "ServiceTags": [
      "http"
    ],
    "ServiceAddress": "prod-nomad02.domain.org",
    "ServiceWeights": {
      "Passing": 1,
      "Warning": 1
    },
    "ServiceMeta": {
      "external-source": "nomad"
    },
    "ServicePort": 4646,
    "ServiceEnableTagOverride": false,
    "ServiceProxy": {
      "MeshGateway": {},
      "Expose": {}
    },
    "ServiceConnect": {},
    "CreateIndex": 111686946,
    "ModifyIndex": 111686946
  },
  {
    "ID": "2be32247-cf4f-4aec-8359-91b2de09360f",
    "Node": "prod-nomad03",
    "Address": "xxx.xxx.xxx.236",
    "Datacenter": "geant",
    "TaggedAddresses": {
      "lan": "xxx.xxx.xxx.236",
      "lan_ipv4": "xxx.xxx.xxx.236",
      "wan": "xxx.xxx.xxx.236",
      "wan_ipv4": "xxx.xxx.xxx.236"
    },
    "NodeMeta": {
      "consul-network-segment": ""
    },
    "ServiceKind": "",
    "ServiceID": "_nomad-server-3ub5lpsauwpvxf3shmk7vndcfwg567yp",
    "ServiceName": "nomad",
    "ServiceTags": [
      "rpc"
    ],
    "ServiceAddress": "prod-nomad03.domain.org",
    "ServiceWeights": {
      "Passing": 1,
      "Warning": 1
    },
    "ServiceMeta": {
      "external-source": "nomad"
    },
    "ServicePort": 4647,
    "ServiceEnableTagOverride": false,
    "ServiceProxy": {
      "MeshGateway": {},
      "Expose": {}
    },
    "ServiceConnect": {},
    "CreateIndex": 111687164,
    "ModifyIndex": 111687164
  },
  {
    "ID": "2be32247-cf4f-4aec-8359-91b2de09360f",
    "Node": "prod-nomad03",
    "Address": "xxx.xxx.xxx.236",
    "Datacenter": "geant",
    "TaggedAddresses": {
      "lan": "xxx.xxx.xxx.236",
      "lan_ipv4": "xxx.xxx.xxx.236",
      "wan": "xxx.xxx.xxx.236",
      "wan_ipv4": "xxx.xxx.xxx.236"
    },
    "NodeMeta": {
      "consul-network-segment": ""
    },
    "ServiceKind": "",
    "ServiceID": "_nomad-server-4767cuvysodw3xsoiqv4jxp5tyv6dg7w",
    "ServiceName": "nomad",
    "ServiceTags": [
      "serf"
    ],
    "ServiceAddress": "prod-nomad03.domain.org",
    "ServiceWeights": {
      "Passing": 1,
      "Warning": 1
    },
    "ServiceMeta": {
      "external-source": "nomad"
    },
    "ServicePort": 4648,
    "ServiceEnableTagOverride": false,
    "ServiceProxy": {
      "MeshGateway": {},
      "Expose": {}
    },
    "ServiceConnect": {},
    "CreateIndex": 111687163,
    "ModifyIndex": 111687163
  },
  {
    "ID": "2be32247-cf4f-4aec-8359-91b2de09360f",
    "Node": "prod-nomad03",
    "Address": "xxx.xxx.xxx.236",
    "Datacenter": "geant",
    "TaggedAddresses": {
      "lan": "xxx.xxx.xxx.236",
      "lan_ipv4": "xxx.xxx.xxx.236",
      "wan": "xxx.xxx.xxx.236",
      "wan_ipv4": "xxx.xxx.xxx.236"
    },
    "NodeMeta": {
      "consul-network-segment": ""
    },
    "ServiceKind": "",
    "ServiceID": "_nomad-server-socoweynoegow24o2rhl72tt3qqam47z",
    "ServiceName": "nomad",
    "ServiceTags": [
      "http"
    ],
    "ServiceAddress": "prod-nomad03.domain.org",
    "ServiceWeights": {
      "Passing": 1,
      "Warning": 1
    },
    "ServiceMeta": {
      "external-source": "nomad"
    },
    "ServicePort": 4646,
    "ServiceEnableTagOverride": false,
    "ServiceProxy": {
      "MeshGateway": {},
      "Expose": {}
    },
    "ServiceConnect": {},
    "CreateIndex": 111687165,
    "ModifyIndex": 111687165
  }
]

maxadamo commented 4 years ago

3rd question (Nomad configuration):

data_dir = "/var/nomad"

datacenter = "geant"

bind_addr = "0.0.0.0" # the default

advertise {
  # Defaults to the first private IP address.
  http = "prod-nomad01.domain.org"
  rpc  = "prod-nomad01.domain.org"

  #serf = "prod-nomad01.domain.org:5648" # non-default ports may be specified
  serf = "prod-nomad01.domain.org" # non-default ports may be specified
}

server {
  enabled          = true
  bootstrap_expect = 3
}

client {
  enabled       = true
  network_speed = 1000
  reserved {
    disk = 4096
  }
  options {
    "driver.raw_exec.enable" = "1"
  }
}

consul {
  address = "127.0.0.1:8500"

  # Enabling the server and client to bootstrap using Consul.
  server_auto_join = true
  client_auto_join = true
  token            = "xxxxxxxx-xxxxx-xxxx-xxxx-xxxxxxxxxxxxxxx"
}

tgross commented 4 years ago

the query is not created by me.

Right, HAProxy makes that query. But you said "I tried to look up the SRV record: _nomad._http.service.<my-domain> and it works as expected" so I was trying to ask what query you were using and what the result was for that. When you make that SRV query to check, are you sure you're using the same nameserver configuration as HAProxy?

Consul catalog for the service

Ok, that all looks right to me. All the http ports are set to 4646 as I'd expect, and not 23185.
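
The same check can be done mechanically on a catalog dump like the one above, by collecting the ServicePort of every entry carrying a given tag. A minimal sketch, assuming the JSON has the shape shown in the previous comment; `ports_by_tag` is a made-up helper name.

```python
import json
from typing import Dict


def ports_by_tag(catalog_json: str, tag: str) -> Dict[str, int]:
    """Map each node name to the ServicePort of its entry carrying `tag`."""
    return {
        entry["Node"]: entry["ServicePort"]
        for entry in json.loads(catalog_json)
        # ServiceTags may be missing or null; treat both as "no tags".
        if tag in (entry.get("ServiceTags") or [])
    }
```

With the catalog output above, `ports_by_tag(data, "http")` maps prod-nomad01, prod-nomad02 and prod-nomad03 to 4646, so the catalog itself never mentions 23185.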

advertise {
 # Defaults to the first private IP address.
 http = "prod-nomad01.domain.org"
 rpc  = "prod-nomad01.domain.org"

 #serf = "prod-nomad01.domain.org:5648" # non-default ports may be specified
 serf = "prod-nomad01.domain.org" # non-default ports may be specified
}

The advertise block in your config is a little unusual to me. Typically I'd expect to see a sockaddr template if you want to advertise the first private IP address. Something like: "{{ GetPrivateIP }}". Using a DNS name complicates things a bit because you need to make sure you've got DNS registered somewhere for the Nomad server to look up before it's registered itself in Consul. Given that the Consul records look right to me, it looks like this is working out for you but it might be an extra complication you don't need.

maxadamo commented 4 years ago

Understood. Every server in the datacenter uses this resolver (with HAProxy I only need to tweak the TTL). I have Bind in front of Consul and this is what the query returns (it includes the port 4646):

_nomad._http.service.<my-domain>. 5 IN  SRV 1 1 4646 prod-nomad01.<my-domain>.
_nomad._http.service.<my-domain>. 5 IN  SRV 1 1 4646 prod-nomad02.<my-domain>.
_nomad._http.service.<my-domain>. 5 IN  SRV 1 1 4646 prod-nomad03.<my-domain>.
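
For comparison with what HAProxy ends up using, an answer in this textual form can be parsed into records and the ports checked. This is a small sketch under the assumption that each line follows the usual `name ttl IN SRV priority weight port target` layout shown above; `SrvRecord` and `parse_srv` are names invented for the example.

```python
from typing import List, NamedTuple


class SrvRecord(NamedTuple):
    name: str
    ttl: int
    priority: int
    weight: int
    port: int
    target: str


def parse_srv(answer: str) -> List[SrvRecord]:
    """Parse 'name ttl IN SRV priority weight port target' lines."""
    records = []
    for line in answer.splitlines():
        fields = line.split()
        # Skip blank lines, comments and any non-SRV records.
        if len(fields) != 8 or fields[2:4] != ["IN", "SRV"]:
            continue
        name, ttl, _, _, priority, weight, port, target = fields
        records.append(
            SrvRecord(name, int(ttl), int(priority), int(weight), int(port), target)
        )
    return records
```

For the three records above, `{r.port for r in parse_srv(answer)}` is `{4646}`, which is the port the health check would be expected to hit.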

Are you saying that I could use "{{ GetPrivateIP }}" instead of "prod-nomad01.domain.org"? That's pretty cool, and it lets me avoid templating with Puppet. I'll try that one.

maxadamo commented 4 years ago

P.S.: just to clarify, using an IP or a hostname there won't solve this issue, because the name that I'm using in the Nomad configuration does not come from service discovery, but from normal DNS resolution.

tgross commented 4 years ago

Are you saying that I could use "{{ GetPrivateIP }}" instead of "prod-nomad01.domain.org"? That's pretty cool, and it lets me avoid templating with Puppet. I'll try that one.

Yup!

P.S.: just to clarify, using an IP or a hostname there won't solve this issue, because the name that I'm using in the Nomad configuration does not come from service discovery, but from normal DNS resolution.

Yeah, understood on that. At this point I'm really not sure what the issue can be, as the SRV records you're getting look correct as far as I can tell. Have you had any luck with the HAProxy folks?

maxadamo commented 4 years ago

no, they still need to triage the ticket.

tgross commented 4 years ago

I'm doing some issue cleanup and it doesn't seem like there's much else we can do here other than wait for the HAProxy folks. Going to close out this issue but please feel free to re-open if you hear back from them.

bedis commented 3 years ago

I confirm the bug was in HAProxy. A patch was submitted to the project today, with backport instructions for HAProxy 2.2 and above, where this problem occurs (it has happened since we started using the additional records provided by the DNS server).
