hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

consul services registered with unexpected IP #18917

Closed: 116davinder closed this issue 4 months ago

116davinder commented 1 year ago

Nomad version

Nomad Server: 1.6.2
Nomad Client: 1.6.1
Consul Server: 1.16.2
Consul Client: 1.16.1

Operating system and Environment details

Ubuntu 20.04.x LTS

Issue

Nomad registers services/task checks in Consul with a different IP than the node's IP.

Example: Nomad services are either not deregistered or are registered with a different IP.

$ nslookup 10.206.206.221
221.206.206.10.in-addr.arpa     name = prd-xxxxxxx-e16-ix-1.xxxxx.com.
221.206.206.10.in-addr.arpa     name = prd-xxxxxxx-e16-ix-1.
(Screenshots: Consul node health view, Consul service instance view, Nomad node status)
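For reference, the address Consul has recorded for each service instance can be compared against the node's actual IP with something like the following (a sketch assuming the default Consul HTTP port 8500 and that jq is installed on the node):

# List every registered instance of the service with node name, node address,
# the address the service was registered with, and the service port.
curl -s http://127.0.0.1:8500/v1/catalog/service/node-exporter \
  | jq -r '.[] | [.Node, .Address, .ServiceAddress, (.ServicePort|tostring)] | @tsv'

# Compare against the addresses the host actually has.
ip -4 addr show | grep inet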

Reproduction steps

I don't know yet. If I restart the Consul service on the same node, it gets fixed for some time, then the problem reappears after several days or weeks.
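The workaround mentioned above looks roughly like this (a sketch assuming the Consul client runs as a systemd unit named consul on the affected node; restarting the agent forces a full anti-entropy sync, which re-registers services with the agent's current address):

# Restart the local Consul agent on the node showing the wrong IP.
sudo systemctl restart consul

# Watch the agent re-sync its services after the restart.
journalctl -u consul -f | grep -i 'synced service'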

Expected Result

The Consul service/node UI should show the correct node IP for health checks and task checks.

Actual Result

As shown in the screenshots above.

Job file (if appropriate)

variable "version" {
  type    = string
  default = "1.6.1"
}

variable "deployment" {
  type        = string
  description = "Deployment type [prd|tst]"
}

job "node-exporter" {

  constraint {
    distinct_hosts = true
  }

  constraint {
    attribute = meta.deployment
    value     = "${var.deployment}"
  }

  datacenters = ["*"]
  type        = "system"

  update {
    max_parallel = 3
    auto_revert  = true
  }

  group "node-exporter" {

    network {
      port "metrics" {
        static = 9100
      }
    }

    task "node-exporter" {

      driver = "exec"

      config {
        command = "./local/node_exporter-${var.version}.linux-amd64/node_exporter"
      }
      resources {
        cpu    = 50
        memory = 64
      }
      artifact {
        source = "https://github.com/prometheus/node_exporter/releases/download/v${var.version}/node_exporter-${var.version}.linux-amd64.tar.gz"
      }

      service {
        name     = "node-exporter"
        tags     = ["node_exporter"]
        port     = "metrics"
        provider = "consul"
        check {
          type     = "tcp"
          port     = "metrics"
          interval = "10s"
          timeout  = "5s"
        }
      }
    }
  }
  meta {
    deployment            = var.deployment
  }
}

Nomad Server logs (if appropriate)

I only have info-level logs, and they contain nothing relevant.

Nomad Client logs (if appropriate)

/var/log/nomad# cat nomad.log from 10.206.206.200

{"@level":"error","@message":"Newer Nomad version available: 1.6.2 (currently running: 1.6.1)","@module":"agent","@timestamp":"2023-10-29T18:24:43.065998Z"}
/var/log/nomad# cat nomad-1698603883066058283.log
/var/log/nomad# cat nomad-1698511075693479662.log
{"@level":"error","@message":"Newer Nomad version available: 1.6.2 (currently running: 1.6.1)","@module":"agent","@timestamp":"2023-10-27T01:48:47.547948Z"}
{"@level":"error","@message":"Newer Nomad version available: 1.6.2 (currently running: 1.6.1)","@module":"agent","@timestamp":"2023-10-27T21:45:17.834119Z"}
root@prd-xxxxx-e13-ic-1:/home/admin-ssh# nomad monitor -log-level=TRACE -json=true
{"@level":"trace","@message":"next heartbeat","@module":"client","@timestamp":"2023-10-30T16:11:20.240296Z","period":13689883390}
{"@level":"debug","@message":"request complete","@module":"http","@timestamp":"2023-10-30T16:11:26.089125Z","duration":1425578,"method":"GET","path":"/v1/metrics?format=prometheus"}
{"@level":"debug","@message":"request complete","@module":"http","@timestamp":"2023-10-30T16:11:33.728400Z","duration":1868535,"method":"GET","path":"/v1/metrics?format=prometheus"}
{"@level":"trace","@message":"next heartbeat","@module":"client","@timestamp":"2023-10-30T16:11:33.931612Z","period":11701586754}
{"@level":"trace","@message":"execute sync","@module":"consul.sync","@timestamp":"2023-10-30T16:11:36.789934Z","reason":0}
{"@level":"trace","@message":"next heartbeat","@module":"client","@timestamp":"2023-10-30T16:11:45.635389Z","period":11602568641}
{"@level":"debug","@message":"request complete","@module":"http","@timestamp":"2023-10-30T16:11:56.089074Z","duration":1409479,"method":"GET","path":"/v1/metrics?format=prometheus"}
{"@level":"trace","@message":"next heartbeat","@module":"client","@timestamp":"2023-10-30T16:11:57.240005Z","period":15339098779}
{"@level":"debug","@message":"request complete","@module":"http","@timestamp":"2023-10-30T16:12:03.728284Z","duration":1493670,"method":"GET","path":"/v1/metrics?format=prometheus"}
{"@level":"trace","@message":"execute sync","@module":"consul.sync","@timestamp":"2023-10-30T16:12:06.794767Z","reason":0}
{"@level":"trace","@message":"next heartbeat","@module":"client","@timestamp":"2023-10-30T16:12:12.581415Z","period":10633846583}
{"@level":"trace","@message":"next heartbeat","@module":"client","@timestamp":"2023-10-30T16:12:23.217604Z","period":18455962303}
{"@level":"debug","@message":"request complete","@module":"http","@timestamp":"2023-10-30T16:12:26.089095Z","duration":1420217,"method":"GET","path":"/v1/metrics?format=prometheus"}
{"@level":"debug","@message":"request complete","@module":"http","@timestamp":"2023-10-30T16:12:33.728452Z","duration":1426473,"method":"GET","path":"/v1/metrics?format=prometheus"}
{"@level":"trace","@message":"execute sync","@module":"consul.sync","@timestamp":"2023-10-30T16:12:36.797400Z","reason":0}
{"@level":"trace","@message":"next heartbeat","@module":"client","@timestamp":"2023-10-30T16:12:41.675797Z","period":18926252316}
{"@level":"debug","@message":"request complete","@module":"http","@timestamp":"2023-10-30T16:12:56.089139Z","duration":1350132,"method":"GET","path":"/v1/metrics?format=prometheus"}
{"@level":"trace","@message":"next heartbeat","@module":"client","@timestamp":"2023-10-30T16:13:00.604571Z","period":13330857199}
{"@level":"debug","@message":"request complete","@module":"http","@timestamp":"2023-10-30T16:13:03.728187Z","duration":1503290,"method":"GET","path":"/v1/metrics?format=prometheus"}
{"@level":"trace","@message":"execute sync","@module":"consul.sync","@timestamp":"2023-10-30T16:13:06.802149Z","reason":0}

Consul Client logs (if appropriate)

/var/log/consul/consul-1698652779758923384.log from 10.206.206.200

{"@level":"info","@message":"Synced service","@module":"agent","@timestamp":"2023-10-30T14:29:52.432550Z","service":"_nomad-client-xhopga5ts3lx3nwphl7x4bluginswpgw"}
{"@level":"info","@message":"Synced service","@module":"agent","@timestamp":"2023-10-30T14:31:05.929432Z","service":"_nomad-client-xhopga5ts3lx3nwphl7x4bluginswpgw"}
{"@level":"info","@message":"Synced service","@module":"agent","@timestamp":"2023-10-30T14:32:18.178688Z","service":"_nomad-client-xhopga5ts3lx3nwphl7x4bluginswpgw"}
{"@level":"info","@message":"Synced service","@module":"agent","@timestamp":"2023-10-30T14:33:48.158341Z","service":"_nomad-client-xhopga5ts3lx3nwphl7x4bluginswpgw"}
{"@level":"info","@message":"Synced service","@module":"agent","@timestamp":"2023-10-30T14:35:30.285120Z","service":"_nomad-client-xhopga5ts3lx3nwphl7x4bluginswpgw"}
{"@level":"info","@message":"Synced service","@module":"agent","@timestamp":"2023-10-30T14:36:34.052432Z","service":"_nomad-client-xhopga5ts3lx3nwphl7x4bluginswpgw"}
{"@level":"info","@message":"Synced service","@module":"agent","@timestamp":"2023-10-30T14:38:25.574282Z","service":"_nomad-client-xhopga5ts3lx3nwphl7x4bluginswpgw"}
{"@level":"info","@message":"Synced service","@module":"agent","@timestamp":"2023-10-30T14:40:01.466436Z","service":"_nomad-client-xhopga5ts3lx3nwphl7x4bluginswpgw"}
lgfa29 commented 11 months ago

Hi @116davinder 👋

Do you have any idea where the IP 10.206.206.221 may be coming from? Do you have any other machine in your cluster with this IP, or would it be possible for 10.206.206.200 to have changed its IP for some reason?

lgfa29 commented 11 months ago

Just noticed that health checks are also passing, which I find a bit surprising 🤔

I was wondering if this could be related to https://github.com/hashicorp/nomad/issues/16616, but I don't think that's the case here.

Do you see these services being registered and then deregistered, or do they stay consistent? I would also expect Consul's anti-entropy logic to kick in and remove these services.

Another guess at this point would be that the node IP changes while the allocation is running, since I don't remember exactly whether we read the node IP all the time or whether it's associated with the allocation at creation time.
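One way to answer this would be to compare what Nomad currently fingerprints for the client against what Consul believes the agent's address is (a sketch; the node ID is a placeholder and CLI access to both clusters is assumed):

# The IP address Nomad fingerprinted for the client node.
nomad node status -verbose <node-id> | grep -i 'unique.network.ip-address'

# The address Consul's gossip pool has for the agent on that host.
consul members | grep prd-xxxxxxx-e16-ix-1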

116davinder commented 11 months ago

@lgfa29, I do see both IPs registered in Consul under different agents. Since this job is a system job, it runs on all machines, so the Consul check still passes even though it queries another machine's REST endpoint.

Hi @116davinder 👋

Do you have any idea where the IP 10.206.206.221 may be coming from? Do you have any other machine in your cluster with this IP, or would it be possible for 10.206.206.200 to have changed its IP for some reason?

116davinder commented 11 months ago

Another guess at this point would be that the node IP changes while the allocation is running, since I don't remember exactly whether we read the node IP all the time or whether it's associated with the allocation at creation time.

So you are saying that if a given node/VM is recreated, I can see this behaviour because the allocation doesn't re-read the node IP when it starts the task again on the same node, i.e. the same hostname with a different IP.

Currently, my Terraform is bound to several network ranges/CIDR blocks, and it is fairly possible that a given VM/node was recreated and got a different IP but the same hostname.

My current assumption is that when Nomad starts the task on a given node, it should ask Consul for the node IP or similar information while registering the task with that Consul agent, but it seems that's not the case; it keeps some cache or history.
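One way to test this assumption would be to check whether the Consul catalog still holds a node entry with the old IP for the recreated hostname (a sketch assuming the default Consul HTTP port and jq):

# Node entries as Consul sees them, including addresses and tagged addresses.
consul catalog nodes -detailed | grep prd-xxxxxxx-e16-ix-1

# Or list every catalog node with its address via the HTTP API.
curl -s http://127.0.0.1:8500/v1/catalog/nodes \
  | jq -r '.[] | [.Node, .Address] | @tsv'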

It may be unrelated, but my production setup still has this issue as well: https://github.com/hashicorp/nomad/issues/17079 [only the Nomad/Consul servers are upgraded to the latest versions, not the workers].

116davinder commented 11 months ago

Check this: the Consul server says that I have 66 instances of node-exporter.

(screenshot: Consul UI showing 66 node-exporter instances)

But actually, I have fewer instances running.

(screenshot: the actual number of running instances)
116davinder commented 11 months ago

FYI: version changes from the original description:
Nomad Server: 1.6.3 <---- changed
Nomad Client: 1.6.1
Consul Server: 1.16.3 <---- changed
Consul Client: 1.16.1

116davinder commented 11 months ago

Hi @116davinder 👋

Do you have any idea where the IP 10.206.206.221 may be coming from? Do you have any other machine in your cluster with this IP, or would it be possible for 10.206.206.200 to have changed its IP for some reason?

I still have these IPs, which were reattached/reused in the last month or so, but the issue no longer exists on those machines; it has now moved to different machines.

lgfa29 commented 11 months ago

So you are saying that if a given node/VM is recreated, I can see this behaviour because the allocation doesn't re-read the node IP when it starts the task again on the same node, i.e. the same hostname with a different IP.

That was a guess on my part, I'm not sure if that's the case. I know there are some values that get persisted in the allocation and may get stale, but again, it was mostly a guess to record the thought process.

Since this job is a system job, it runs on all machines, so the Consul check still passes even though it queries another machine's REST endpoint.

Ahh, I missed that it was a system job. I think that makes sense now.

Check this: the Consul server says that I have 66 instances of node-exporter.

Yeah, this is likely https://github.com/hashicorp/nomad/issues/16616.


Another question: are the Nomad agents configured to always connect to their respective local Consul agents?

116davinder commented 11 months ago

Another question: are the Nomad agents configured to always connect to their respective local Consul agents?

yes
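For reference, this can be confirmed on each client by dumping the agent's effective Consul settings from the local API (a sketch assuming the default Nomad HTTP port 4646, no ACL token, and that the exact JSON field names may vary between versions):

# Show the Consul configuration the Nomad agent is actually using.
curl -s http://127.0.0.1:4646/v1/agent/self | jq '.config.Consul'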

tgross commented 4 months ago

Doing a little issue cleanup here... the most likely explanation is that the host is being reused with the same hostname but a different IP address, and that the Consul server is retaining stale information about the Consul agent on that host.

The thing we never looked at here is what Nomad thinks the IP address of the allocation is, both in the Nomad client's own view of the allocation and in what the local Consul agent has registered.

I strongly suspect the root cause here is that the host is getting reused with a new IP address but without fully wiping the client state for both Nomad and the local Consul agent.
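A sketch of that comparison (the allocation ID is a placeholder; assumes default ports and jq on the client):

# The addresses Nomad reports for the allocation.
nomad alloc status <alloc-id> | grep -A 3 'Allocation Addresses'

# The address the local Consul agent has registered for the service.
curl -s http://127.0.0.1:8500/v1/agent/services \
  | jq -r '.[] | select(.Service == "node-exporter") | [.ID, .Address, (.Port|tostring)] | @tsv'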

I see this issue has been open for a long time without further debugging. Sorry @116davinder that this fell through the cracks. I'm going to close it for now but if you are still seeing this problem and want to continue debugging, let me know and I'll be happy to re-open.