Hi @116davinder 👋
Do you have any idea where the IP 10.206.206.221 may be coming from? Do you have any other machine in your cluster with this IP, or would it be possible for 10.206.206.200 to have changed its IP for some reason?
Just noticed that health checks are also passing, which I find a bit surprising 🤔
I was wondering if this could be related to https://github.com/hashicorp/nomad/issues/16616, but I don't think it is for this reason.
Do you see these services being registered and then deregistered, or do they stay consistent? I ask because I would also expect Consul's anti-entropy logic to kick in and remove these services.
Another guess at this point would be for the node IP to change while the allocation is running, because I don't exactly remember if we read the node IP all the time or if it's associated with the allocation at creation time.
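If it helps, this is roughly how I'd check what each Consul agent has registered for the service (a sketch only; the service name node-exporter and the local agent address are assumptions, adjust them for your setup):

```sh
# Ask the Consul catalog which node/address each instance of the service is
# registered under. An empty ServiceAddress means the registration falls back
# to the node's Address.
curl -s http://127.0.0.1:8500/v1/catalog/service/node-exporter \
  | jq '.[] | {Node, Address, ServiceAddress, ServiceID}'
```

Comparing Address (the node as Consul knows it) with ServiceAddress (what the registration carried) should show where 10.206.206.221 is coming from.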
@lgfa29, I do see both IPs being registered in Consul under different agents. Since this service is a system job, it will run on all machines, so the Consul check should pass even though it will query/call another machine's REST endpoint.
> Hi @116davinder 👋
> Do you have any idea where the IP 10.206.206.221 may be coming from? Do you have any other machine in your cluster with this IP, or would it be possible for 10.206.206.200 to have changed its IP for some reason?

> Another guess at this point would be for the node IP to change while the allocation is running, because I don't exactly remember if we read the node IP all the time or if it's associated with the allocation at creation time.
So you are saying that if a given node/VM is recreated, I can see this behaviour because the allocation doesn't re-read the node IP when it starts the task again on the same node aka hostname (with a different IP).
Currently, my Terraform is bound to several network ranges/CIDR blocks, and it is fairly possible that a given VM/node is recreated and gets a different IP but keeps the same hostname.
My current assumption is that when Nomad starts the task on a given node, it should ask for the node IP or similar information from Consul while registering the task on that Consul agent, but it seems that's not the case; it has some cache or history.
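As a sanity check on that assumption, something like this should show the node IP Nomad currently has recorded versus what the host actually has (a rough sketch; attribute names may differ slightly between versions):

```sh
# IP address Nomad recorded for this node (look for unique.network.ip-address
# in the verbose attribute list).
nomad node status -self -verbose | grep -i 'ip-address'

# Addresses actually configured on the host right now.
hostname -I
```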
Maybe unrelated, but my production setup still has this issue as well: https://github.com/hashicorp/nomad/issues/17079 [only the Nomad/Consul servers are upgraded to the latest versions, not the workers].
Check this: the Consul server says that I have 66 instances of node-exporter, but actually I have fewer running.
FYI, version changes from the original description:
Nomad Server: 1.6.3 <---- changed
Nomad Client: 1.6.1
Consul Server: 1.16.3 <---- changed
Consul Client: 1.16.1
> Hi @116davinder 👋
> Do you have any idea where the IP 10.206.206.221 may be coming from? Do you have any other machine in your cluster with this IP, or would it be possible for 10.206.206.200 to have changed its IP for some reason?
I still have these IPs, which were reattached/reused in the last month or so, but the issue doesn't exist anymore on these machines; now it has moved to different machines.
> So you are saying that if a given node/VM is recreated, I can see this behaviour because the allocation doesn't re-read the node IP when it starts the task again on the same node aka hostname (with a different IP).
That was a guess on my part; I'm not sure if that's the case. I know there are some values that get persisted in the allocation and may get stale, but again, it was mostly a guess to record the thought process.
> Since this service is a system job, it will run on all machines, so the Consul check should pass even though it will query/call another machine's REST endpoint.
Ahh, I missed that it was a system job. I think that makes sense now.
> Check this: the Consul server says that I have 66 instances of node-exporter, but actually I have fewer running.
Yeah, this is likely https://github.com/hashicorp/nomad/issues/16616.
Another question: are the Nomad agents configured to always connect to their respective local Consul agents?
> Another question: are the Nomad agents configured to always connect to their respective local Consul agents?

Yes.
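One quick way to double-check that on a worker (a sketch; /etc/nomad.d is an assumed config path):

```sh
# Show the consul block from the Nomad agent configuration; the address is
# expected to point at the local agent, e.g. 127.0.0.1:8500.
grep -rA3 'consul' /etc/nomad.d/
```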
Doing a little issue cleanup here... the most likely explanation is that the host is being reused with the same hostname but a different IP address, and that the Consul server is retaining information about the Consul agent.
The thing we never looked at here is what Nomad thinks the IP address of the allocation is. There are two places I'd want to look here:
1. What does `nomad alloc status $alloc_id` say the address is? This tells us what the Nomad server has been told.
2. On the client, the `nomad operator client-state` command. We want to look at both the allocation's "network status" object and the node itself. This tells us what the Nomad client thinks, and what it should be transmitting to Consul.

I strongly suspect the root cause here is that the host is getting reused with a new IP address but without fully wiping the client state for both Nomad and the local Consul agent.
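For anyone picking this back up, a rough sketch of those two checks (the alloc ID and data-dir path are placeholders, and the client-state dump may need the client stopped since it opens the local state store):

```sh
# 1) What the Nomad server has been told about the allocation's address.
nomad alloc status -verbose <alloc_id>

# 2) What the Nomad client itself thinks: dump the client state store on the
#    node and inspect the allocation's network status and the node record.
nomad operator client-state /opt/nomad/data
```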
I see this issue has been open for a long time without further debugging. Sorry @116davinder that this fell through the cracks. I'm going to close it for now but if you are still seeing this problem and want to continue debugging, let me know and I'll be happy to re-open.
Nomad version
Nomad Server: 1.6.2
Nomad Client: 1.6.1
Consul Server: 1.16.2
Consul Client: 1.16.1
Operating system and Environment details
Ubuntu 20.04.x LTS
Issue
Nomad service registration/task check in Consul with a different IP than the node IP.
Example: Nomad services are either not deregistered or registered with a different IP.
Reproduction steps
I don't know yet. If I restart the Consul service on the same node, it gets fixed for some time, then the issue re-appears after several days or weeks.
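For reference, the temporary workaround is just restarting the local agent (a sketch; assumes a systemd unit named consul):

```sh
# Restarting the local Consul agent re-syncs the registrations for a while,
# but the wrong IP eventually comes back.
sudo systemctl restart consul
```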
Expected Result
The Consul service/node UI should show the correct node IP with respect to health checks or task checks.
Actual Result
As shown in the screenshots above.
Job file (if appropriate)
Nomad Server logs (if appropriate)
I only have info-level logs, which have nothing relevant in them.
Nomad Client logs (if appropriate)
/var/log/nomad# cat nomad.log from 10.206.206.200
Consul Client logs (if appropriate)
/var/log/consul/consul-1698652779758923384.log from 10.206.206.200