Open sebastiankr opened 3 years ago
Hi @sebastiankr! Thanks for including the logs. I'm seeing a few odd things: it looks as if the IP address changed between Nomad restarts. The first log shows the server advertising 10.0.1.1, but the initial Raft configuration is for 172.17.0.1. Nomad server identity is tied to IP addresses, and such changes are disruptive to Nomad operations. If the server has multiple interfaces, I'd suggest explicitly setting the server advertise addresses to a stable IP and trying again. If the workloads are ephemeral, you can also wipe the Nomad data dir on restart.
For context, Nomad defaults to using the IP address as the stable identifier for servers. If a server restarts with a new IP address, it is considered a new server. Raft protocol 3 addresses this issue by using a stable ID as the identity, so servers keep their identity after IP address changes.
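As a rough sketch of both suggestions, an agent configuration could pin the advertise addresses and opt in to Raft protocol 3. The 10.0.1.1 address is the one mentioned above; the single-server values and the file layout are assumptions, and raising raft_protocol on an existing cluster should follow the upgrade notes for your Nomad version.

```hcl
# Sketch only: merge into the existing agent configuration.

# Advertise a fixed address so the server identity does not depend on
# which interface Nomad auto-detects at startup.
advertise {
  http = "10.0.1.1"
  rpc  = "10.0.1.1"
  serf = "10.0.1.1"
}

server {
  enabled          = true
  bootstrap_expect = 1 # assumption: single-server setup

  # Raft protocol 3 identifies servers by a stable ID rather than by IP address.
  raft_protocol = 3
}
```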
Thanks @notnoop! This got me on the right track. 172.17.0.1 is the docker0 device. 10.0.1.1 is my subnet that nomad should be using and was using before, even after restarting the service.
It seems that the logic of how the private network is selected has changed and the upgrade to 1.0.4 has changed the identity to the docker ip.
I nuked /opt/nomad/data/ and did explicitly set the advertise ips and it is now running as expected.
Maybe advertise should not be optional?
Or the network selection logic should be stable and always prefer class A networks over B over C?
Not sure if this is something nomad should address.
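For anyone landing here later, a minimal sketch of pinning the advertise addresses to an interface instead of a literal IP. It assumes the advertise fields accept go-sockaddr templates (as the Nomad configuration docs describe) and that ens10 is the private interface carrying 10.0.1.1 on this host; adjust both to your setup.

```hcl
# Assumption: ens10 is the private-network interface on this host.
# GetInterfaceIP is a go-sockaddr template helper that resolves to the
# interface's address when the agent starts.
advertise {
  http = "{{ GetInterfaceIP \"ens10\" }}"
  rpc  = "{{ GetInterfaceIP \"ens10\" }}"
  serf = "{{ GetInterfaceIP \"ens10\" }}"
}
```

The template only helps if the interface name is stable across reboots; a literal IP is the simpler choice when the address itself never changes.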
Great! I'm glad that helped. I'm surprised that Nomad picked the docker0 interface, and that it picked different interfaces between runs. I'll examine that code flow and see how we can improve the situation.
Just so I don't miss an edge case, can I have the full output of the following commands: ip addr, ip link, and ip route? I wonder if a Nomad heuristic went wrong in your case.
root@server-0:~# ip ad
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 96:X:X:X:X:X brd ff:ff:ff:ff:ff:ff
inet 116.X.X.X/32 scope global eth0
valid_lft forever preferred_lft forever
inet 116.X.X.X/32 scope global dynamic eth0
valid_lft 60772sec preferred_lft 60772sec
inet6 2a01:X:X:Xc::1/64 scope global
valid_lft forever preferred_lft forever
inet6 fe80::X:X:X:/64 scope link
valid_lft forever preferred_lft forever
3: ens10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
link/ether 86:00:00:84:1c:37 brd ff:ff:ff:ff:ff:ff
inet 10.0.1.1/32 brd 10.0.1.1 scope global dynamic ens10
valid_lft 60775sec preferred_lft 60775sec
inet6 fe80::8400:ff:fe84:1c37/64 scope link
valid_lft forever preferred_lft forever
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
link/ether 02:42:c9:8c:2f:05 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
root@server-0:~# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 96:X:X:X:X36 brd ff:ff:ff:ff:ff:ff
3: ens10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
link/ether 86:00:00:84:1c:37 brd ff:ff:ff:ff:ff:ff
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
link/ether 02:42:c9:8c:2f:05 brd ff:ff:ff:ff:ff:ff
root@server-0:~# ip route
default via 172.X.1.1 dev eth0 proto dhcp src 116.X.X.X metric 100
10.0.0.0/16 via 10.0.0.1 dev ens10
10.0.0.1 dev ens10 scope link
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
172.X.1.1 dev eth0 proto dhcp scope link src 116.X.X.X metric 100
Hi there,
after upgrading nomad to 1.0.4, starting nomad via systemd fails to elect a leader. Starting via nomad agent -config /etc/nomad.d continues to work fine. Stopping and starting after a reboot via systemctl stop nomad && systemctl start nomad works as well. Why is the leader election failing only after a system restart?
I am running nomad in single server mode. Nomad was installed via apt.
Nomad version
Nomad v1.0.4 (9294f35f9aa8dbb4acb6e85fa88e3e2534a3e41a)
nomad.hcl
nomad.service
Operating system and Environment details
Ubuntu 20.04.2 LTS (GNU/Linux 5.4.0-66-generic x86_64)
Logs after reboot
Logs after restarting via systemctl stop nomad && systemctl start nomad