saqib-ahmed opened this issue 6 years ago
Hello, I noticed this several months ago too. I think this started well before 1.4.5.
It would be good to open a proper issue on the keepalived repo too.
@BertrandGouny I did ask in the keepalived repo: https://github.com/acassen/keepalived/issues/512 The collaborator there said that it is a Docker-related issue and not related to keepalived. This is the response I got:
> Your configuration appears to be using the default advert interval of 1 second. This means that when keepalived starts up, any VRRP instances (except those configured with priority 255) will wait approximately 3.8 seconds before transitioning to master (3 * advert_int + advert_int * (255 - priority) / 255; the 255 might be 256, I can't remember). Since 1.89 transitions to master at 13:18:52.6, it must have been in backup mode since 13:18:48.8, but from the tcpdump output we can see that during that time 1.89 is rejecting the VRRP adverts with ICMP "unreachable/administratively down" messages. keepalived on 1.89 therefore isn't receiving the adverts from 1.141, which is why 1.89 transitions to master. You should be able to see from the logs the time when 1.89 reports becoming backup, which I think in this case will have been either at 13:18:48 or 13:18:47.
>
> This problem relates somehow to your local setup, and possibly to the way the containers are handling the networking. It is not a keepalived issue.
You can continue the discussion over there.
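The timer arithmetic in the quoted reply can be sketched as follows. This is a rough sketch, not from the thread: it uses 256 in the skew divisor (the quote itself says it may be 255 or 256, and both give nearly the same result), with the priority 50 and advert_int 1 taken from the setup described later in this issue.

```shell
# Rough sketch of the quoted VRRP master-down timer:
#   3 * advert_int + skew,  where skew = (256 - priority) * advert_int / 256
# advert_int=1 and priority=50 come from the configuration in this thread.
advert_int=1
priority=50
awk -v a="$advert_int" -v p="$priority" \
    'BEGIN { skew = (256 - p) * a / 256; printf "%.2f\n", 3 * a + skew }'
```

With these values the backup waits about 3.80 s before transitioning to master, which matches the "approximately 3.8 seconds" in the reply.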
Thanks, and sorry, I read your first message a bit too fast.
Does this also occur if the container is run with `--privileged`?
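For reference, a hypothetical invocation with that flag might look like the following; the image name and the `--net=host` choice are placeholders, not taken from this thread.

```shell
# Hypothetical test run to check whether --privileged changes the behaviour.
# The image name is a placeholder; host networking is a common choice for
# VRRP in containers since adverts go to a multicast/unicast peer address.
docker run --rm --name keepalived-test \
    --privileged \
    --net=host \
    example/keepalived:latest
```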
This may be related to a larger problem I'm also facing with keepalived 2.x: it can't find the network interface in the container :s This is also reported in 1.4.5 when keepalived starts, but VRRP manages to use the interface after a short period of time.
Not sure what is happening :/
keepalived 1.3.5 also has this problem. After a system reboot, the VIP floats according to IP sequence.
```
! server 1
global_defs {
    router_id LVS_RABBITMQ_PROD1
    enable_script_security
}

vrrp_script chk_myscript {
    script "/usr/bin/pgrep sshd"
    ! "</dev/tcp/127.0.0.1/5672"
    interval 1
    fall 2
    rise 2
    user root
}

vrrp_instance VI_1 {
    state BACKUP
    interface ens192
    virtual_router_id 51
    priority 66
    nopreempt
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 123
    }
    unicast_src_ip 192.168.1.222
    unicast_peer {
        192.168.1.223
    }
    virtual_ipaddress {
        192.168.1.224/24
    }
    track_script {
        chk_myscript
    }
}
```
```
! server 2
global_defs {
    router_id LVS_RABBITMQ_PROD2
    enable_script_security
}

vrrp_script chk_myscript {
    script "/usr/bin/pgrep sshd"
    ! "</dev/tcp/127.0.0.1/5672"
    interval 1
    fall 2
    rise 2
    user root
}

vrrp_instance VI_1 {
    state BACKUP
    interface ens192
    virtual_router_id 51
    priority 66
    nopreempt
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 123
    }
    unicast_src_ip 192.168.1.223
    unicast_peer {
        192.168.1.222
    }
    virtual_ipaddress {
        192.168.1.224/24
    }
    track_script {
        chk_myscript
    }
}
```
```
[Unit]
Description=LVS and VRRP High Availability Monitor
After=syslog.target network-online.target
After=rabbitmq-server.service
Requires=rabbitmq-server.service
```
log in server 1

```
Mar 2 08:16:43 FID Keepalived_vrrp[2765]: VRRP_Script(chk_myscript) succeeded
Mar 2 08:16:44 FID rabbitmq-server: completed with 4 plugins.
Mar 2 08:16:47 FID Keepalived_vrrp[2765]: VRRP_Instance(VI_1) Transition to MASTER STATE
Mar 2 08:16:47 FID Keepalived_vrrp[2765]: VRRP_Instance(VI_1) Received advert with higher priority 66, ours 66
Mar 2 08:16:47 FID Keepalived_vrrp[2765]: VRRP_Instance(VI_1) Entering BACKUP STATE
```

log in server 2

```
Mar 2 08:14:25 FID Keepalived_vrrp[2778]: VRRP_Instance(VI_1) Transition to MASTER STATE
Mar 2 08:14:26 FID Keepalived_vrrp[2778]: VRRP_Instance(VI_1) Entering MASTER STATE
Mar 2 08:14:26 FID Keepalived_vrrp[2778]: VRRP_Instance(VI_1) setting protocol VIPs.
Mar 2 08:14:26 FID Keepalived_vrrp[2778]: Sending gratuitous ARP on ens192 for 192.168.1.224
Mar 2 08:14:26 FID Keepalived_vrrp[2778]: VRRP_Instance(VI_1) Sending/queueing gratuitous ARPs on ens192 for 192.168.1.224
Mar 2 08:14:26 FID Keepalived_vrrp[2778]: Sending gratuitous ARP on ens192 for 192.168.1.224
Mar 2 08:14:26 FID Keepalived_vrrp[2778]: Sending gratuitous ARP on ens192 for 192.168.1.224
Mar 2 08:14:26 FID Keepalived_vrrp[2778]: Sending gratuitous ARP on ens192 for 192.168.1.224
Mar 2 08:14:26 FID Keepalived_vrrp[2778]: Sending gratuitous ARP on ens192 for 192.168.1.224
Mar 2 08:14:31 FID Keepalived_vrrp[2778]: Sending gratuitous ARP on ens192 for 192.168.1.224
Mar 2 08:14:31 FID Keepalived_vrrp[2778]: VRRP_Instance(VI_1) Sending/queueing gratuitous ARPs on ens192 for 192.168.1.224
Mar 2 08:14:31 FID Keepalived_vrrp[2778]: Sending gratuitous ARP on ens192 for 192.168.1.224
Mar 2 08:14:31 FID Keepalived_vrrp[2778]: Sending gratuitous ARP on ens192 for 192.168.1.224
Mar 2 08:14:31 FID Keepalived_vrrp[2778]: Sending gratuitous ARP on ens192 for 192.168.1.224
Mar 2 08:14:31 FID Keepalived_vrrp[2778]: Sending gratuitous ARP on ens192 for 192.168.1.224
```
log in server 1

```
Mar 2 08:14:25 FID Keepalived_vrrp[2763]: VRRP_Instance(VI_1) Received advert with higher priority 66, ours 66
Mar 2 08:14:25 FID Keepalived_vrrp[2763]: VRRP_Instance(VI_1) Entering BACKUP STATE
Mar 2 08:14:25 FID Keepalived_vrrp[2763]: VRRP_Instance(VI_1) removing protocol VIPs.
```

log in server 2

```
Mar 2 08:16:47 FID Keepalived_vrrp[2778]: VRRP_Instance(VI_1) Received advert with lower priority 66, ours 66, forcing new election
```
I have to follow these steps to keep the VIP from floating after an OS reboot:

1. `systemctl disable keepalived` to disable keepalived auto-start.
2. Change the priority in the conf to a smaller number, e.g. 65.
3. `systemctl start keepalived` to start keepalived manually after the OS reboot.
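A hypothetical script form of those three steps, to be run on the rebooted node; the config path and the `sed` pattern are assumptions based on the configuration posted above, not something stated in this thread.

```shell
# Hypothetical automation of the workaround; the sed pattern assumes the
# priority line reads exactly "priority 66" as in the configs shown above.
systemctl disable keepalived                        # 1) no auto-start at boot
sed -i 's/priority 66/priority 65/' \
    /etc/keepalived/keepalived.conf                 # 2) lower the priority
systemctl start keepalived                          # 3) start manually after reboot
```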
## Problem

`nopreempt` works great when the docker service stops/restarts, when my network interface goes down, and when I restart the keepalived container. But when I restart the machine with priority 51, it takes back control from the other node (it preempts). Following the discussion here, I added a 60-second delay before startup of the keepalived service inside my container (in `process.sh`), but it still preempts the node with lower priority after a minute. What could possibly be wrong here? It obviously isn't the network, because the network doesn't take that long to initialize. This is a clone of this issue.

## Configuration

My configuration file looks like below:
## Logs

I also tried manually starting the container some time after reboot, and it still preempts the lower-priority node. I get the following logs after rebooting the higher-priority node:
## tcpdump

I captured the tcpdump at the reboot time of the higher-priority node. Machine 1.89 has priority 51 and 1.141 (on which I'm dumping) has priority 50, with the above-mentioned configuration. In this dump, the machine with priority 51 (1.89) goes down at 13:13:15 and comes back up at 13:13:37. keepalived is started after a 5-minute delay, and the preemption occurs: you can see it happening at 13:18:52. Let me know if any further information is required to pin down the issue.
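For anyone trying to reproduce the capture, a minimal command along these lines should show both the adverts and any ICMP rejections. The interface name is taken from the configuration above; VRRP is IP protocol 112, which is a general fact rather than something from this dump.

```shell
# Capture VRRP adverts (IP protocol 112) plus ICMP replies on the
# interface used by the vrrp_instance above.
tcpdump -i ens192 -nn 'ip proto 112 or icmp'
```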