openinfrastructure / terraform-google-multinic

Connect two VPC networks with an auto-healing, auto-scaling group of IP router instances.
Apache License 2.0
10 stars 5 forks source link

Resolve race between google-network-daemon and google-startup-scripts #1

Closed jeffmccune closed 4 years ago

jeffmccune commented 5 years ago

Summary:

The configuration of policy routing is not reliable on boot. A new instance with incorrect policy routing results in packets forwarded out eth0 instead of eth1:

[jmccune@endpoint-vpc-link-shared-vpc-1 ~]$ ping 10.19.16.45
PING 10.19.16.45 (10.19.16.45) 56(84) bytes of data.
From 10.0.17.54 icmp_seq=1 Time to live exceeded
From 10.0.17.54 icmp_seq=2 Time to live exceeded
From 10.0.17.54 icmp_seq=3 Time to live exceeded
From 10.0.17.54 icmp_seq=4 Time to live exceeded
From 10.0.17.54 icmp_seq=5 Time to live exceeded
From 10.0.17.54 icmp_seq=6 Time to live exceeded

The TTL is exceed because a packet sent out eth0 is sent right back into eth0 by the Shared VPC static route.

The incorrect policy routing configuration is likely caused by a race between the google-network-daemon.service and the google-startup-scripts.service:

journalctl -u google-startup-scripts.service

-- Logs begin at Fri 2019-08-23 20:32:29 UTC, end at Fri 2019-08-23 21:29:27 UTC. --
Aug 23 20:32:43 vpc-link-lfwd-us-central1-a-px22 systemd[1]: Starting Google Compute Engine Startup Scripts...
Aug 23 20:32:44 vpc-link-lfwd-us-central1-a-px22 startup-script[1468]: INFO Starting startup scripts.
Aug 23 20:32:44 vpc-link-lfwd-us-central1-a-px22 startup-script[1468]: INFO Found startup-script in metadata.
Aug 23 20:32:44 vpc-link-lfwd-us-central1-a-px22 startup-script[1468]: INFO startup-script: Fri Aug 23 20:32:44 +0000 2019 Debug [1514]: BEGIN: stdlib::cmd() command=[systemctl restart systemd-sysctl.service]
Aug 23 20:32:45 vpc-link-lfwd-us-central1-a-px22 startup-script[1468]: INFO startup-script: Fri Aug 23 20:32:45 +0000 2019 Debug [1514]: END: stdlib::cmd() command=[systemctl restart systemd-sysctl.service] exit_code=0
Aug 23 20:32:45 vpc-link-lfwd-us-central1-a-px22 startup-script[1468]: INFO startup-script: Fri Aug 23 20:32:45 +0000 2019 Info [1514]: IP Forwarding enabled via /etc/sysctl.d/50-ip-router.conf
Aug 23 20:32:45 vpc-link-lfwd-us-central1-a-px22 startup-script[1468]: INFO startup-script: Fri Aug 23 20:32:45 +0000 2019 Debug [1514]: BEGIN: stdlib::cmd() command=[ip route add 10.0.3.0/24 src 10.0.3.55 dev eth1 table rt1]
Aug 23 20:32:45 vpc-link-lfwd-us-central1-a-px22 startup-script[1468]: INFO startup-script: Fri Aug 23 20:32:45 +0000 2019 Debug [1514]: END: stdlib::cmd() command=[ip route add 10.0.3.0/24 src 10.0.3.55 dev eth1 table rt1] exit_co
Aug 23 20:32:45 vpc-link-lfwd-us-central1-a-px22 startup-script[1468]: INFO startup-script: Fri Aug 23 20:32:45 +0000 2019 Debug [1514]: BEGIN: stdlib::cmd() command=[ip route add default via 10.0.3.1 dev eth1 table rt1]
Aug 23 20:32:45 vpc-link-lfwd-us-central1-a-px22 startup-script[1468]: INFO startup-script: Fri Aug 23 20:32:45 +0000 2019 Debug [1514]: END: stdlib::cmd() command=[ip route add default via 10.0.3.1 dev eth1 table rt1] exit_code=0
Aug 23 20:32:45 vpc-link-lfwd-us-central1-a-px22 startup-script[1468]: INFO startup-script: Fri Aug 23 20:32:45 +0000 2019 Debug [1514]: BEGIN: stdlib::cmd() command=[ip rule add from 10.0.3.55/32 table rt1]
Aug 23 20:32:45 vpc-link-lfwd-us-central1-a-px22 startup-script[1468]: INFO startup-script: Fri Aug 23 20:32:45 +0000 2019 Debug [1514]: END: stdlib::cmd() command=[ip rule add from 10.0.3.55/32 table rt1] exit_code=0
Aug 23 20:32:45 vpc-link-lfwd-us-central1-a-px22 startup-script[1468]: INFO startup-script: Fri Aug 23 20:32:45 +0000 2019 Debug [1514]: BEGIN: stdlib::cmd() command=[ip rule add to 10.0.3.55/32 table rt1]
Aug 23 20:32:45 vpc-link-lfwd-us-central1-a-px22 startup-script[1468]: INFO startup-script: Fri Aug 23 20:32:45 +0000 2019 Debug [1514]: END: stdlib::cmd() command=[ip rule add to 10.0.3.55/32 table rt1] exit_code=0
Aug 23 20:32:45 vpc-link-lfwd-us-central1-a-px22 startup-script[1468]: INFO startup-script: Fri Aug 23 20:32:45 +0000 2019 Debug [1514]: BEGIN: stdlib::cmd() command=[ip rule add to 10.19.16.0/20 table rt1]
Aug 23 20:32:45 vpc-link-lfwd-us-central1-a-px22 startup-script[1468]: INFO startup-script: Fri Aug 23 20:32:45 +0000 2019 Debug [1514]: END: stdlib::cmd() command=[ip rule add to 10.19.16.0/20 table rt1] exit_code=0
journalctl -u google-network-daemon
-- Logs begin at Fri 2019-08-23 20:32:29 UTC, end at Fri 2019-08-23 21:29:27 UTC. --
Aug 23 20:32:43 vpc-link-lfwd-us-central1-a-px22 systemd[1]: Started Google Compute Engine Network Daemon.
Aug 23 20:32:44 vpc-link-lfwd-us-central1-a-px22 google-networking[1462]: INFO Starting Google Networking daemon.
Aug 23 20:32:44 vpc-link-lfwd-us-central1-a-px22 network-setup[1462]: INFO Disabling IPv6 on Ethernet interface: ['eth0'].
Aug 23 20:32:44 vpc-link-lfwd-us-central1-a-px22 network-setup[1462]: INFO Calling Dhclient for IPv6 configuration on the Ethernet interfaces ['eth0'].
Aug 23 20:32:44 vpc-link-lfwd-us-central1-a-px22 dhclient[1519]: Internet Systems Consortium DHCP Client 4.2.5
Aug 23 20:32:44 vpc-link-lfwd-us-central1-a-px22 dhclient[1519]: Copyright 2004-2013 Internet Systems Consortium.
Aug 23 20:32:48 vpc-link-lfwd-us-central1-a-px22 google_network_daemon[1462]: Internet Systems Consortium DHCP Client 4.2.5
Aug 23 20:32:48 vpc-link-lfwd-us-central1-a-px22 google_network_daemon[1462]: Copyright 2004-2013 Internet Systems Consortium.
Aug 23 20:32:48 vpc-link-lfwd-us-central1-a-px22 google_network_daemon[1462]: All rights reserved.
Aug 23 20:32:48 vpc-link-lfwd-us-central1-a-px22 google_network_daemon[1462]: For info, please visit https://www.isc.org/software/dhcp/
Aug 23 20:32:48 vpc-link-lfwd-us-central1-a-px22 google_network_daemon[1462]: Listening on Socket/eth0
Aug 23 20:32:48 vpc-link-lfwd-us-central1-a-px22 google_network_daemon[1462]: Sending on   Socket/eth0
Aug 23 20:32:44 vpc-link-lfwd-us-central1-a-px22 dhclient[1519]: All rights reserved.
Aug 23 20:32:44 vpc-link-lfwd-us-central1-a-px22 dhclient[1519]: For info, please visit https://www.isc.org/software/dhcp/
Aug 23 20:32:44 vpc-link-lfwd-us-central1-a-px22 dhclient[1519]: [2B blob data]
Aug 23 20:32:44 vpc-link-lfwd-us-central1-a-px22 dhclient[1519]: Listening on Socket/eth0
Aug 23 20:32:44 vpc-link-lfwd-us-central1-a-px22 dhclient[1519]: Sending on   Socket/eth0
Aug 23 20:32:49 vpc-link-lfwd-us-central1-a-px22 network-setup[1462]: WARNING Could not release IPv6 lease on interface ['eth0'].
Aug 23 20:32:49 vpc-link-lfwd-us-central1-a-px22 network-setup[1462]: INFO Ethernet interfaces: ['eth1'].
Aug 23 20:32:49 vpc-link-lfwd-us-central1-a-px22 network-setup[1462]: INFO Created config file for interface eth1.
Aug 23 20:32:49 vpc-link-lfwd-us-central1-a-px22 network-setup[1462]: INFO Enabling the Ethernet interfaces ['eth1'].
Aug 23 20:32:49 vpc-link-lfwd-us-central1-a-px22 dhclient[1699]: DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 6 (xid=0x66b0146c)
Aug 23 20:32:49 vpc-link-lfwd-us-central1-a-px22 dhclient[1699]: DHCPREQUEST on eth1 to 255.255.255.255 port 67 (xid=0x66b0146c)
Aug 23 20:32:49 vpc-link-lfwd-us-central1-a-px22 dhclient[1699]: DHCPOFFER from 169.254.169.254
Aug 23 20:32:49 vpc-link-lfwd-us-central1-a-px22 dhclient[1699]: DHCPACK from 169.254.169.254 (xid=0x66b0146c)

Next steps:

jeffmccune commented 5 years ago

/usr/lib/systemd/system/google-startup-scripts.service already specifies After=google-instance-setup.service google-network-daemon.service and yet dhclient eth1 executes in parallel with the startup-script policy routing behavior.

Next step is to trace through /usr/sbin/dhclient-script and cause policy routing to be configured from /etc/dhcp/dhclient-exit-hooks.d/policy-routing.sh. Need to identify the set of variables for when a lease is obtained.

jeffmccune commented 5 years ago

Version 0.5.0 should fix this, next steps are to test it out with a rolling update.