Network connectivity issue with GCP

domcho commented 4 years ago

There seems to be an intermittent connectivity issue with the GCP instance hosting the web server. There is no useful information in /var/log/syslog or in the apache logs.

When the issue is present, it is not possible for a specific client IP to establish any new HTTP or SSH connections to the instance. SSH sessions that were already established before the issue remain active and functional.

Using tcpdump it appears that TCP packets from the client do arrive at the instance but don't make it up to the application (apache or sshd). While this is happening for one client, other clients are able to connect.
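
For anyone repeating the capture, a minimal sketch of a filter that limits tcpdump to one affected client on the SSH and web ports. The helper name `client_filter` and the address 203.0.113.10 are placeholders for illustration, not values from this issue; ens4 is the interface name that appears in the syslog output further down.

```shell
#!/bin/sh
# Build a pcap filter expression for a single client IP on ports 22, 80 and 443.
# client_filter is a hypothetical helper name; pass the affected client's address.
client_filter() {
    printf 'host %s and tcp and (port 22 or port 80 or port 443)' "$1"
}

# Example (needs root; 203.0.113.10 is a placeholder address):
#   sudo tcpdump -ni ens4 "$(client_filter 203.0.113.10)"
```

If the capture shows SYN packets from the client with no SYN-ACK going back, that would match the behaviour described above.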

While the issue is present, netstat shows only a small number of TCP connections:

$ sudo netstat -tupn
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 10.162.0.3:44334        169.254.169.254:80      ESTABLISHED 784/python3
tcp        0    272 10.162.0.3:22           99.252.83.159:65522     ESTABLISHED 2409/sshd: dom_chor
tcp        0      0 10.162.0.3:22           76.187.13.245:62566     ESTABLISHED 31386/sshd: rubydev
tcp        0      0 10.162.0.3:41746        10.31.144.3:3306        ESTABLISHED 6503/mysql
tcp        0      0 10.162.0.3:22           206.214.78.211:58226    ESTABLISHED 30993/sshd: rubydev
tcp        0      0 10.162.0.3:22           206.214.78.211:41146    ESTABLISHED 2266/sshd: rubydev
tcp        0      0 127.0.0.1:60720         127.0.0.1:39939         ESTABLISHED 29617/sshd: rubydev
tcp        0      0 10.162.0.3:44330        169.254.169.254:80      ESTABLISHED 811/python3
tcp        0      0 10.162.0.3:44336        169.254.169.254:80      ESTABLISHED 825/python3
tcp        0      0 127.0.0.1:39939         127.0.0.1:60720         ESTABLISHED 29747/node
tcp        0     52 10.162.0.3:22           206.214.78.211:57224    ESTABLISHED 29530/sshd: rubydev
tcp        0      0 10.162.0.3:44332        169.254.169.254:80      CLOSE_WAIT  825/python3
tcp        0      0 127.0.0.1:39939         127.0.0.1:60718         ESTABLISHED 29660/node
tcp        0      0 127.0.0.1:60718         127.0.0.1:39939         ESTABLISHED 29617/sshd: rubydev
tcp6       0   3123 10.162.0.3:443          108.162.238.118:51652   ESTABLISHED 2615/apache2
tcp6       0   3123 10.162.0.3:443          108.162.221.188:14514   ESTABLISHED 2617/apache2
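
To compare this against a healthy period, the output can be summarised by connection state. This is only a sketch; `count_tcp_states` is a hypothetical helper name, and it assumes the column layout shown above, where the sixth field is the state.

```shell
#!/bin/sh
# Summarise TCP connections by state from `netstat -tupn` output on stdin.
# Field 6 of each tcp/tcp6 row is the state (ESTABLISHED, CLOSE_WAIT, ...).
count_tcp_states() {
    awk '$1 ~ /^tcp/ { n[$6]++ } END { for (s in n) print n[s], s }' | sort -rn
}

# Example:
#   sudo netstat -tupn | count_tcp_states
```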

The issue appears to be related to the DHCP lease on the interface. According to syslog, the DHCP lease is lost every 30 minutes, and forcing a DHCP renew makes the site accessible via the hostname again.

tyler-reese commented 4 years ago

Using the server's IP address directly (http://35.203.100.211) works, at least partially. Using the hostname, my HTTP request reaches the server but never shows up in the Apache access.log. When I use the hostname, the source IP seen by the server is a Cloudflare IP; when I use the IP address, the source is my own IP address. According to syslog, the DHCP lease is lost every 30 minutes. Forcing a DHCP renew makes the site accessible via the hostname again.

tyler-reese commented 4 years ago

Apr 12 17:30:22 instance-1 systemd-networkd[487]: ens4: DHCP lease lost
Apr 12 17:30:22 instance-1 systemd-networkd[487]: ens4: IPv6 successfully enabled
Apr 12 17:30:22 instance-1 systemd-networkd[487]: ens4: DHCPv4 address 10.162.0.3/32 via 10.162.0.1
Apr 12 17:30:22 instance-1 systemd-networkd[487]: ens4: Configured
Apr 12 17:30:22 instance-1 dbus-daemon[663]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service' requested by ':1.0' (uid=101 pid=487 comm="/lib/systemd/systemd-networkd " label="unconfined")
Apr 12 17:30:22 instance-1 systemd[1]: Starting Hostname Service...
Apr 12 17:30:23 instance-1 dbus-daemon[663]: [system] Successfully activated service 'org.freedesktop.hostname1'
Apr 12 17:30:23 instance-1 systemd[1]: Started Hostname Service.
Apr 12 17:30:23 instance-1 systemd-hostnamed[3572]: Changed host name to 'instance-1.northamerica-northeast1-a.c.truckerswelcome.internal'
Apr 12 17:30:53 instance-1 systemd[1]: systemd-hostnamed.service: Succeeded.

The website goes down for me as soon as this happens. It eventually comes back after 5 to 10 minutes.
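
To check whether the lease really is lost on a fixed schedule, the timestamps of these events can be pulled out of syslog. A rough sketch, assuming the traditional "Mon DD HH:MM:SS host ..." prefix seen above; `lease_lost_times` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the timestamp of every "DHCP lease lost" event in a syslog-style file.
# Assumes the first three fields of each line are month, day and time.
lease_lost_times() {
    grep -F 'DHCP lease lost' "$1" | awk '{ print $1, $2, $3 }'
}

# Example:
#   lease_lost_times /var/log/syslog
```

Timestamps spaced about 30 minutes apart would line up with the outages reported here.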

domcho commented 4 years ago

I changed the internal IP to "static" in GCP console as per https://cloud.google.com/compute/docs/ip-addresses/reserve-static-internal-ip-address#promote-in-use-internal-address

It turns out, however, that this is just a reservation: in the guest OS the VM still uses DHCP, and the issue noted in the previous comment is still present.

I then tried configuring the guest OS with a static internal IP by editing the configuration under /etc/netplan/ and running sudo netplan apply, but I lost network connectivity completely and had to reboot the VM from the GCP console (which restored the netplan file, since it is generated by cloud-init or whatever the GCP equivalent is).

At the moment, a cron job runs every minute as a terrible hack while we try to resolve the issue properly: */1 * * * * /usr/sbin/dhclient

domcho commented 4 years ago

Most recent update:

We have an instance running Ubuntu 19 in GCP zone northamerica-northeast1-a. At times, different users would experience issues reaching the website hosted on the instance. While troubleshooting the issue, we discovered that any active connections remained active (e.g. for a user experiencing the issue, pre-existing SSH sessions being used to investigate the issue continued to work, but the user was unable to establish new SSH or HTTP connections).

Eventually it was narrowed down to a DHCP issue. The issue appeared whenever the following was observed in the log file:

Apr 12 17:30:22 instance-1 systemd-networkd[487]: ens4: DHCP lease lost
Apr 12 17:30:22 instance-1 systemd-networkd[487]: ens4: IPv6 successfully enabled
Apr 12 17:30:22 instance-1 systemd-networkd[487]: ens4: DHCPv4 address 10.162.0.3/32 via 10.162.0.1
Apr 12 17:30:22 instance-1 systemd-networkd[487]: ens4: Configured
Apr 12 17:30:22 instance-1 dbus-daemon[663]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service' requested by ':1.0' (uid=101 pid=487 comm="/lib/systemd/systemd-networkd " label="unconfined")
Apr 12 17:30:22 instance-1 systemd[1]: Starting Hostname Service...
Apr 12 17:30:23 instance-1 dbus-daemon[663]: [system] Successfully activated service 'org.freedesktop.hostname1'
Apr 12 17:30:23 instance-1 systemd[1]: Started Hostname Service.
Apr 12 17:30:23 instance-1 systemd-hostnamed[3572]: Changed host name to 'instance-1.northamerica-northeast1-a.c.truckerswelcome.internal'
Apr 12 17:30:53 instance-1 systemd[1]: systemd-hostnamed.service: Succeeded.

After 5 to 10 minutes connectivity would be restored. We were able to significantly reduce the frequency of the issue by adding a cron job that issues a DHCP request every minute, but that causes other issues and is not a viable workaround.

While the issue was present, one user connected through a VPN from different locations and provided the following additional information: every 30 minutes, the website becomes unreachable, unless you are in the Toronto, Montreal, Buffalo NY, or Iceland area. I'm sure the list covers more of the northeast, but that's all I could VPN to. Chicago, Dallas, New York, Seattle, Miami, DC, Salt Lake City...all unreachable. The requests from the failing locations go through Cloudflare and arrive at the web server (according to tcpdump), but nothing gets to Apache and the page is not served. If I try to go directly to the web server IP address, from my ISP, it works. The following tcpdump shows packets arriving at the instance, but no response packets are observed and the client clearly fails to connect.

While this issue was going on, UFW was disabled (sudo ufw disable) and the clients from TX were immediately able to connect. 30 minutes later, and while UFW was still off, the issue appeared again.
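
One way to help rule UFW in or out is to count its block entries per source address while the issue is present. This is a sketch only: it assumes UFW logging is enabled and that blocked packets are logged with the usual "[UFW BLOCK]" tag and a SRC= field. The log path in the example is an assumption (on Ubuntu it is typically /var/log/ufw.log), and `ufw_blocks_by_source` is a hypothetical helper name.

```shell
#!/bin/sh
# Count "[UFW BLOCK]" log entries per source address, most frequent first.
# $1 is the log file to scan; the path used in the example is an assumption.
ufw_blocks_by_source() {
    grep -F '[UFW BLOCK]' "$1" \
        | grep -o 'SRC=[0-9a-fA-F.:]*' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2 }'
}

# Example:
#   ufw_blocks_by_source /var/log/ufw.log
```

If the affected clients or Cloudflare addresses never show up here, that would be consistent with the issue reappearing while UFW was off.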

domcho commented 4 years ago

Moved to AWS

truckerswelcome / webapp

Network connectivity issue with GCP #32