google / gvisor

Application Kernel for Containers
https://gvisor.dev
Apache License 2.0

Open database TCP connection failing after a period of time #6317

Closed Champ-Goblem closed 3 years ago

Champ-Goblem commented 3 years ago

Description

We are experiencing an issue where long-lived TCP connections between a client and a MySQL database fail after a period of time when both containers are running on gVisor with an istio-proxy sidecar and mTLS enabled. The connection tends to time out somewhere between 20 and 70 minutes in, with failures occurring at regular intervals (e.g. 21m, 41m, 61m). When the same workload is run without gVisor we don't experience any connection problems and the job completes within 45 minutes. We have spent some time debugging this issue: we have looked over traces of system calls at the time of failure, taken tcpdump captures, and debugged runsc with delve, but we are still unsure what is causing the problem.

The issue seems to stem from the client side rather than the server side: running the MySQL server without gVisor still produces connection timeouts, whereas with MySQL running on gVisor and the client not, the job completes successfully. We likely need someone with experience in gVisor's network stack to give us some pointers on where to look next, or to investigate directly. The attached zip folder should contain all of the scripts and resources needed to set up a failing test on a new cluster, provided gVisor has already been installed. The provided test is written in PHP, but the failure has been reproduced in other languages as well.

We have also witnessed occasions where the workload doesn't fail explicitly but stops updating database entries after a period of time and simply hangs on some entry. In the current test this shows up as the logs stopping on an entry (e.g. Shop: <number>) for an indefinite period of time.

Steps to reproduce

The PHP app opens a connection to the database, reads all of the entries in the db, waits for a minute, and then updates each entry one by one. The wait between subsequent updates is roughly 2 minutes; these waits can be controlled by the TIMEOUT environment variable.
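For reference, a minimal sketch of the same access pattern in Go (this is not the actual PHP test; the DSN, table and column names are placeholders, and the MySQL driver import is an assumption):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"os"
	"strconv"
	"time"

	_ "github.com/go-sql-driver/mysql" // assumed MySQL driver
)

func main() {
	// Placeholder DSN; the real test connects through the istio-proxy sidecar.
	db, err := sql.Open("mysql", "user:pass@tcp(mysql:3306)/shopdb")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Wait between updates, controlled by TIMEOUT (seconds); defaults to ~2 minutes.
	wait := 120 * time.Second
	if v, err := strconv.Atoi(os.Getenv("TIMEOUT")); err == nil {
		wait = time.Duration(v) * time.Second
	}

	// Read all entries up front over a long-lived connection.
	rows, err := db.Query("SELECT id FROM shops")
	if err != nil {
		log.Fatal(err)
	}
	var ids []int
	for rows.Next() {
		var id int
		if err := rows.Scan(&id); err != nil {
			log.Fatal(err)
		}
		ids = append(ids, id)
	}
	rows.Close()

	time.Sleep(time.Minute) // initial idle period

	// Update each entry one by one, sleeping between updates so the
	// connection sits idle; this is where the failures are observed.
	for _, id := range ids {
		fmt.Printf("Shop: %d\n", id)
		if _, err := db.Exec("UPDATE shops SET updated_at = NOW() WHERE id = ?", id); err != nil {
			log.Fatal(err) // the test stops on the first failed update
		}
		time.Sleep(wait)
	}
}
```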

A public version of the same image has been built and published to docker hub and this is used by default when app/deploy.sh is run: https://hub.docker.com/r/deciderwill/php-sleep-gvisor-tcp-debug

The same files in the attached zip folder are also available at: https://github.com/northflank/mysql-tcp-connection-problems-gvisor

Environment

gvisor-issue.zip

Champ-Goblem commented 3 years ago

@kevinGC Let us know if we can help in any way or jump on a call with you to show the issue live.

hbhasker commented 3 years ago

@kevinGC could you take a look?

kevinGC commented 3 years ago

Yep, I'll take a look. I suspect this has to do with how we track TCP state in long-lived connections in combination with iptables.


hbhasker commented 3 years ago

@Champ-Goblem Question: do the timeouts happen if the Istio sidecar is not in use? The primary difference when Istio is in use is that gVisor's iptables implementation will actively route connections via the Envoy proxy, and the connection tracking may not be doing the right thing, as @kevinGC mentioned above. If it repros without Istio then it's most likely a bug elsewhere (though your script writes every 2 minutes, which means keep-alive should never kick in, as it defaults to a 2-hour idle interval before sending the first probe).
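As a diagnostic aside, a client can request much more aggressive keep-alive probing than the 2-hour default, which keeps traffic flowing during the idle gaps and helps test whether an idle timeout is involved. A minimal sketch in Go (a generic net.Dialer example, not part of the original reproduction; the address is a placeholder):

```go
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	// Ask the kernel (or netstack, inside gVisor) to send keep-alive probes
	// after 30 seconds of idle time instead of the 2-hour default. On Linux
	// this sets SO_KEEPALIVE plus TCP_KEEPIDLE/TCP_KEEPINTVL.
	d := net.Dialer{
		KeepAlive: 30 * time.Second,
	}
	conn, err := d.Dial("tcp", "mysql:3306") // placeholder address
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// ... use conn as usual; probes now fire during the 2-minute idle gaps.
}
```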

hbhasker commented 3 years ago

Another thing: would it be possible to get a pcap of all the traffic when this bug repros (from both the client and server side)? That would give us good insight into what is going on here.

Champ-Goblem commented 3 years ago

@hbhasker the two captures are available in a storage bucket; the links can be found below. I can confirm that the disconnects happen only when the Istio sidecar is used; without it everything seems to be okay.

Thanks again for looking into this for us.

MySQL client pcap: https://storage.googleapis.com/networking-pcaps/tcp/mysql-client.pcap
MySQL server pcap: https://storage.googleapis.com/networking-pcaps/tcp/mysql-server.pcap

hbhasker commented 3 years ago

@Champ-Goblem I am looking at the pcap but a few things are not making sense to me.

There seem to be a few IP addresses involved.

It looks like the client is 10.64.0.28 and the MySQL server is running on 10.60.25.244 (from the DNS response). But then the pcap is all between 10.64.0.28 and 10.64.0.26 (where 10.64.0.26 seems to be the IP of the MySQL server). Is this capture being done from inside the sandbox, or on the host?

Another strange behaviour I see is that netstack does not seem to be accepting the RSTs sent by the MySQL server in one case (as seen by the repeated retransmissions despite the incoming RSTs from the server). It's possible that with Envoy in the middle these RSTs never reach the netstack endpoint and it just keeps retransmitting, but it's hard to say.

That said, this pcap is only about 18 minutes long; was this the only failure observed?

Given that these issues only come up when using Istio, there is a high likelihood that this is a bug in our iptables / connection tracking code. That said, it's unclear to me why this specific connection did not get NATted: it should have been redirected like all the other MySQL connections, but this one for some reason was not.

No. Time    Source  Destination Protocol    Length  Info
1272    2021-10-13 14:59:45.102072  10.64.0.28  10.60.25.244    MySQL   114 Request Query
1273    2021-10-13 14:59:45.102356  10.60.25.244    10.64.0.28  TCP 54  3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
1283    2021-10-13 14:59:46.102696  10.64.0.28  10.60.25.244    TCP 114 [TCP Retransmission] 55096 → 3306 [PSH, ACK] Seq=804228880 Ack=3219989980 Win=7063 [TCP CHECKSUM INCORRECT] Len=48 TSval=3394053121 TSecr=1525322972
1284    2021-10-13 14:59:46.103015  10.60.25.244    10.64.0.28  TCP 54  3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
1285    2021-10-13 14:59:46.303796  10.64.0.28  10.60.25.244    TCP 114 [TCP Retransmission] 55096 → 3306 [PSH, ACK] Seq=804228880 Ack=3219989980 Win=7063 [TCP CHECKSUM INCORRECT] Len=48 TSval=3394053322 TSecr=1525322972
1286    2021-10-13 14:59:46.304079  10.60.25.244    10.64.0.28  TCP 54  3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
1289    2021-10-13 14:59:46.704025  10.64.0.28  10.60.25.244    TCP 114 [TCP Retransmission] 55096 → 3306 [PSH, ACK] Seq=804228880 Ack=3219989980 Win=7063 [TCP CHECKSUM INCORRECT] Len=48 TSval=3394053722 TSecr=1525322972
1290    2021-10-13 14:59:46.704297  10.60.25.244    10.64.0.28  TCP 54  3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
1309    2021-10-13 14:59:47.504480  10.64.0.28  10.60.25.244    TCP 114 [TCP Retransmission] 55096 → 3306 [PSH, ACK] Seq=804228880 Ack=3219989980 Win=7063 [TCP CHECKSUM INCORRECT] Len=48 TSval=3394054523 TSecr=1525322972
1310    2021-10-13 14:59:47.504757  10.60.25.244    10.64.0.28  TCP 54  3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
1322    2021-10-13 14:59:49.105596  10.64.0.28  10.60.25.244    TCP 114 [TCP Retransmission] 55096 → 3306 [PSH, ACK] Seq=804228880 Ack=3219989980 Win=7063 [TCP CHECKSUM INCORRECT] Len=48 TSval=3394056124 TSecr=1525322972
1323    2021-10-13 14:59:49.105863  10.60.25.244    10.64.0.28  TCP 54  3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
1344    2021-10-13 14:59:52.306558  10.64.0.28  10.60.25.244    TCP 114 [TCP Retransmission] 55096 → 3306 [PSH, ACK] Seq=804228880 Ack=3219989980 Win=7063 [TCP CHECKSUM INCORRECT] Len=48 TSval=3394059325 TSecr=1525322972
1345    2021-10-13 14:59:52.306820  10.60.25.244    10.64.0.28  TCP 54  3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
1385    2021-10-13 14:59:58.707511  10.64.0.28  10.60.25.244    TCP 114 [TCP Retransmission] 55096 → 3306 [PSH, ACK] Seq=804228880 Ack=3219989980 Win=7063 [TCP CHECKSUM INCORRECT] Len=48 TSval=3394065726 TSecr=1525322972
1386    2021-10-13 14:59:58.707757  10.60.25.244    10.64.0.28  TCP 54  3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
1457    2021-10-13 15:00:11.508128  10.64.0.28  10.60.25.244    TCP 114 [TCP Retransmission] 55096 → 3306 [PSH, ACK] Seq=804228880 Ack=3219989980 Win=7063 [TCP CHECKSUM INCORRECT] Len=48 TSval=3394078526 TSecr=1525322972
1458    2021-10-13 15:00:11.508388  10.60.25.244    10.64.0.28  TCP 54  3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
1636    2021-10-13 15:00:37.108599  10.64.0.28  10.60.25.244    TCP 114 [TCP Retransmission] 55096 → 3306 [PSH, ACK] Seq=804228880 Ack=3219989980 Win=7063 [TCP CHECKSUM INCORRECT] Len=48 TSval=3394104127 TSecr=1525322972
1637    2021-10-13 15:00:37.108872  10.60.25.244    10.64.0.28  TCP 54  3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
1978    2021-10-13 15:01:28.309714  10.64.0.28  10.60.25.244    TCP 114 [TCP Retransmission] 55096 → 3306 [PSH, ACK] Seq=804228880 Ack=3219989980 Win=7063 [TCP CHECKSUM INCORRECT] Len=48 TSval=3394155328 TSecr=1525322972
1979    2021-10-13 15:01:28.309954  10.60.25.244    10.64.0.28  TCP 54  3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
2639    2021-10-13 15:03:10.710837  10.64.0.28  10.60.25.244    TCP 114 [TCP Retransmission] 55096 → 3306 [PSH, ACK] Seq=804228880 Ack=3219989980 Win=7063 [TCP CHECKSUM INCORRECT] Len=48 TSval=3394257729 TSecr=1525322972
2640    2021-10-13 15:03:10.711132  10.60.25.244    10.64.0.28  TCP 54  3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
3421    2021-10-13 15:05:10.711449  10.64.0.28  10.60.25.244    TCP 114 [TCP Retransmission] 55096 → 3306 [PSH, ACK] Seq=804228880 Ack=3219989980 Win=7063 [TCP CHECKSUM INCORRECT] Len=48 TSval=3394377730 TSecr=1525322972
3422    2021-10-13 15:05:10.711757  10.60.25.244    10.64.0.28  TCP 54  3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
4207    2021-10-13 15:07:10.711625  10.64.0.28  10.60.25.244    TCP 114 [TCP Retransmission] 55096 → 3306 [PSH, ACK] Seq=804228880 Ack=3219989980 Win=7063 [TCP CHECKSUM INCORRECT] Len=48 TSval=3394497730 TSecr=1525322972
4208    2021-10-13 15:07:10.711966  10.60.25.244    10.64.0.28  TCP 54  3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
4995    2021-10-13 15:09:10.712737  10.64.0.28  10.60.25.244    TCP 114 [TCP Retransmission] 55096 → 3306 [PSH, ACK] Seq=804228880 Ack=3219989980 Win=7063 [TCP CHECKSUM INCORRECT] Len=48 TSval=3394617731 TSecr=1525322972
4996    2021-10-13 15:09:10.713081  10.60.25.244    10.64.0.28  TCP 54  3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
5780    2021-10-13 15:11:10.713126  10.64.0.28  10.60.25.244    TCP 114 [TCP Retransmission] 55096 → 3306 [PSH, ACK] Seq=804228880 Ack=3219989980 Win=7063 [TCP CHECKSUM INCORRECT] Len=48 TSval=3394737731 TSecr=1525322972
5781    2021-10-13 15:11:10.713427  10.60.25.244    10.64.0.28  TCP 54  3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
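As an aside, a minimal sketch of how the RSTs and retransmission candidates above could be pulled out of these captures programmatically with gopacket (an illustration only, assuming the github.com/google/gopacket packages; the file path is a placeholder):

```go
package main

import (
	"fmt"
	"log"

	"github.com/google/gopacket"
	"github.com/google/gopacket/layers"
	"github.com/google/gopacket/pcap"
)

func main() {
	// Open the client-side capture (placeholder path).
	handle, err := pcap.OpenOffline("mysql-client.pcap")
	if err != nil {
		log.Fatal(err)
	}
	defer handle.Close()

	src := gopacket.NewPacketSource(handle, handle.LinkType())
	for pkt := range src.Packets() {
		tcpLayer := pkt.Layer(layers.LayerTypeTCP)
		ipLayer := pkt.Layer(layers.LayerTypeIPv4)
		if tcpLayer == nil || ipLayer == nil {
			continue
		}
		tcp := tcpLayer.(*layers.TCP)
		ip := ipLayer.(*layers.IPv4)

		// Print every RST, plus any segment the client keeps pushing to the
		// MySQL port; correlating the two shows the retransmissions that
		// continue despite the incoming RSTs.
		if tcp.RST || (tcp.DstPort == 3306 && tcp.PSH) {
			fmt.Printf("%v %s:%d -> %s:%d flags[ACK=%t PSH=%t RST=%t] seq=%d\n",
				pkt.Metadata().Timestamp,
				ip.SrcIP, tcp.SrcPort, ip.DstIP, tcp.DstPort,
				tcp.ACK, tcp.PSH, tcp.RST, tcp.Seq)
		}
	}
}
```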
hbhasker commented 3 years ago

NOTE: I opened an issue for a bug I observed in our tcp_conntrack which may also cause some of these odd behaviours.

See: https://github.com/google/gvisor/issues/6734

Champ-Goblem commented 3 years ago

To give an overview of the setup for this reproduction: we had a new cluster with MySQL and the client deployed on one node. The packet captures were taken on the node, from within each pod's network namespace, capturing all traffic on the eth0 interface:

tcpdump -i eth0 -w <pcap>

This means some other traffic to the Envoy proxy, such as config (xDS) updates from the Istio control plane, will also be present in the capture.

I was unable to capture any packets with tcpdump from within the sandbox due to the error:

You don't have permission to capture on that device
(socket: Operation not permitted)

I couldn't remember which IP address corresponds to which host, so I have redone the capture for you, filtered it on the IPs of the client and MySQL, and also disabled mTLS between the two so that you can see the data in transit. The new captures can be found here:

client: https://storage.googleapis.com/networking-pcaps/14-10-2021/tcp/mysql-client.pcap
server: https://storage.googleapis.com/networking-pcaps/14-10-2021/tcp/mysql-server.pcap

The new respective IP addresses are:

mysql: 10.64.0.6

client: 10.64.0.7

A test stops once a failure occurs, so normally it will update the first entry in the database and fail on the second. We stop/fail here because a real workload without error handling would crash/exit at this point. We have noticed that when it fails it pauses for a long time before the final failure message is displayed, and overall it takes around 20 minutes from the start to get that final failure message. This is most likely a by-product of the issue combined with some timeout value set somewhere in the PHP MySQL client.

hbhasker commented 3 years ago

Thanks for these captures, these are much better. In this case I am not seeing any timeouts, but I do see the client resetting the connection unexpectedly, and given the timestamps it seems highly likely to be a bug in our connection tracking.

No. Time    Source  Destination Protocol    Length  Info
40  2021-10-14 12:15:52.634887  10.64.0.6   10.64.0.7   MySQL   118 Response OK
41  2021-10-14 12:15:52.635025  10.64.0.7   10.64.0.6   TCP 66  21363 → 3306 [ACK] Seq=3249895748 Ack=2275773778 Win=904832 [TCP CHECKSUM INCORRECT] Len=0 TSval=2468438751 TSecr=1334794962
<----- connection is idle for exactly 2 minutes ---->
42  2021-10-14 12:17:52.637287  10.64.0.7   10.64.0.6   MySQL   114 Request Query
<--- forward packet to server goes fine ---->
43  2021-10-14 12:17:52.637476  10.64.0.6   10.64.0.7   TCP 66  3306 → 21363 [ACK] Seq=2275773778 Ack=3249895796 Win=524160 [TCP CHECKSUM INCORRECT] Len=0 TSval=1334914965 TSecr=2468558753
<--- this packet probably generates the RST below because the connection entry got removed, so the reverse NAT mapping was lost when this packet arrived -->
44  2021-10-14 12:17:52.637613  10.64.0.7   10.64.0.6   TCP 54  21363 → 3306 [RST] Seq=3249895796 Win=0 Len=0
<-- resulting in netstack generating a RST as the packet probably got delivered to the wrong port -->
45  2021-10-14 12:17:53.637622  10.64.0.7   10.64.0.6   MySQL   114 [TCP Spurious Retransmission] Request Query
<-- now the client retransmits after a timeout, but the RST above has blown the state on the server, so the server generates the RST shown below -->
46  2021-10-14 12:17:53.637831  10.64.0.6   10.64.0.7   TCP 54  3306 → 21363 [RST] Seq=2275773778 Win=0 Len=0

I am going to do some digging in our conntrack + iptables code to see if I can repro this locally.
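To make the suspected failure mode concrete, here is a minimal, hypothetical sketch of a conntrack-style NAT table whose entries are reaped after an idle timeout. It is only an illustration of the hypothesis above, not gVisor's actual conntrack code; the names, the 2-minute timeout, and the Envoy redirect port are assumptions:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// tuple identifies one direction of a TCP connection.
type tuple struct {
	srcIP, dstIP     string
	srcPort, dstPort uint16
}

// natEntry records the rewritten (redirected) destination and when the
// connection last saw traffic.
type natEntry struct {
	redirected tuple
	lastSeen   time.Time
}

type conntrack struct {
	mu      sync.Mutex
	entries map[tuple]natEntry
	idleMax time.Duration // entries idle longer than this are reaped
}

// lookup refreshes and returns the mapping if it still exists.
func (ct *conntrack) lookup(t tuple) (natEntry, bool) {
	ct.mu.Lock()
	defer ct.mu.Unlock()
	e, ok := ct.entries[t]
	if ok {
		e.lastSeen = time.Now()
		ct.entries[t] = e
	}
	return e, ok
}

// reap removes idle entries. This is the hypothesized trigger: once the entry
// is gone, a packet arriving for the original tuple cannot be reverse NATted
// and ends up answered with a RST.
func (ct *conntrack) reap() {
	ct.mu.Lock()
	defer ct.mu.Unlock()
	for t, e := range ct.entries {
		if time.Since(e.lastSeen) > ct.idleMax {
			delete(ct.entries, t)
		}
	}
}

func main() {
	ct := &conntrack{
		entries: map[tuple]natEntry{},
		idleMax: 2 * time.Minute, // assumed idle timeout, for illustration only
	}
	client := tuple{"10.64.0.7", "10.64.0.6", 21363, 3306}
	ct.entries[client] = natEntry{
		redirected: tuple{"10.64.0.7", "127.0.0.1", 21363, 15001}, // assumed Envoy redirect port
		lastSeen:   time.Now().Add(-3 * time.Minute),              // connection idle for > 2 minutes
	}

	ct.reap() // idle reaping removes the mapping...
	if _, ok := ct.lookup(client); !ok {
		// ...so the next packet on the old connection has no mapping left.
		fmt.Println("no conntrack entry: packet cannot be reverse-NATted, RST is generated")
	}
}
```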

hbhasker commented 3 years ago

BTW, independent of the other HTTP failure issue, has this been resolved now at HEAD?

DeciderWill commented 3 years ago

@hbhasker yes, at the moment we believe the changes at HEAD have resolved the above issue.

hbhasker commented 3 years ago

Thanks for confirming. Closing this issue.