@kevinGC Let us know if we can help in any way, or we can jump on a call with you to show the issues live.
@kevinGC could you take a look?
Yep, I'll take a look. I suspect this has to do with how we track TCP state in long-lived connections in combination with iptables.
@Champ-Goblem Question: do the timeouts happen if the Istio sidecar is not in use? The primary difference when Istio is in use is that gVisor's iptables implementation actively routes connections via the Envoy proxy, and the connection tracking may not be doing the right thing, as @kevinGC mentioned above. If it reproduces without Istio, then it's most likely a bug elsewhere (though your script writes every 2 minutes, which means keep-alive should never kick in, since it defaults to a 2-hour idle interval before sending the first probe).
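For context (an illustrative sketch only, not code from the repro; the address and 30-second period are hypothetical): a client that enables SO_KEEPALIVE without tuning the idle time inherits the OS default of roughly 2 hours before the first probe, so a connection that writes every 2 minutes never probes at all unless the application shortens the interval itself, e.g. in Go:

```go
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	// Hypothetical MySQL service address; not taken from the repro.
	conn, err := net.Dial("tcp", "mysql.default.svc.cluster.local:3306")
	if err != nil {
		log.Fatal(err)
	}
	tcpConn := conn.(*net.TCPConn)
	defer tcpConn.Close()

	// Enable keepalive and ask for the first probe after 30 seconds of idle
	// time; a socket left at the OS default (net.ipv4.tcp_keepalive_time,
	// ~2 hours on Linux) would never probe during a 2-minute write interval.
	if err := tcpConn.SetKeepAlive(true); err != nil {
		log.Fatal(err)
	}
	if err := tcpConn.SetKeepAlivePeriod(30 * time.Second); err != nil {
		log.Fatal(err)
	}
}
```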
Another thing: would it be possible to get a pcap of all the traffic when this bug reproduces (from both the client and server side)? That would give us good insight into what is going on here.
@hbhasker the two captures are available in a storage bucket; the links can be found below. I can confirm that the disconnects happen only when the Istio sidecar is used; without it, everything seems to be okay.
Thanks again for looking into this for us.
MySQL client pcap: https://storage.googleapis.com/networking-pcaps/tcp/mysql-client.pcap
MySQL server pcap: https://storage.googleapis.com/networking-pcaps/tcp/mysql-server.pcap
@Champ-Goblem I am looking at the pcap but a few things are not making sense to me.
There seem to be a few IP addresses involved.
It looks like the client is 10.64.0.28 and the MySQL server is running on 10.60.25.244 (from the DNS response). But then the pcap is all between 10.64.0.28 and 10.64.0.26 (where 10.64.0.26 seems to be the IP of the MySQL server). Is this capture being done from inside the sandbox, or on the host?
Another strange behaviour I see is that netstack does not seem to be accepting the RSTs sent by the MySQL server in one case (as seen by the repeated retransmissions despite the incoming RSTs from the server). It's possible that with Envoy in the middle these RSTs never reach the netstack endpoint and it just keeps retransmitting, but it's hard to say.
That said, this pcap is only about 18 minutes long; was this the only failure observed?
Given that these issues only come up when using Istio, there is a high likelihood that this is a bug in our iptables / connection tracking code. That said, it's unclear to me why this specific connection did not get NATed; it should have been redirected like all the other MySQL connections, but for some reason this one was not.
No. Time Source Destination Protocol Length Info
1272 2021-10-13 14:59:45.102072 10.64.0.28 10.60.25.244 MySQL 114 Request Query
1273 2021-10-13 14:59:45.102356 10.60.25.244 10.64.0.28 TCP 54 3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
1283 2021-10-13 14:59:46.102696 10.64.0.28 10.60.25.244 TCP 114 [TCP Retransmission] 55096 → 3306 [PSH, ACK] Seq=804228880 Ack=3219989980 Win=7063 [TCP CHECKSUM INCORRECT] Len=48 TSval=3394053121 TSecr=1525322972
1284 2021-10-13 14:59:46.103015 10.60.25.244 10.64.0.28 TCP 54 3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
1285 2021-10-13 14:59:46.303796 10.64.0.28 10.60.25.244 TCP 114 [TCP Retransmission] 55096 → 3306 [PSH, ACK] Seq=804228880 Ack=3219989980 Win=7063 [TCP CHECKSUM INCORRECT] Len=48 TSval=3394053322 TSecr=1525322972
1286 2021-10-13 14:59:46.304079 10.60.25.244 10.64.0.28 TCP 54 3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
1289 2021-10-13 14:59:46.704025 10.64.0.28 10.60.25.244 TCP 114 [TCP Retransmission] 55096 → 3306 [PSH, ACK] Seq=804228880 Ack=3219989980 Win=7063 [TCP CHECKSUM INCORRECT] Len=48 TSval=3394053722 TSecr=1525322972
1290 2021-10-13 14:59:46.704297 10.60.25.244 10.64.0.28 TCP 54 3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
1309 2021-10-13 14:59:47.504480 10.64.0.28 10.60.25.244 TCP 114 [TCP Retransmission] 55096 → 3306 [PSH, ACK] Seq=804228880 Ack=3219989980 Win=7063 [TCP CHECKSUM INCORRECT] Len=48 TSval=3394054523 TSecr=1525322972
1310 2021-10-13 14:59:47.504757 10.60.25.244 10.64.0.28 TCP 54 3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
1322 2021-10-13 14:59:49.105596 10.64.0.28 10.60.25.244 TCP 114 [TCP Retransmission] 55096 → 3306 [PSH, ACK] Seq=804228880 Ack=3219989980 Win=7063 [TCP CHECKSUM INCORRECT] Len=48 TSval=3394056124 TSecr=1525322972
1323 2021-10-13 14:59:49.105863 10.60.25.244 10.64.0.28 TCP 54 3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
1344 2021-10-13 14:59:52.306558 10.64.0.28 10.60.25.244 TCP 114 [TCP Retransmission] 55096 → 3306 [PSH, ACK] Seq=804228880 Ack=3219989980 Win=7063 [TCP CHECKSUM INCORRECT] Len=48 TSval=3394059325 TSecr=1525322972
1345 2021-10-13 14:59:52.306820 10.60.25.244 10.64.0.28 TCP 54 3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
1385 2021-10-13 14:59:58.707511 10.64.0.28 10.60.25.244 TCP 114 [TCP Retransmission] 55096 → 3306 [PSH, ACK] Seq=804228880 Ack=3219989980 Win=7063 [TCP CHECKSUM INCORRECT] Len=48 TSval=3394065726 TSecr=1525322972
1386 2021-10-13 14:59:58.707757 10.60.25.244 10.64.0.28 TCP 54 3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
1457 2021-10-13 15:00:11.508128 10.64.0.28 10.60.25.244 TCP 114 [TCP Retransmission] 55096 → 3306 [PSH, ACK] Seq=804228880 Ack=3219989980 Win=7063 [TCP CHECKSUM INCORRECT] Len=48 TSval=3394078526 TSecr=1525322972
1458 2021-10-13 15:00:11.508388 10.60.25.244 10.64.0.28 TCP 54 3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
1636 2021-10-13 15:00:37.108599 10.64.0.28 10.60.25.244 TCP 114 [TCP Retransmission] 55096 → 3306 [PSH, ACK] Seq=804228880 Ack=3219989980 Win=7063 [TCP CHECKSUM INCORRECT] Len=48 TSval=3394104127 TSecr=1525322972
1637 2021-10-13 15:00:37.108872 10.60.25.244 10.64.0.28 TCP 54 3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
1978 2021-10-13 15:01:28.309714 10.64.0.28 10.60.25.244 TCP 114 [TCP Retransmission] 55096 → 3306 [PSH, ACK] Seq=804228880 Ack=3219989980 Win=7063 [TCP CHECKSUM INCORRECT] Len=48 TSval=3394155328 TSecr=1525322972
1979 2021-10-13 15:01:28.309954 10.60.25.244 10.64.0.28 TCP 54 3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
2639 2021-10-13 15:03:10.710837 10.64.0.28 10.60.25.244 TCP 114 [TCP Retransmission] 55096 → 3306 [PSH, ACK] Seq=804228880 Ack=3219989980 Win=7063 [TCP CHECKSUM INCORRECT] Len=48 TSval=3394257729 TSecr=1525322972
2640 2021-10-13 15:03:10.711132 10.60.25.244 10.64.0.28 TCP 54 3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
3421 2021-10-13 15:05:10.711449 10.64.0.28 10.60.25.244 TCP 114 [TCP Retransmission] 55096 → 3306 [PSH, ACK] Seq=804228880 Ack=3219989980 Win=7063 [TCP CHECKSUM INCORRECT] Len=48 TSval=3394377730 TSecr=1525322972
3422 2021-10-13 15:05:10.711757 10.60.25.244 10.64.0.28 TCP 54 3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
4207 2021-10-13 15:07:10.711625 10.64.0.28 10.60.25.244 TCP 114 [TCP Retransmission] 55096 → 3306 [PSH, ACK] Seq=804228880 Ack=3219989980 Win=7063 [TCP CHECKSUM INCORRECT] Len=48 TSval=3394497730 TSecr=1525322972
4208 2021-10-13 15:07:10.711966 10.60.25.244 10.64.0.28 TCP 54 3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
4995 2021-10-13 15:09:10.712737 10.64.0.28 10.60.25.244 TCP 114 [TCP Retransmission] 55096 → 3306 [PSH, ACK] Seq=804228880 Ack=3219989980 Win=7063 [TCP CHECKSUM INCORRECT] Len=48 TSval=3394617731 TSecr=1525322972
4996 2021-10-13 15:09:10.713081 10.60.25.244 10.64.0.28 TCP 54 3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
5780 2021-10-13 15:11:10.713126 10.64.0.28 10.60.25.244 TCP 114 [TCP Retransmission] 55096 → 3306 [PSH, ACK] Seq=804228880 Ack=3219989980 Win=7063 [TCP CHECKSUM INCORRECT] Len=48 TSval=3394737731 TSecr=1525322972
5781 2021-10-13 15:11:10.713427 10.60.25.244 10.64.0.28 TCP 54 3306 → 55096 [RST] Seq=3219989980 Win=0 Len=0
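For reference on the "not accepting the RSTs" point above: a receiver should only tear down a connection on an inbound RST when the RST is in-window (RFC 793; RFC 5961 additionally prefers an exact match with RCV.NXT). The sketch below is a simplified illustration of that check, not netstack's implementation, using values loosely modelled on the capture; it suggests these RSTs would be acceptable if they arrived, which supports the theory that they never reach the netstack endpoint.

```go
package main

import "fmt"

// rstAcceptable reports whether a RST carrying sequence number seq may reset a
// connection whose receive window starts at rcvNxt and spans rcvWnd bytes.
// Sequence-number wraparound is ignored here for clarity.
func rstAcceptable(seq, rcvNxt, rcvWnd uint32) bool {
	return seq >= rcvNxt && seq < rcvNxt+rcvWnd
}

func main() {
	// From the capture: the client's retransmissions carry Ack=3219989980 and
	// Win=7063, and the server's RSTs carry Seq=3219989980 -- exactly RCV.NXT,
	// so the RST is in-window and should have been accepted.
	fmt.Println(rstAcceptable(3219989980, 3219989980, 7063))
}
```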
NOTE: I opened an issue for a bug I observed in our tcp_conntrack which may also cause some of these odd behaviours.
To give an overview of the setup for this reproduction, we had a new cluster with MySQL and the client deployed on one node. The packet captures were taken from the node, in the network namespace of each pod, and all traffic on the eth0 interface was captured:
tcpdump -i eth0 -w <pcap>
This means some other traffic to the Envoy proxy, such as config (xDS) updates from the Istio control plane, will also be present in the capture.
I was unable to capture any packets with tcpdump from within the sandbox due to the error:
You don't have permission to capture on that device
(socket: Operation not permitted)
I couldn't remember which IP address corresponds to which host, so I have redone the capture for you, filtered it on the IPs of the client and MySQL, and also disabled mTLS between the two so you are able to see the data in transit. The new captures can be found at:
client: https://storage.googleapis.com/networking-pcaps/14-10-2021/tcp/mysql-client.pcap
server: https://storage.googleapis.com/networking-pcaps/14-10-2021/tcp/mysql-server.pcap
The new respective IP addresses are:
mysql: 10.64.0.6
client: 10.64.0.7
The test stops once a failure occurs, so normally it will update the first entry in the database and fail on the second entry. We stop/fail here because a real workload without error handling would crash/exit at this point. We have noticed that when it fails it pauses for a long time before the final failure message is displayed, and overall it takes around 20 minutes from start to get that final failure message. This is most likely a by-product of the issue, along with some timeout value set somewhere in the PHP MySQL client.
Thanks for these captures; they are much better. In this case, though, I am not seeing any timeouts, but I do see the client resetting the connection unexpectedly, and given the timestamps it seems highly likely to be a bug in our connection tracking.
No. Time Source Destination Protocol Length Info
40 2021-10-14 12:15:52.634887 10.64.0.6 10.64.0.7 MySQL 118 Response OK
41 2021-10-14 12:15:52.635025 10.64.0.7 10.64.0.6 TCP 66 21363 → 3306 [ACK] Seq=3249895748 Ack=2275773778 Win=904832 [TCP CHECKSUM INCORRECT] Len=0 TSval=2468438751 TSecr=1334794962
<----- connection is idle for exactly 2 minutes ----->
42 2021-10-14 12:17:52.637287 10.64.0.7 10.64.0.6 MySQL 114 Request Query
<--- forward packet to the server goes fine --->
43 2021-10-14 12:17:52.637476 10.64.0.6 10.64.0.7 TCP 66 3306 → 21363 [ACK] Seq=2275773778 Ack=3249895796 Win=524160 [TCP CHECKSUM INCORRECT] Len=0 TSval=1334914965 TSecr=2468558753
<--- this packet probably generates the RST below because the connection entry got removed, so the reverse NAT mapping was lost when this packet arrived --->
44 2021-10-14 12:17:52.637613 10.64.0.7 10.64.0.6 TCP 54 21363 → 3306 [RST] Seq=3249895796 Win=0 Len=0
<--- resulting in netstack generating a RST, as the packet probably got delivered to the wrong port --->
45 2021-10-14 12:17:53.637622 10.64.0.7 10.64.0.6 MySQL 114 [TCP Spurious Retransmission] Request Query
<--- now the client retransmits after a timeout, but the RST above has blown away the state on the server, so the server generates the RST shown below --->
46 2021-10-14 12:17:53.637831 10.64.0.6 10.64.0.7 TCP 54 3306 → 21363 [RST] Seq=2275773778 Win=0 Len=0
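The sequence above matches what you would expect if the NAT entry for this flow was reaped during the 2-minute idle gap. The sketch below is an illustration only (it is not gVisor's conntrack code; the proxy port, timeout, and field names are hypothetical): once an entry expires while idle, the reverse mapping is lost, so the next reply cannot be rewritten and ends up at a port with no matching socket, which is what would provoke the RST at packet 44.

```go
package main

import (
	"fmt"
	"time"
)

// key identifies the original (pre-NAT) flow as seen from the client side.
// A real conntrack table also indexes entries by the post-NAT (reply) tuple;
// that detail is omitted here to keep the idle-expiry point visible.
type key struct {
	clientAddr string // e.g. "10.64.0.7:21363"
	origDst    string // e.g. "10.64.0.6:3306", the address the application dialed
}

type entry struct {
	redirectedDst string    // e.g. "127.0.0.1:15001", where the REDIRECT rule sent it (hypothetical port)
	lastSeen      time.Time // refreshed on every packet that matches the entry
}

type conntrack struct {
	idleTimeout time.Duration
	flows       map[key]entry
}

// lookup returns the NAT entry for a flow, refreshing its idle timer. If the
// entry expired while the connection was idle, it is reaped and the reverse
// mapping needed to rewrite replies is gone.
func (c *conntrack) lookup(k key, now time.Time) (entry, bool) {
	e, ok := c.flows[k]
	if !ok {
		return entry{}, false
	}
	if now.Sub(e.lastSeen) > c.idleTimeout {
		delete(c.flows, k) // reaped during the idle gap
		return entry{}, false
	}
	e.lastSeen = now
	c.flows[k] = e
	return e, true
}

func main() {
	ct := &conntrack{idleTimeout: 2 * time.Minute, flows: map[key]entry{}}
	k := key{clientAddr: "10.64.0.7:21363", origDst: "10.64.0.6:3306"}
	// Entry last refreshed just over two minutes ago -- the idle gap in the capture.
	ct.flows[k] = entry{redirectedDst: "127.0.0.1:15001", lastSeen: time.Now().Add(-121 * time.Second)}

	if _, ok := ct.lookup(k, time.Now()); !ok {
		fmt.Println("NAT entry expired: the reply cannot be rewritten and is delivered to the wrong port, drawing a RST")
	}
}
```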
I am going to do some digging in our conntrack + iptables code to see if I can repro this locally.
BTW, independent of the other HTTP failure issue, has this been resolved now at HEAD?
@hbhasker yes, at the moment we believe the changes at HEAD have resolved the above issue.
Thanks for confirming. Closing this issue.
Description
We seem to be experiencing an issue where long-lived TCP connections between a client and a MySQL database fail after a period of time when both containers are running on gVisor with an istio-proxy sidecar and mTLS enabled. The connection tends to time out somewhere between 20 and 70 minutes, with failures happening at regular intervals (e.g. 21m, 41m, 61m). When the same workload is run without gVisor we don't experience any connection problems and the job completes within 45 minutes. We have spent some time debugging this issue and have looked over traces of system calls at the time of failure, taken tcpdumps, and debugged runsc with delve, but we are still unsure what could be causing the problem.
The issue seems to stem from the client side rather than the server side: running the MySQL server without gVisor still produces connection timeouts, but with MySQL running on gVisor and the client not, the job completes successfully. I think we may need someone with experience of gVisor's network stack to give some pointers on where to look for further debugging on our end, or to potentially investigate themselves. The attached zip folder should contain all of the necessary scripts and resources to set up a failing test on a new cluster, provided gVisor has already been set up. The test provided was written in PHP but has also failed in other languages.
We have also witnessed occasions where the workload doesn't fail explicitly but stops updating database entries after a period of time and just hangs on some entry. This can be observed in the current test when the logs stop on some entry for an indefinite period of time, e.g.
Shop: <number>
Steps to reproduce
1. Deploy Istio with istio/deploy.sh (requires istioctl) and make sure the target namespace is labelled istio-injection: enabled.
2. Deploy the database with db/deploy.sh (requires helm) and wait for the pod to reach 2/2 status.
3. Add some values to it with the db/insert-data.sh script and wait for the insert-data job to finish (takes a few minutes).
4. Build the app with app/build.sh, passing an argument for the image name and tag (eg build.sh "eu.gcr.io/<name>:latest") (requires docker).
5. Deploy the app with app/deploy.sh, either using our prebuilt public image or one built in the previous step.
6. Watch the php-mysql-tcp-test job and see if it fails with the error.
The PHP app opens a connection to the database and performs a read of all the entries in the db, then waits for a minute before updating each entry one by one. Between each subsequent update the wait time is roughly 2 minutes, and these waits can be controlled by the TIMEOUT environment variable (a rough sketch of this loop is shown after the links below).
A public version of the same image has been built and published to Docker Hub and is used by default when app/deploy.sh is run: https://hub.docker.com/r/deciderwill/php-sleep-gvisor-tcp-debug
The same files in the attached zip folder are also available at: https://github.com/northflank/mysql-tcp-connection-problems-gvisor
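To make the workload pattern concrete, here is a rough Go sketch of the loop described above. It is an assumption for illustration, not the actual PHP test from the zip; the DSN, table name, and column are hypothetical. The point is the shape of the traffic: one long-lived connection that reads all entries, then sits idle for roughly TIMEOUT seconds between successive updates.

```go
package main

import (
	"database/sql"
	"log"
	"os"
	"strconv"
	"time"

	_ "github.com/go-sql-driver/mysql" // MySQL driver; the DSN and table below are hypothetical
)

func main() {
	// Seconds to wait between updates, mirroring the TIMEOUT environment
	// variable described above (roughly 2 minutes by default).
	wait := 120
	if v, err := strconv.Atoi(os.Getenv("TIMEOUT")); err == nil && v > 0 {
		wait = v
	}

	db, err := sql.Open("mysql", "user:pass@tcp(mysql:3306)/app") // hypothetical DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	// Pin the pool to a single connection so the same TCP connection sits
	// idle between updates, matching the failing long-lived-connection pattern.
	db.SetMaxOpenConns(1)
	db.SetMaxIdleConns(1)
	db.SetConnMaxLifetime(0)

	// Read all entries first (hypothetical "shops" table).
	rows, err := db.Query("SELECT id FROM shops")
	if err != nil {
		log.Fatal(err)
	}
	var ids []int
	for rows.Next() {
		var id int
		if err := rows.Scan(&id); err != nil {
			log.Fatal(err)
		}
		ids = append(ids, id)
	}
	rows.Close()

	// Update each entry one by one, idling between updates; the reported
	// failure happens on one of these updates after the idle period.
	for _, id := range ids {
		time.Sleep(time.Duration(wait) * time.Second)
		if _, err := db.Exec("UPDATE shops SET updated = NOW() WHERE id = ?", id); err != nil {
			log.Fatalf("Shop: %d update failed: %v", id, err)
		}
		log.Printf("Shop: %d", id)
	}
}
```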
Environment
GKE Kubernetes versions v1.19.10-gke.1700 through to v1.20.7-gke.1800, and a release version that was tested manually also experienced the issue:
Linux 5.4.104+ #1 SMP Tue Apr 6 09:49:56 PDT 2021 x86_64 Intel(R) Xeon(R) CPU @ 2.00GHz GenuineIntel GNU/Linux
1.4.3
gvisor-issue.zip