Windows containers lose every tcp connection after 4 minutes of idle

gcalenzo commented 1 year ago

Describe the bug We are experiencing timeout issues when connecting to an Oracle database from a .NET Framework application that runs in a container based on the Windows Server Core ltsc2019 image in an EKS cluster. We tested the same image on an EC2 Windows Server Core 2019 instance with Docker and did not get any connection timeouts. After deploying it to an EKS cluster with EC2 Windows Server Core 2019 instances as nodes, every connection to the database drops if it stays idle for more than 4 minutes.

This seems to be exactly the same problem described in issue #269, but that ticket was closed because the support team prioritizes k8s and containerd, which is my case: every connection made between the port assigned to the container by SNAT on the Windows host node and any TCP target (not only the Oracle database) drops after exactly 4 minutes.

To Reproduce Steps to reproduce the behavior:

Have an EKS cluster with a Windows Server Core 2019 node that routes traffic to an on-premises network through an AWS transit gateway (or a similar scenario; the requirement is that the Windows host node creates the connection between the pod and a TCP target host using SNAT);
Open a listening TCP port on a remote server with netcat;
Connect to that port from a Windows-based pod with telnet;
Keep the connection open and wait for exactly 4 minutes.
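
The steps above can also be scripted instead of using telnet by hand. What follows is only a minimal sketch, assuming Python is available in the pod and that the remote listener echoes back what it receives (a plain netcat listener does not do that by default); the function name `survives_idle` and its parameters are hypothetical and not part of the original report.

```python
import socket
import time


def survives_idle(host, port, idle_seconds=250.0, probe=b"ping\n", timeout=10.0):
    """Return True if an echo round-trip still works after idle_seconds of silence.

    Assumes the peer echoes the probe back; any reset, timeout, or clean
    close after the idle window is reported as False.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        time.sleep(idle_seconds)  # no application traffic during this window
        try:
            sock.sendall(probe)
            received = b""
            while len(received) < len(probe):
                chunk = sock.recv(len(probe) - len(received))
                if not chunk:  # peer closed the connection
                    return False
                received += chunk
            return received == probe
        except OSError:  # connection reset, broken pipe, or timeout
            return False
```

Calling it with an idle time slightly above 240 seconds (the default here is 250) should return False on an affected node, and True when the connection is not subject to the 4-minute drop.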

Expected behavior The connection doesn't drop.

Configuration:

- Edition: Windows Server Core 2019
- Base Image being used: mcr.microsoft.com/windows/servercore:ltsc2019
- EKS version: eks.6
- Kubernetes version: 1.23
- Container engine: containerd

MikeZappa87 commented 1 year ago

@gcalenzo are you able to use tcp keep alive?

gcalenzo commented 1 year ago

Hi @MikeZappa87 ,

We set the TCP keep-alive time by editing the registry value HKLM\System\CurrentControlSet\Services\Tcpip\Parameters\KeepAliveTime. We chose 60000 ms as the value, but it doesn't work.
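
For context: on Windows, KeepAliveTime only sets the default idle time for sockets on which the application has enabled keep-alive (SO_KEEPALIVE), so the registry value alone does not make a connection send probes. Whether that is what happened here is an assumption, not something verified in this thread. The sketch below shows the per-socket alternative in Python, used purely for illustration since the application in question is .NET; `enable_keepalive` is a hypothetical helper name.

```python
import socket
import sys


def enable_keepalive(sock, idle_seconds=60, interval_seconds=10):
    """Turn on TCP keep-alive for one socket with an explicit idle time."""
    # Keep-alive probes are only sent on sockets that opt in.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if sys.platform == "win32":
        # Tuple is (on/off, idle time in ms, probe interval in ms).
        sock.ioctl(
            socket.SIO_KEEPALIVE_VALS,
            (1, idle_seconds * 1000, interval_seconds * 1000),
        )
    elif hasattr(socket, "TCP_KEEPIDLE"):
        # Linux-style options, expressed in seconds.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_seconds)
    return sock
```

With an idle time well below 4 minutes, probes of this kind would normally be visible in a packet capture as TCP keep-alive segments, which makes it easier to check whether they reach the SNAT layer at all.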

MikeZappa87 commented 1 year ago

I will verify whether that is the correct setting. I am curious: did you verify that you are seeing TCP keep-alive packets with Wireshark?

gcalenzo commented 1 year ago

Hi @MikeZappa87 ,

We weren't able to see any TCP keep-alive packets in Wireshark.

Anyway, with the help of AWS Support we were able to solve the issue by adding the "ExcludedSnatCIDRs" parameter to the Start-EKSBootstrap.ps1 script of the Windows worker nodes, with the databases' CIDR block as its value.
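
When picking the CIDR value, it can help to confirm that every target address actually falls inside the blocks being excluded. The following is a small sketch using Python's standard `ipaddress` module; the helper name and the example addresses are hypothetical, and it only mirrors the matching idea, not how the node itself evaluates the setting.

```python
import ipaddress


def is_excluded_from_snat(destination, excluded_cidrs):
    """Return True if destination lies inside any of the given CIDR blocks."""
    addr = ipaddress.ip_address(destination)
    return any(addr in ipaddress.ip_network(cidr) for cidr in excluded_cidrs)


# Hypothetical values: a database subnet and two candidate target hosts.
excluded = ["10.20.0.0/16"]
print(is_excluded_from_snat("10.20.4.15", excluded))
print(is_excluded_from_snat("10.30.0.7", excluded))
```

A target that is not covered by the list would presumably still go through SNAT and keep the 4-minute idle behavior.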

microsoft / Windows-Containers

Windows containers lose every tcp connection after 4 minutes of idle #353