projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

Windows CNI broken after latest EKS image update #9043

Closed davidgiga1993 closed 2 days ago

davidgiga1993 commented 1 month ago

We're using EKS on AWS with Calico VXLAN.

After updating the node image from ami-05b4e05d429e7759b (Windows_Server-2022-English-Core-EKS_Optimized-1.29-2024.06.17) to ami-0f11d4c28a09d26d2 (Windows_Server-2022-English-Core-EKS_Optimized-1.29-2024.07.10), it is no longer possible to reach any IP:

new-object System.Net.Sockets.TcpClient("172.20.0.1", 443)
new-object : Exception calling ".ctor" with "2" argument(s): "An attempt was made to access a socket in a way forbidden by its access permissions 172.20.0.1:443"

Expected Behavior

The IPs should be reachable

Current Behavior

No IPs are reachable from inside the container. On the node itself (and in host containers), network communication works fine.

Possible Solution

Steps to Reproduce (for bugs)

  1. Deploy EKS with the latest Windows AMI
  2. Deploy Calico
  3. Deploy a dummy pod
  4. Observe that communication from the pod isn't working

Context

Downgrading the AMI resolves the issue, so I suspect it's somehow related to CVE-2024-5321, as this was (according to Amazon) the only change in this image. Maybe related to #9019

Your Environment

coutinhop commented 1 month ago

@davidgiga1993 can you confirm if you have Windows patch KB5040437 installed? There's currently a known issue going on with this Windows update (not caused by Calico): https://github.com/microsoft/Windows-Containers/issues/516 https://github.com/kubernetes/test-infra/pull/33042

davidgiga1993 commented 1 month ago

Yes, I can confirm that KB5040437 is installed. I just hope it's related to the Windows update issue, as my error message differs from the ones reported by others.

avin3sh commented 1 month ago

Do you want to share the behavior observed with your pods in the Windows-Containers issue linked above, so that folks at Microsoft are aware of the various ways this is affecting pod behavior (and that it is widespread)?

JamesKehr commented 1 week ago

@davidgiga1993 Please follow these steps and let me know if it resolves the issue with the July or August update installed.

  1. Open regedit (Registry Editor).
  2. Go to: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\hns\State
  3. Add or update the following value to the State key:

     Name: FwPerfImprovementChange
     Type: DWORD
     Value: 0

  4. Reboot [required].
  5. Test
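
For nodes where clicking through regedit is impractical, the registry step above can be sketched in Python using the standard-library `winreg` module. This is a minimal sketch, not a tested fix: the key path and value name are taken from the steps above, while the function name `set_hns_dword` is hypothetical. It has to run elevated on the Windows node, and the reboot in step 4 is still needed afterwards.

```python
import sys

# Key path and value name as given in the steps above (relative to HKEY_LOCAL_MACHINE).
HNS_STATE_SUBKEY = r"SYSTEM\CurrentControlSet\Services\hns\State"
VALUE_NAME = "FwPerfImprovementChange"


def set_hns_dword(value=0, subkey=HNS_STATE_SUBKEY, name=VALUE_NAME):
    """Add or update a DWORD under HKLM\\<subkey>. Hypothetical helper.

    Returns True if the value was written, False when not running on Windows.
    """
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("a DWORD must fit in 32 bits")
    if sys.platform != "win32":
        # winreg only exists on Windows; do nothing elsewhere.
        return False

    import winreg

    # CreateKeyEx opens the key, creating it if it does not exist yet.
    with winreg.CreateKeyEx(
        winreg.HKEY_LOCAL_MACHINE, subkey, 0, winreg.KEY_SET_VALUE
    ) as key:
        winreg.SetValueEx(key, name, 0, winreg.REG_DWORD, value)
    return True


if __name__ == "__main__":
    written = set_hns_dword(0)
    print("value written" if written else "not on Windows, nothing changed")
```

The platform check only keeps the script importable on other systems; it does not verify that the Windows update in question is installed.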

davidgiga1993 commented 1 week ago

I'll try on Monday. However, I'm not sure I can actually reboot the machine, as the autoscaling group will detect the node as dead and terminate it. I'll try anyway; maybe I can set the value during boot.

avin3sh commented 1 week ago

@JamesKehr you might want to share this in https://github.com/microsoft/Windows-Containers/issues/516; there are a lot more folks subscribed there with different configurations

ilueckel commented 1 week ago

I'm not on AWS/EKS, but on self-hosted Rancher + rke2. The registry change suggested by @JamesKehr worked for me.

JamesKehr commented 1 week ago

@avin3sh Done! Thanks for the tip!

@ilueckel Thank you for the confirmation!

@davidgiga1993 You will likely need to work with AWS support to make that change. That registry value is read when the HNS service starts.

You can try, but no guarantees, to set the reg value, stop all the k8s/Calico containers and services, restart the Host Networking Service (HNS) in Windows, and then fire everything back up. Assuming you have that level of control over the node, that might work.
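
The sequence described above (set the value, stop the workloads, restart HNS, start everything again) could be sketched as an ordered command plan. This is an untested sketch under assumptions: the service names `kubelet` and `kube-proxy` are placeholders and may differ on your node (check `Get-Service`), stopping containers is not covered, and `build_plan` / `run_plan` are hypothetical helpers. By default the script only prints the commands.

```python
import subprocess

# Assumed service names, listed in start order; verify with Get-Service on the node.
WORKLOAD_SERVICES = ["kubelet", "kube-proxy"]

# Sets the registry value from the steps earlier in this thread.
REG_ADD = (
    r"reg add HKLM\SYSTEM\CurrentControlSet\Services\hns\State "
    "/v FwPerfImprovementChange /t REG_DWORD /d 0 /f"
)


def build_plan(services=WORKLOAD_SERVICES):
    """Return the commands in the order they should run."""
    plan = [REG_ADD]
    # Stop the workload services in reverse start order.
    plan += [f"Stop-Service -Name {name} -Force" for name in reversed(services)]
    # HNS reads the registry value when it starts, so restart it.
    plan.append("Restart-Service -Name hns -Force")
    # Bring the workload services back in start order.
    plan += [f"Start-Service -Name {name}" for name in services]
    return plan


def run_plan(plan, dry_run=True):
    """Print each command; execute it through PowerShell unless dry_run is set."""
    for command in plan:
        print(command)
        if not dry_run:
            subprocess.run(
                ["powershell", "-NoProfile", "-Command", command], check=True
            )


if __name__ == "__main__":
    run_plan(build_plan(), dry_run=True)
```

Keeping the plan as plain data makes it easy to review the exact commands before running anything on a production node.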

Please let me know either way.

Argannor commented 2 days ago

@JamesKehr Following your comment with the restart-less fix (in the Windows-Containers issue), we were able to apply the hotfix on EKS with Calico, and networking works again.

(I'm working together with @davidgiga1993)

coutinhop commented 2 days ago

@JamesKehr @Argannor @davidgiga1993 thanks for the fix and the updates, closing this now.