rancher / rke2

https://docs.rke2.io/
Apache License 2.0

Windows pods with services cannot reach outside cluster network (Most of the times), Calico #2362

Closed MattAxel closed 1 year ago

MattAxel commented 2 years ago

Pods with services cannot reach outside the cluster network. Standalone pods work fine. This happens on Windows nodes with Calico.

Environmental Info:

RKE2 Version: rke2.exe version v1.22.5+rke2r1 (ce3e572376cbb1d8157f46e2ae29d7d7834067f1) go version go1.16.10b7

Node(s) CPU architecture, OS, and Version:

Caption: Microsoft Windows Server 2022 Datacenter
CSName: SAFEPERFKUBW1
Version: 10.0.20348
BuildType: Multiprocessor Free
OSArchitecture: 64-bit

Cluster Configuration: 3 Ubuntu 20.04 servers, 2 Windows agents. Calico CNI plugin. Running on VMware vSphere.

Describe the bug: Pods on the Windows nodes can only intermittently reach external IP addresses. This only happens if a service is created for the deployment; a pod without any service works fine. The service type does not seem to matter. Most of the time, one instance of each deployment can still reach out externally. For example, when 3 pods of the same deployment are running on the same node, usually only the most recently created one is able to reach out externally. This is not always the case, but it is most of the time. A newly started pod always works until its status changes to ready. My guess is that this is when kube-proxy is updated. I am not sure what to look for in the kube-proxy logs, but I cannot find any errors. On the Linux nodes it works perfectly fine.

Steps To Reproduce: Installed using the quick start guide for RKE2 with the Calico CNI, since the cluster uses Windows agents. Exec a curl command in a pod.
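To compare several pods of the same deployment in one pass, the curl step could be scripted roughly as below. This is only a sketch under assumptions not stated in the issue: `kubectl` is on the PATH and pointed at the cluster, the container image has a `curl` binary, and the pod names, namespace, and target URL are hypothetical placeholders.

```python
import subprocess


def curl_check_command(pod, namespace="default", url="https://example.com", timeout_s=5):
    """Build a `kubectl exec` command that runs curl inside the given pod.

    The pod name, namespace, and URL are placeholders, not values from the issue.
    """
    return [
        "kubectl", "exec", pod, "-n", namespace, "--",
        # -sS: quiet but still print errors, -I: HEAD request, -m: overall timeout
        "curl", "-sS", "-I", "-m", str(timeout_s), url,
    ]


def check_pods(pods, run=subprocess.run, **kwargs):
    """Run the curl check in every pod and map pod name -> True if curl exited 0."""
    results = {}
    for pod in pods:
        proc = run(curl_check_command(pod, **kwargs), capture_output=True, text=True)
        results[pod] = proc.returncode == 0
    return results


if __name__ == "__main__":
    # Hypothetical pod names; substitute the pods of the affected deployment.
    for pod, ok in check_pods(["web-1", "web-2", "web-3"]).items():
        print(pod, "external OK" if ok else "external FAILED")
```

Running this before and after the pods become ready would show whether the failing pods line up with the pattern described above.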

Expected behavior: Pods should always be able to reach external networks

Actual behavior: Pods are not able to access networks outside the cluster if there is a service connected to the deployment (most of the time, see description).

Additional context / logs:

rosskirkpat commented 2 years ago

Would you be able to provide the rke2 server args/config file that you used?

Are you using one of the pre-configured RKE2 CIS profiles?

Are you expecting external connectivity to be available for the Windows services?

Do you have your internal DNS servers (assuming you have at least one due to VMware vSphere) configured in the coredns config map?

MattAxel commented 2 years ago

Thanks for your reply.

From the RKE2 server (/etc/rancher/rke2/config.yaml):

tls-san:
- safeperfkubl1
- safeperfkubl1.infra.local
- safeperfkubcl.infra.local
- 172.17.93.211
disable: rke2-ingress-nginx
cni:
- calico

No, I have not specified any CIS profile.

Yes, I am expecting external connectivity for the Windows services. It works fine in a few pods, but I cannot see any pattern other than that it always seems to work until a pod is set to ready. After that, it works in at most one instance of each deployment.

Resolving names does not seem to be a problem. It works fine even in pods without external connectivity.

{
    "Corefile": ".:53 {
            errors 
            health  {
                lameduck 5s
            }
            ready 
            kubernetes   cluster.local  cluster.local in-addr.arpa ip6.arpa {
                pods insecure
                fallthrough in-addr.arpa ip6.arpa
                ttl 30
            }
            prometheus   0.0.0.0:9153
            forward   . /etc/resolv.conf
            cache   30
            loop 
            reload 
            loadbalance 
        }"
}

I guess it forwards to /etc/resolv.conf and uses 127.0.0.53 from that file (systemd-resolved). I changed it to:

{
...
            forward   . 172.17.93.2
...
}

(Did not make any difference unfortunately)
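Since names resolve even in the failing pods and changing the forwarder made no difference, it may help to test DNS and outbound TCP as two separate steps. The following is a minimal sketch using only the Python standard library. It assumes a Python interpreter is available wherever it is run (for example in a debug container), which the issue does not state, and the host and port are placeholders.

```python
import socket


def probe(host, port=443, timeout_s=5.0):
    """Return (resolved_addresses, tcp_ok) for host:port.

    An empty address list means name resolution failed; a non-empty list with
    tcp_ok False means DNS worked but the TCP connection did not.
    """
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return [], False
    addrs = sorted({info[4][0] for info in infos})
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return addrs, True
    except OSError:
        # Covers timeouts, refused connections, and unreachable networks.
        return addrs, False


if __name__ == "__main__":
    # Placeholder target; use any external host the pods are expected to reach.
    addrs, ok = probe("example.com", 443)
    print("resolved:", addrs)
    print("tcp connect:", "ok" if ok else "failed")
```

A result with resolved addresses but a failed connect would match the behavior reported here and point away from CoreDNS.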

MattAxel commented 2 years ago

Created a new cluster with one control plane node and two Windows workers, one running Windows Server 2019 and one running Windows Server 2022. It worked perfectly fine on the 2019 worker, and the 2022 worker showed the same issue described above.

phillipsj commented 2 years ago

@MattAxel thanks for the update and the additional information.

caroline-suse-rancher commented 1 year ago

Closing this due to age and inactivity.