kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

DNS resolution failure in Windows Pod #100126

Closed: thxCode closed this issue 3 years ago

thxCode commented 3 years ago

What happened:

DNS resolution failure in Windows Pod.

What you expected to happen:

Cluster DNS resolution works inside the Windows Pod.

How to reproduce it (as minimally and precisely as possible):

  1. Set up an ACK cluster.
  2. Create a Windows node pool using the Windows Server 2019 Datacenter with Containers VHD.
  3. Deploy a Windows workload (Deployment) using the mcr.microsoft.com/dotnet/framework/samples:aspnetapp image.
  4. kubectl exec into the above Pod and run ping google.com; it fails with a DNS resolution error (see the sketch below).
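
A minimal sketch of step 4, assuming the Deployment carries an app=aspnetapp label (label and names are illustrative):

```powershell
# Pick a Pod from the (hypothetical) aspnetapp Deployment.
$pod = kubectl get pods -l app=aspnetapp -o jsonpath='{.items[0].metadata.name}'

# Both checks fail with a name-resolution error while the issue is present.
kubectl exec $pod -- ping google.com
kubectl exec $pod -- powershell -Command "Resolve-DnsName google.com"
```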

Anything else we need to know?:

Cluster Info
Verification
Process

The first time I hit this issue was on 10.0.17763.379; as https://github.com/Azure/AKS/issues/1029#issuecomment-519647927 notes, it was fixed in 10.0.17763.557. So this time I planned to solve it with an OS upgrade. Unfortunately, I tried UBR 1728/1757/1790 and nothing changed.

According to vfpctrl /list-nat-range /port xxx, there are no NAT RANGE rules on the affected workers. After I restarted the hns Windows service, the NAT RANGE rules were created and the Pod's DNS resolution worked.
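
For reference, the check and the workaround above look roughly like this on an affected worker (the VFP port GUID is node-specific, so treat it as a placeholder):

```powershell
# List the VFP ports on the vSwitch and note the one backing the Pod's vNIC.
vfpctrl /list-vmswitch-port

# On an affected worker this shows no NAT RANGE rules for that port.
vfpctrl /list-nat-range /port <port-guid>

# Restarting HNS recreates the NAT RANGE rules, and DNS in the existing Pod works again.
Restart-Service hns
vfpctrl /list-nat-range /port <port-guid>
```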

When I scaled up the workload to create another Pod on the same affected worker, things got a little weird: the newly created Pod still failed DNS resolution. It seemed the issue was caused by the CNI plugin, so I changed the CNI configuration to use the latest win-bridge, but again, no luck.

Finally, I caught a clue after rebuilding the CNI plugin to insert a pause between the HNSEndpoint sharing steps:

  1. After the CNI plugin creates an HNSEndpoint. (screenshot)

  2. After the CNI plugin attaches the sandbox (pause) container to the new HNSEndpoint. (screenshot)

  3. After the CNI plugin shares the new HNSEndpoint with the business container, two NAT RANGE rules have been applied. (screenshot)

  4. After a while, VFP refreshes/cleans these rules (a polling sketch follows this list). (screenshot)

  5. Restarting the hns Windows service brings these rules back. (screenshot)
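
A rough way to watch step 4 happen (the port GUID is node-specific and the vfpctrl output format is not reproduced here):

```powershell
# Poll the NAT RANGE rules on the Pod's VFP port and note when they disappear,
# some time after the business container has been started.
$port = "<vfp-port-guid>"   # taken from vfpctrl /list-vmswitch-port
while ($true) {
    Get-Date
    vfpctrl /list-nat-range /port $port
    Start-Sleep -Seconds 5
}
```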

Restarting the winnat Windows service does not help either.

Environment:

k8s-ci-robot commented 3 years ago

@thxCode: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
thxCode commented 3 years ago

/sig windows
/sig network
/kind bug

thxCode commented 3 years ago

@PatrickLang, any ideas? Thanks a lot.

thxCode commented 3 years ago

Before this, I updated these workers from 10.0.17763.1282 to 10.0.17763.1577.

jsturtevant commented 3 years ago

Are you seeing this on newer versions of k8s? We are not running tests against 1.16 anymore: https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md

@daschott in case he is aware of anything

daschott commented 3 years ago

What is the current state of the OS? I think .1577 is still a few months old; .1790 should be current.

Please also attach https://github.com/microsoft/SDN/blob/master/Kubernetes/windows/debug/collectlogs.ps1 output.
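
Fetching and running the collector on the affected node is roughly (download path is illustrative):

```powershell
# Download the SDN debug collector onto the Windows worker and run it;
# it gathers HNS/VFP/network state that can be zipped and attached here.
$dst = Join-Path $env:TEMP "collectlogs.ps1"
Invoke-WebRequest -UseBasicParsing `
    -Uri https://raw.githubusercontent.com/microsoft/SDN/master/Kubernetes/windows/debug/collectlogs.ps1 `
    -OutFile $dst
& $dst
```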

thxCode commented 3 years ago

@jsturtevant, @daschott, thanks. This issue did not appear until I upgraded the hosts from 10.0.17763.1282 to 10.0.17763.1577. Based on past experience, I tried to fix it by upgrading further; I have tried these UBRs (1728/1757/1790/1817), but the issue is still here. The current version of the host is 10.0.17763.1817. diagnose.zip is the archive of the collectlogs.ps1 output, PTAL.

thxCode commented 3 years ago

I found that the DNS resolution failure is caused by forwarding being disabled on the bridge HNSNetwork. DNS resolution works after enabling forwarding, thanks @daschott.
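
For anyone hitting the same thing, a hedged sketch of the check and fix, assuming "enabling forwarding" means enabling IP forwarding on the node's vEthernet interface backing the bridge network (interface aliases vary per node and CNI setup):

```powershell
# Show the forwarding state of the host's vEthernet interfaces.
Get-NetIPInterface |
    Where-Object { $_.InterfaceAlias -like "vEthernet*" } |
    Select-Object InterfaceAlias, AddressFamily, Forwarding

# Enable forwarding where it is disabled.
Get-NetIPInterface |
    Where-Object { $_.InterfaceAlias -like "vEthernet*" -and $_.Forwarding -eq "Disabled" } |
    Set-NetIPInterface -Forwarding Enabled
```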

jsturtevant commented 3 years ago

Was there a command to diagnose and fix?