kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

DNS resolution failure in Windows Pod #100126

Closed: thxCode closed this issue 3 years ago

thxCode commented 3 years ago

What happened:

DNS resolution failure in Windows Pod.

What you expected to happen:

Cluster DNS resolution works inside the Windows Pod.

How to reproduce it (as minimally and precisely as possible):

  1. Set up an ACK cluster.
  2. Create a Windows node pool using the Windows Server 2019 Datacenter with Containers VHD.
  3. Deploy a Windows workload (Deployment) using the mcr.microsoft.com/dotnet/framework/samples:aspnetapp image.
  4. kubectl exec into the above Pod and run ping google.com; it fails with a DNS resolution error (see the sketch below).
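
A minimal sketch of step 4, assuming the Deployment carries an app=aspnetapp label (label and names are illustrative):

```powershell
# Pick a Pod from the (hypothetical) aspnetapp Deployment.
$pod = kubectl get pods -l app=aspnetapp -o jsonpath='{.items[0].metadata.name}'

# Both checks fail with a name-resolution error while the issue is present.
kubectl exec $pod -- ping google.com
kubectl exec $pod -- powershell -Command "Resolve-DnsName google.com"
```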

Anything else we need to know?:

Cluster Info
Verification
Process

The first time I hit this issue was on 10.0.17763.379; as https://github.com/Azure/AKS/issues/1029#issuecomment-519647927 notes, it was fixed in 10.0.17763.557. So this time I planned to solve it with an OS upgrade. Unfortunately, I tried UBR 1728/1757/1790 and nothing changed.

According to vfpctrl /list-nat-range /port xxx, there are no NAT RANGE rules on the affected workers. After I restarted the hns Windows service, the NAT RANGE rules were created and the Pod's DNS resolution worked.
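
For reference, the check and the workaround above look roughly like this on an affected worker (the VFP port GUID is node-specific, so treat it as a placeholder):

```powershell
# List the VFP ports on the vSwitch and note the one backing the Pod's vNIC.
vfpctrl /list-vmswitch-port

# On an affected worker this shows no NAT RANGE rules for that port.
vfpctrl /list-nat-range /port <port-guid>

# Restarting HNS recreates the NAT RANGE rules, and DNS in the existing Pod works again.
Restart-Service hns
vfpctrl /list-nat-range /port <port-guid>
```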

When I scaled up the workload to create another Pod on the same affected worker, things got a little weird: the newly created Pod still failed DNS resolution. It seemed the issue was caused by the CNI plugin, so I changed the CNI configuration to use the latest win-bridge, but again, no luck.

Finally, I caught a clue after rebuilding the CNI plugin to insert a pause between the HNSEndpoint sharing steps:

  1. After the CNI plugin creates an HNSEndpoint. (screenshot)

  2. After the CNI plugin attaches the sandbox (pause) container to the new HNSEndpoint. (screenshot)

  3. After the CNI plugin shares the new HNSEndpoint with the business container, two NAT RANGE rules have been applied. (screenshot)

  4. After a while, VFP refreshes/cleans these rules (a polling sketch follows this list). (screenshot)

  5. Restarting the hns Windows service brings these rules back. (screenshot)
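
A rough way to watch step 4 happen (the port GUID is node-specific and the vfpctrl output format is not reproduced here):

```powershell
# Poll the NAT RANGE rules on the Pod's VFP port and note when they disappear,
# some time after the business container has been started.
$port = "<vfp-port-guid>"   # taken from vfpctrl /list-vmswitch-port
while ($true) {
    Get-Date
    vfpctrl /list-nat-range /port $port
    Start-Sleep -Seconds 5
}
```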

Restarting the winnat Windows service does not help either.

Environment:

k8s-ci-robot commented 3 years ago

@thxCode: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
thxCode commented 3 years ago

/sig windows
/sig network
/kind bug

thxCode commented 3 years ago

@PatrickLang, any ideas? Thanks a lot.

thxCode commented 3 years ago

Before this, I updated these workers from 10.0.17763.1282 to 10.0.17763.1577.

jsturtevant commented 3 years ago

Are you seeing this on newer versions of k8s? We are not running tests against 1.16 anymore: https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md

@daschott in case he is aware of anything

daschott commented 3 years ago

What is the current state of the OS? I think .1577 is still a few months old; .1790 should be current.

Please also attach https://github.com/microsoft/SDN/blob/master/Kubernetes/windows/debug/collectlogs.ps1 output.
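
Fetching and running the collector on the affected node is roughly (download path is illustrative):

```powershell
# Download the SDN debug collector onto the Windows worker and run it;
# it gathers HNS/VFP/network state that can be zipped and attached here.
$dst = Join-Path $env:TEMP "collectlogs.ps1"
Invoke-WebRequest -UseBasicParsing `
    -Uri https://raw.githubusercontent.com/microsoft/SDN/master/Kubernetes/windows/debug/collectlogs.ps1 `
    -OutFile $dst
& $dst
```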

thxCode commented 3 years ago

@jsturtevant, @daschott, thanks. This issue did not appear until I upgraded the hosts from 10.0.17763.1282 to 10.0.17763.1577. Based on past experience, I tried to fix it by upgrading further; I have tried these UBRs (1728/1757/1790/1817), but the issue is still here. The current version of the host is 10.0.17763.1817. diagnose.zip is the archive of the collectlogs.ps1 output, PTAL.

thxCode commented 3 years ago

I found that the DNS resolution failure is caused by forwarding being disabled on the bridge HNSNetwork. DNS resolution works after enabling forwarding, thanks @daschott.
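
For anyone hitting the same thing, a hedged sketch of the check and fix, assuming "enabling forwarding" means enabling IP forwarding on the node's vEthernet interface backing the bridge network (interface aliases vary per node and CNI setup):

```powershell
# Show the forwarding state of the host's vEthernet interfaces.
Get-NetIPInterface |
    Where-Object { $_.InterfaceAlias -like "vEthernet*" } |
    Select-Object InterfaceAlias, AddressFamily, Forwarding

# Enable forwarding where it is disabled.
Get-NetIPInterface |
    Where-Object { $_.InterfaceAlias -like "vEthernet*" -and $_.Forwarding -eq "Disabled" } |
    Set-NetIPInterface -Forwarding Enabled
```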

jsturtevant commented 3 years ago

Was there a command to diagnose and fix?