@thxCode: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance. The triage/accepted label can be added by org members by writing /triage accepted in a comment.
/sig windows /sig network /kind bug
@PatrickLang, do you have any idea? Thanks a lot.
Before this, I updated these workers from 10.0.17763.1282 to 10.0.17763.1577.
Are you seeing this on newer versions of k8s? We are not running tests against 1.16 anymore: https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md
@daschott in case he is aware of anything
What is the current state of your OS? I think .1577 is still a few months old; .1790 should be current.
Please also attach https://github.com/microsoft/SDN/blob/master/Kubernetes/windows/debug/collectlogs.ps1 output.
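For reference, a minimal sketch of how the debug script could be fetched and run on the affected node; the destination path `C:\k\debug` is an assumption, and the script should be run from an elevated PowerShell session:

```powershell
# Create a working directory and download the SDN debug collection script
# (the local path is an assumption; adjust as needed)
New-Item -ItemType Directory -Force -Path "C:\k\debug" | Out-Null
Invoke-WebRequest -UseBasicParsing `
    -Uri "https://raw.githubusercontent.com/microsoft/SDN/master/Kubernetes/windows/debug/collectlogs.ps1" `
    -OutFile "C:\k\debug\collectlogs.ps1"

# Run the script and attach its output to this issue
C:\k\debug\collectlogs.ps1
```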
@jsturtevant, @daschott, thanks. This issue did not appear until I upgraded the host from 10.0.17763.1282 to 10.0.17763.1577. Based on past experience, I tried to fix it by upgrading further. I have tried these UBRs (1728/1757/1790/1817), but the issue is still there. The current version of the host is 10.0.17763.1817. diagnose.zip is the archive of the collectlogs.ps1 output, PTAL.
I found that the DNS resolution failure is caused by forwarding being disabled on the bridge HNSNetwork. DNS resolution works fine after enabling forwarding, thanks @daschott.
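For reference, a minimal sketch of how forwarding on the bridge network's host vNIC could be checked and enabled from PowerShell. The interface alias `vEthernet (cbr0)` is an assumption that depends on the CNI/HNSNetwork name, and this is not necessarily the exact command that was used here:

```powershell
# Inspect the forwarding state of the host vNIC backing the bridge HNSNetwork
# (the interface alias is an assumption; adjust it to the actual vEthernet adapter name)
Get-NetIPInterface -InterfaceAlias "vEthernet (cbr0)" -AddressFamily IPv4 |
    Select-Object InterfaceAlias, Forwarding

# Enable forwarding if it is reported as Disabled
Set-NetIPInterface -InterfaceAlias "vEthernet (cbr0)" -AddressFamily IPv4 -Forwarding Enabled
```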
Was there a command to diagnose and fix?
What happened:
DNS resolution failure in Windows Pod.
What you expected to happen:
Cluster DNS resolution should work.
How to reproduce it (as minimally and precisely as possible):
Run `ping google.com` inside a Windows Pod; it fails with a DNS resolution error.
Anything else we need to know?:
Cluster Info
Verification
- `ipconfig /all` inside the Pod: checked, the DNS servers and suffix search list are as expected.
- `Resolve-DnsName -Name google.com -Server <in-cluster DNS Pod/Service IP>`: checked, failed to resolve.
- `Resolve-DnsName -Name google.com -Server 8.8.8.8`: checked, resolved successfully.
- `Get-HNSEndpoint | ? IPAddress -eq <in-cluster DNS Pod IP>`: checked, both endpoints exist.
- `Get-HnsPolicyList | Where-Object { $_.Policies[0].Protocol -eq 17 } | Where-Object { $_.Policies[0].VIPs[0] -eq <in-cluster DNS Service IP> } | Select-Object -Property References`: checked, the policy refers to the above two endpoints.

Process
The first time I met this issue was on 10.0.17763.379; as https://github.com/Azure/AKS/issues/1029#issuecomment-519647927 said, it was fixed in 10.0.17763.557. So this time I planned to solve it with upgrades as well. Unfortunately, I tried UBRs 1728/1757/1790 and nothing changed.

According to `vfpctrl /list-nat-range /port xxx`, there are no NAT RANGE rules on the affected workers. So I restarted the `hns` Windows service, after which the NAT RANGE rules were created and the Pod's DNS resolution worked again (see the consolidated sketch at the end of this issue).

When I scaled up the workload to create another new Pod on the same affected worker, things got a little weird: the newly created Pod still failed DNS resolution. It seemed the issue was caused by the CNI plugin, so I changed the CNI configuration to use the latest `win-bridge`; again, no luck.

Finally, I caught a clue after rebuilding the CNI plugin to insert an interruption between the HNSEndpoint sharing steps:
- After the CNI plugin creates an HNSEndpoint.
- After the CNI plugin attaches the sandbox (pause) container to the new HNSEndpoint.
- After the CNI plugin shares the new HNSEndpoint with the business container, two NAT RANGE rules have been applied.
- After a while, the VFP refreshes/cleans these rules.
- Restarting the `hns` Windows service brings these rules back.
- Restarting the `winnat` Windows service does not help.

Environment:
- Kubernetes version (use `kubectl version`): v1.16.6
- OS (e.g: `cat /etc/os-release`): Windows Server 2019
- Kernel (e.g. `uname -a`): 10.0.17763.1637
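For reference, a consolidated sketch of the diagnostic sequence described in the Process section above. The VFP port ID is a placeholder, and restarting these services on a production node should be done with care:

```powershell
# List the NAT RANGE rules on the Pod's VFP port; on the affected workers this list was empty
vfpctrl /list-nat-range /port <vfp-port-id>

# Restarting HNS re-created the missing NAT RANGE rules in this case
Restart-Service -Name hns

# Restarting WinNAT made no difference here
Restart-Service -Name winnat
```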