Closed: andyzhangx closed this issue 1 week ago.
cc @CecileRobertMichon @marosset this is the reason why the CSI driver does not work on Windows nodes now.
cc @jsturtevant @devigned We had some discussions about blocking instance metadata endpoints for Windows nodes but I can't find the issues/PRs at the moment (maybe they are in the image builder repo)?
Found the PR to block this - https://github.com/kubernetes-sigs/image-builder/pull/694
We block containers' access to the wireserver here due to a CVE: https://github.com/kubernetes-sigs/image-builder/pull/719
From reading through the comments, it sounds like we want to run some of the containers as ContainerAdministrator so they can get wireserver access.
Oops, looks like we went with option 2, which blocks access to the wireserver for ContainerAdministrator users.
Update: we went with option two, adding a group. This blocks access to the wireserver for ContainerAdministrator and allows adding permissions for other users/apps.
I'm not sure how to give containers access to the wireserver without allowing all containers running as ContainerAdministrator access.
I spoke with @jsturtevant and I think the right course of action here is to run the csi-driver containers as HostProcess containers. We can run HostProcess containers as system accounts on the node, which can be part of the security group that has wireserver access, and this would not require any updates to the csi-driver binaries/logic.
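For context, a hostProcess container runs directly on the node as a specified Windows account. A minimal pod spec sketch of that approach (the pod name and image here are placeholders, not the csi-driver's actual manifest):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: csi-node-win            # illustrative name
spec:
  nodeSelector:
    kubernetes.io/os: windows
  securityContext:
    windowsOptions:
      hostProcess: true
      # node account; could be placed in the group that is allowed wireserver access
      runAsUserName: "NT AUTHORITY\\SYSTEM"
  hostNetwork: true             # required when hostProcess is true
  containers:
  - name: node-driver
    image: example.io/azuredisk-csi:windows   # placeholder image
```

Because hostProcess implies hostNetwork, the container also shares the node's network namespace, which matters for the IMDS discussion below.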
I am curious how this works on Linux. Is wireserver access blocked for all containers on linux? I noticed in the deployment files that the containers use Host networking. Does that allow access to wireserver?
All traffic on port 80 is blocked. hostNetwork-enabled containers are still able to reach the wireserver.
Why does the CSI driver need access to the wireserver? To clarify, the fix we implemented does not block IMDS (169.254.169.254); rather, we block the wireserver endpoint (168.63.129.16).
cc @weinong
Oops, sorry, I mixed up IMDS and the wireserver. I'm not sure why IMDS access is blocked here.
@jsturtevant do we need to manually create a route for IMDS endpoints for calico? I see https://github.com/kubernetes-sigs/sig-windows-tools/blob/42d4411003b94e086356f891b278d452fc8f50e8/hostprocess/flannel/flanneld/start.ps1#L28-L31 for flannel (running with host-process containers) but not for calico.
@CecileRobertMichon only Azure Disk CSI driver needs IMDS support since it needs to get zone and vm size info
I confirmed that containers in aks-engine clusters have access to IMDS. I also confirmed that containers in CAPZ clusters (running both as ContainerUser and ContainerAdministrator) do not. I'll try and figure out why.
HostProcess containers and Windows nodes in CAPZ clusters DO have access to IMDS, so this appears to be a CNI/Calico configuration issue.
/assign
And when I start a driver pod, it cannot access the api-server using the kubeconfig on the Windows node; the error is like the following:
2022-03-08T04:29:27.4405461Z stderr F I0308 04:29:27.439569 3996 azure.go:71] reading cloud config from secret kube-system/azure-cloud-provider
2022-03-08T04:29:27.4430533Z stderr F I0308 04:29:27.443053 3996 round_trippers.go:553] GET https://10.96.0.1:443/api/v1/namespaces/kube-system/secrets/azure-cloud-provider in 1 milliseconds
2022-03-08T04:29:27.4435421Z stderr F W0308 04:29:27.443542 3996 azure.go:78] InitializeCloudFromSecret: failed to get cloud config from secret kube-system/azure-cloud-provider: failed to get secret kube-system/azure-cloud-provider: Get "https://10.96.0.1:443/api/v1/namespaces/kube-system/secrets/azure-cloud-provider": dial tcp 10.96.0.1:443: connectex: A socket operation was attempted to an unreachable network.
So it's related?
I suspect this may be a different issue. Are you seeing this on Windows Server 2019 or Windows Server 2022 nodes, and which CNI/configuration are you using?
Windows nodes in CAPZ configured with calico with overlay networking (the default in CAPZ) cannot access the IMDS. I tested this with both Windows Server 2019 and Windows Server 2022. I suspect this is a limitation of overlay networking on Windows in general.
Running containers as host-process containers means the containers are on the host network which can access the IMDS endpoints.
I do want to understand why we can't access IMDS endpoints from containers with overlay networking and have asked @daschott to help investigate.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/lifecycle rotten
@marosset @daschott should we keep this issue open?
We worked around the issue by running the CSI drivers in hostProcess containers which can access metadata. I think we should still try to understand why Windows containers in aks-engine were able to access instance metadata with overlay networking and if there is a way around that.
/remove-lifecycle rotten
I wonder if this has to do with the requirement that only Node IPs are allowed to access the IMDS. https://docs.microsoft.com/en-us/azure/virtual-machines/windows/instance-metadata-service?tabs=windows#known-issues-and-faq
@marosset can you try to add destination based OutboundNAT? In Azure CNI you can add this as follows:
{
"Name": "EndpointPolicy",
"Value": {
"Type": "LoopbackDSR",
"IPAddress": "169.254.169.254"
}
},
In HNS endpoint you should see the following policy added, which you should be able to add as-is to the sdnoverlay CNI config.
{
"Destinations": [
"169.254.169.254"
],
"Type": "OutBoundNAT"
}
/lifecycle stale
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/reopen
/lifecycle frozen
@marosset: Reopened this issue.
@marosset Should we keep this issue open? I saw in #3283 that you mentioned why IMDS is not reachable on CAPZ:
I believe it is a limitation on overlay networking on Windows in general (not specific to calico)
let's keep open
We worked around this issue in https://github.com/kubernetes-sigs/azuredisk-csi-driver/pull/1200: if IMDS is not available, the driver gets the instance type from node labels, so a host-process deployment is not mandatory in this case.
I0617 13:33:09.076571 5940 utils.go:77] GRPC call: /csi.v1.Node/NodeGetInfo
I0617 13:33:09.076571 5940 utils.go:78] GRPC request: {}
W0617 13:33:30.089992 5940 nodeserver.go:382] get instance type(capz-8ken-5z262) failed with: Get "http://169.254.169.254/metadata/instance?api-version=2021-10-01&format=json": dial tcp 169.254.169.254:80: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
W0617 13:33:30.092123 5940 nodeserver.go:385] fall back to get instance type from node labels
I0617 13:33:30.096506 5940 round_trippers.go:553] GET https://10.96.0.1:443/api/v1/nodes/capz-8ken-5z262 200 OK in 4 milliseconds
I0617 13:33:30.098487 5940 nodeserver.go:431] got a matching size in getMaxDataDiskCount, VM Size: STANDARD_D4S_V3, MaxDataDiskCount: 8
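The fallback visible in the logs above is roughly: query IMDS with a short timeout and, on failure, read the well-known instance-type label that kubelet sets on the node. A minimal Go sketch of the idea (not the driver's actual code; function names are illustrative):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// instanceTypeFromIMDS queries the Azure instance metadata service.
// IMDS requires the "Metadata: true" header and is only reachable
// from the node's own network namespace.
func instanceTypeFromIMDS() (string, error) {
	client := &http.Client{Timeout: 2 * time.Second} // fail fast when IMDS is unreachable
	req, err := http.NewRequest("GET",
		"http://169.254.169.254/metadata/instance/compute/vmSize?api-version=2021-10-01&format=text", nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("Metadata", "true")
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// instanceTypeFromNodeLabels is the fallback: kubelet sets this
// well-known label on every node at registration time.
func instanceTypeFromNodeLabels(labels map[string]string) (string, bool) {
	v, ok := labels["node.kubernetes.io/instance-type"]
	return v, ok
}

// getInstanceType tries IMDS first, then falls back to the node labels.
func getInstanceType(labels map[string]string) (string, error) {
	if t, err := instanceTypeFromIMDS(); err == nil {
		return t, nil
	}
	if t, ok := instanceTypeFromNodeLabels(labels); ok {
		return t, nil
	}
	return "", fmt.Errorf("instance type not available from IMDS or node labels")
}
```

In the real driver the labels would come from a `GET` on the Node object via the Kubernetes API, as the round_trippers log line above shows.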
/priority backlog
The destination-based OutboundNAT policy suggested above could be a possible solution, and a workaround already exists.
/lifecycle stale
/lifecycle rotten
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/kind bug
What steps did you take and what happened:
What did you expect to happen:
I set up a CAPZ cluster with a Windows Server 2019 Datacenter node and installed the CSI driver on the Windows node; the CSI driver could not get instance metadata on the Windows node.
Detailed logs from CI: es-sigs_azuredisk-csi-driver/1054/pull-kubernetes-e2e-capz-azure-disk-windows/1498155530961555456/build-log.txt
Anything else you would like to add:
Environment:
- Kubernetes version (use `kubectl version`): 1.22.1
- OS (e.g. from `/etc/os-release`):