vmware-tanzu / sonobuoy

Sonobuoy is a diagnostic tool that makes it easier to understand the state of a Kubernetes cluster by running a set of Kubernetes conformance tests and other plugins in an accessible and non-destructive manner.
https://sonobuoy.io
Apache License 2.0

Unable to run Sonobuoy Plugins for windows nodegroup #1551

Closed juhis135 closed 2 years ago

juhis135 commented 2 years ago

I tried running the below commands:

sonobuoy run --plugin-env=e2e.E2E_EXTRA_ARGS='--progress-report-url=http://localhost:8099/progress --node-os-distro=windows' --plugin=win-e2e-image-repo-list-master.yml --security-context-mode=none --aggregator-node-selector="beta.kubernetes.io/os:windows"

sonobuoy run --plugin 'win-e2e-image-repo-list-master.yml' --security-context-mode=none --wait --aggregator-node-selector "beta.kubernetes.io/os:windows"

The error I am getting: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "4054eb01f3637b53bbac466e6329fee6cb30cd9aff849de7631edb12fea5dccf" network for pod "agnhost-primary-w2bd5": networkPlugin cni failed to set up pod "agnhost-primary-w2bd5_kubectl-4540" network: failed to parse Kubernetes args: pod does not have label vpc.amazonaws.com/PrivateIPv4Address

This happens for all the pods that Sonobuoy creates to run the plugins.

What did you expect to happen: The plugins should run on the Windows node group.

Anything else you would like to add: I am using EKS 1.21 with Linux and Windows nodes.

Environment:

juhis135 commented 2 years ago

Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.0", GitCommit:"ab69524f795c42094a6630298ff53f3c3ebab7f4", GitTreeState:"clean", BuildDate:"2021-12-07T18:16:20Z", GoVersion:"go1.17.3", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-eks-06eac09", GitCommit:"5f6d83fe4cb7febb5f4f4e39b3b2b64ebbbe3e97", GitTreeState:"clean", BuildDate:"2021-09-13T14:20:15Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.23) and server (1.21) exceeds the supported minor version skew of +/-1
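
The warning at the end of that output follows from simple arithmetic on the two minor versions. A minimal Python sketch of that check, assuming the minor version strings are exactly as printed above ("23" and "21+") and that the supported skew is the +/-1 stated in the warning. The helper name `minor_skew` is hypothetical, and this is not kubectl's actual implementation.

```python
import re


def minor_skew(client_minor, server_minor):
    """Absolute difference between two minor versions, ignoring suffixes such as '+'."""

    def to_int(minor):
        # Keep only the leading digits, so "21+" is read as 21.
        match = re.match(r"\d+", minor)
        if match is None:
            raise ValueError(f"no leading digits in minor version {minor!r}")
        return int(match.group())

    return abs(to_int(client_minor) - to_int(server_minor))


SUPPORTED_SKEW = 1  # taken from the warning text: "+/-1"

skew = minor_skew("23", "21+")
print(skew, skew > SUPPORTED_SKEW)
```

With the versions reported here, the skew is two minor versions, which is why the warning is printed. Nothing in this thread ties the skew to the sandbox error itself.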

johnSchnake commented 2 years ago

That sounds like a cluster issue in general; are you able to launch any pods on windows nodes via any means?

The error message doesn't have any language or keywords that indicate it is related to how Sonobuoy launches things. It only suggests that the network may be misconfigured on the cluster for the Windows nodes.

@jayunit100 have you seen errors like that before on a Windows EKS cluster?

juhis135 commented 2 years ago

I am able to deploy other pods successfully on the Windows nodes. I was also able to deploy the sample application provided by AWS (https://docs.aws.amazon.com/eks/latest/userguide/sample-deployment.html) on the Windows nodes.

It's only an issue with the pods that Sonobuoy creates on the Windows nodes while running the plugins.

Also, the Sonobuoy aggregator pod runs successfully on the Windows nodes only when I pass the argument --aggregator-node-selector=beta.kubernetes.io/os:windows. Without this flag, even the Sonobuoy pod throws the same error.

Also, I am using a mixed EKS cluster, with both Linux and Windows nodes.

johnSchnake commented 2 years ago

OK, so presumably the aggregator has a problem launching on the Linux nodes (when you specify the aggregator node selector it works fine, right?).

I'm not terribly familiar with EKS error messages but I can attempt to repro at some point.

It does just seem strange since

My guess is that you don't have any Linux worker nodes, only control plane nodes; is that right? Sonobuoy tolerates the normal master node taints, but EKS may otherwise prevent you from launching pods there. Whatever this process is, it seems to keep Sonobuoy from launching and causes it to hit the pod/label errors shown in the messages.

If that's the case and I can repro, then this should be put into the known issues or FAQ: even though Sonobuoy can run on either node type, you need to ensure that the Linux nodes are schedulable, or you have to provide the aggregator-node-selector flag as you demonstrated.

If it's not too much trouble and you happen to have a tarball from a Sonobuoy run, it would help in understanding the situation, since it contains logs and API object information.

Thanks.

juhis135 commented 2 years ago

We have 2 Linux nodes and 2 Windows nodes in the EKS cluster. Even when the Sonobuoy pod is created on a Linux node, the other pods that it creates for the Windows nodes end up throwing the same error.

I am a bit hesitant to share the whole tar file because it might contain some project info. Is there any specific file from the tar that you are looking for? I can check and provide that specific file.

johnSchnake commented 2 years ago

A few that come to mind would be:

juhis135 commented 2 years ago

Hi,

It would not be possible to share the files due to project restrictions.

Would it be possible for you to reproduce this issue on your end using EKS 1.21 with Windows nodes?

One of the other teams is also getting the same issue for windows node in EKS.

johnSchnake commented 2 years ago

I'm happy to try and repro. How are you able to get EKS with Windows nodes? When I'm adding a node group I only have Linux and Bottlerocket choices.

update: Sorry; I'm now following https://docs.aws.amazon.com/eks/latest/userguide/windows-support.html as I see it is an opt-in feature. I'll follow this and see how it goes.

johnSchnake commented 2 years ago

So:

At first, I found I left off this line from the instructions:

eksctl utils install-vpc-controllers --cluster my-cluster --approve

But then I deleted and recreated my windows node group and tried again. Same issue though.

However, I'll try and help resolve in one other way. These instructions were via the legacy windows support method. There should be another method that may not hit the same vpc issue. I'll let you know.

johnSchnake commented 2 years ago

Confirmed that following the instructions for windows support on EKS (not legacy windows support) worked. The IAM role tagged the pod with the ipv4 address as (apparently) expected:

              vpc.amazonaws.com/PrivateIPv4Address: 192.168.90.24/19

juhis135 commented 2 years ago

Thanks for reproducing the issue.

Just wanted to add: if you are using a Kubernetes version > 1.17, you need to follow the steps in the "Enabling Windows support" section of the AWS document https://docs.aws.amazon.com/eks/latest/userguide/windows-support.html

The instruction you mentioned above (eksctl utils install-vpc-controllers --cluster my-cluster --approve) is required only for Kubernetes versions older than 1.17.

But we are getting the same errors following either set of instructions.

Also, by deploying the sample application mentioned in the AWS document, I was able to run pods on the Windows nodes: https://docs.aws.amazon.com/eks/latest/userguide/sample-deployment.html

It would be a great help if you could find something to resolve this issue.

dordevd1-roche commented 2 years ago

Hi all. The main issue here is how EKS networking works, especially for the Windows workers (related issue: https://github.com/aws/containers-roadmap/issues/463). To resolve it, each pod that should be scheduled on Windows workers (each test case) needs the following node selectors:

nodeSelector:
  kubernetes.io/os: windows
  kubernetes.io/arch: amd64

By having the above node selectors, the VPC CNI controller should be able to assign the IPs properly so the pods can run.
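
As an illustration of that suggestion, here is a minimal Python sketch that adds these two node selectors to a pod manifest held as a plain dictionary. The function name `with_windows_node_selector`, the sample manifest, and the image name are hypothetical. This is not how Sonobuoy or the Kubernetes e2e tests build their pods, only a sketch of the patch being proposed.

```python
import copy

# Node selectors quoted from the comment above.
WINDOWS_NODE_SELECTOR = {
    "kubernetes.io/os": "windows",
    "kubernetes.io/arch": "amd64",
}


def with_windows_node_selector(pod_manifest):
    """Return a copy of pod_manifest whose spec.nodeSelector includes the Windows selectors.

    Existing selector entries are kept; the Windows entries win on key conflicts.
    """
    patched = copy.deepcopy(pod_manifest)
    spec = patched.setdefault("spec", {})
    selector = dict(spec.get("nodeSelector") or {})
    selector.update(WINDOWS_NODE_SELECTOR)
    spec["nodeSelector"] = selector
    return patched


# Hypothetical test pod; the image name is a placeholder, not a real image.
example_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "agnhost-primary"},
    "spec": {"containers": [{"name": "agnhost", "image": "example/agnhost"}]},
}

patched_pod = with_windows_node_selector(example_pod)
print(patched_pod["spec"]["nodeSelector"])
```

The original manifest is left untouched, so the same helper can be applied to several pod templates without side effects.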

juhis135 commented 2 years ago

@johnSchnake - Can we please try the solution suggested by Danijel in the previous comment and update the test cases with the relevant node selectors for Windows nodes?

jiechen0826 commented 2 years ago

@johnSchnake Hi, I'm experiencing the same issue here. Is the proposed solution, adding a nodeSelector section to each test pod, going to be implemented on the Sonobuoy side? Or is there a quick fix I can apply myself to run Sonobuoy on an EKS cluster with Windows nodes? Currently, the aggregator can run on the mixed-node EKS cluster, but all the tests fail with the above error "pod does not have label vpc.amazonaws.com/PrivateIPv4Address".

jayunit100 commented 1 year ago

@jiechen0826 just filed an upstream issue here https://github.com/kubernetes/kubernetes/issues/119022