k8snetworkplumbingwg / multus-cni

A CNI meta-plugin for multi-homed pods in Kubernetes
Apache License 2.0

Add e2e tests to capture connectivity issues between Pods running on Control Plane nodes and the API server #1259

Open vasrem opened 3 months ago

vasrem commented 3 months ago

This test is added to show that the primary network does not always work as expected on kind control plane nodes with the thick plugin installed. I'm not sure if this is a broader issue. This was discovered as part of https://github.com/k8snetworkplumbingwg/multus-cni/pull/1078.

It may not reproduce very easily, but this is what it looks like on a fresh kind cluster.

$ cd multus-cni/e2e
$ ./get_tools.sh
$ ./generate_yamls.sh
$ ./setup_cluster.sh
$ ./test-simple-pod.sh
pod/simple-worker created
pod/simple-control-plane created
pod/simple-control-plane condition met
pod/simple-worker condition met
check eventual connectivity of simple-worker Pod to the Kubernetes API server
Connection to kubernetes (10.96.0.1) 443 port [tcp/https] succeeded!
simple-worker reached the Kubernetes API server
check eventual connectivity of simple-control-plane Pod to the Kubernetes API server
nc: getaddrinfo for host "kubernetes" port 443: Try again
command terminated with exit code 1
nc: getaddrinfo for host "kubernetes" port 443: Try again
command terminated with exit code 1
nc: getaddrinfo for host "kubernetes" port 443: Try again
command terminated with exit code 1
nc: getaddrinfo for host "kubernetes" port 443: Try again
command terminated with exit code 1
nc: getaddrinfo for host "kubernetes" port 443: Try again
command terminated with exit code 1
nc: getaddrinfo for host "kubernetes" port 443: Try again
command terminated with exit code 1
nc: getaddrinfo for host "kubernetes" port 443: Try again
command terminated with exit code 1
nc: getaddrinfo for host "kubernetes" port 443: Try again
command terminated with exit code 1
nc: getaddrinfo for host "kubernetes" port 443: Try again
command terminated with exit code 1
nc: getaddrinfo for host "kubernetes" port 443: Try again
command terminated with exit code 1
simple-control-plane couldn't connect to the Kubernetes API server
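
For context, the connectivity check in test-simple-pod.sh roughly amounts to retrying a netcat probe against the kubernetes Service from inside each pod. A minimal sketch of that logic is below; it is not the exact script, and it assumes the pod images ship nc and that 10 retries (matching the output above) are used:

check_api_connectivity() {
  pod="$1"
  echo "check eventual connectivity of $pod Pod to the Kubernetes API server"
  for i in $(seq 1 10); do
    # -z: only check that the port is open, -w 2: 2-second timeout.
    # "kubernetes" resolves to the API server Service (10.96.0.1 on a default kind cluster).
    if kubectl exec "$pod" -- nc -z -v -w 2 kubernetes 443; then
      echo "$pod reached the Kubernetes API server"
      return 0
    fi
    sleep 1
  done
  echo "$pod couldn't connect to the Kubernetes API server"
  return 1
}

check_api_connectivity simple-worker
check_api_connectivity simple-control-plane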

For the DRA PR https://github.com/k8snetworkplumbingwg/multus-cni/pull/1078, the relevant controller pod, when running on the control plane node, fails with the following error (indicating it is not just a DNS issue):

failed to list *v1alpha2.PodSchedulingContext: Get "https://10.96.0.1:443/apis/resource.k8s.io/v1alpha2/podschedulingcontexts?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: connect: no route to host
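
Since "no route to host" points at routing rather than name resolution, one way to separate the two failure modes from inside the affected pod is to probe the API server both by ClusterIP and by DNS name. This is only a sketch; 10.96.0.1 is the default kind Service ClusterIP:

# Probe by ClusterIP to bypass DNS entirely; "no route to host" here means routing is broken.
kubectl exec simple-control-plane -- nc -z -v -w 2 10.96.0.1 443
# Probe by DNS name; a "getaddrinfo" error here combined with a working IP probe would mean a DNS-only issue.
kubectl exec simple-control-plane -- nc -z -v -w 2 kubernetes.default.svc.cluster.local 443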
coveralls commented 2 months ago

Coverage Status

coverage: 62.778% (-0.3%) from 63.028% when pulling 6a15e99dd3211a5010b89e047cef48c5aac50316 on vasrem:add-test-to-capture-flaky-behavior into b6206a0dbfb95af2536035532d4f587850d999c1 on k8snetworkplumbingwg:master.

s1061123 commented 2 months ago

Hi, thank you for the PR.

I understand that your CI test fails, as shown in your GitHub Actions log, and that you want to show this is flakiness in the CI testing.

So the PR is good for showing the CI failure. However, if you intend the PR to be merged into the repository, I'm not sure why multus-cni needs this test, because our e2e suite should test multus-cni end-to-end, not Kubernetes end-to-end. We can expect Kubernetes end-to-end behavior to be covered by the Kubernetes e2e tests/CI. So let's imagine what we should do if this test fails. At least for us, there is nothing to do except let the Kubernetes/kind community know (i.e. file an issue in the kind or k8s repo, not in multus, because multus is neither Kubernetes nor kind). The error should be handled by the community that owns it, because it is outside of multus. We should take care of the component we own.

I understand that you want to push the DRA PR and you mention that the DRA CI fails for the reason above, but this PR (clarifying the issue) does not address our request (provide DRA CI tests in GitHub), because it only clarifies the error. In addition, from a CI point of view, it is not clear what we should do if this error happens. Should we skip the thick plugin CI (due to flakiness)? At least to me, we should not. We need e2e tests in CI to verify that the current code, and any new code in a PR, passes simple end-to-end testing. So what we really need is a way to verify that the following end-to-end scenarios work "FINE". "FINE" just means "PASS the test", and if a scenario is not "FINE", we expect you to fix the issue because you already know the multus code (I expect that you are a contributor, not a user, in this context).

At the last maintainers' meeting, you mentioned that the flaky error only happens with the thick plugin, but I'm wondering whether this test passes in thin plugin mode. Did you check that your current CI (in this PR) passes in thin mode? In the GitHub Actions log, kind-e2e (thin) was terminated due to the thick plugin CI failure and did not complete, so we cannot see whether the thin plugin passed the CI or not.

https://github.com/k8snetworkplumbingwg/multus-cni/actions/runs/8658088740/job/23977433044?pr=1259

vasrem commented 2 months ago

@s1061123 thanks for your reply.

I understand that you want to push https://github.com/k8snetworkplumbingwg/multus-cni/pull/1078 and you mention that the DRA CI fails for the reason above, but this PR (clarifying the issue) does not address our request (provide DRA CI tests in GitHub), because it only clarifies the error.

The DRA PR has been adjusted to bypass this error and is no longer flaky while testing both thin and thick. See https://github.com/k8snetworkplumbingwg/multus-cni/pull/1078#issuecomment-2050240124.

So the PR is good for showing the CI failure. However, if you intend the PR to be merged into the repository

My intention with this PR is to expose flakiness that already exists and is not introduced by the DRA PR. I'm happy to close this PR after the DRA PR is merged if you think it's not useful. I could open an issue instead, but I thought that proving in CI that this error exists (whether the cause is in kind or in multus) with a PR gives more value.

In addition, from a CI point of view, it is not clear what we should do if this error happens. Should we skip the thick plugin CI (due to flakiness)? At least to me, we should not. We need e2e tests in CI to verify that the current code, and any new code in a PR, passes simple end-to-end testing. So what we really need is a way to verify that the following end-to-end scenarios work "FINE". "FINE" just means "PASS the test", and if a scenario is not "FINE", we expect you to fix the issue because you already know the multus code (I expect that you are a contributor, not a user, in this context).

  • multus (thin) with DRA
  • multus (thin) without DRA
  • multus (thick) with DRA
  • multus (thick) without DRA

I agree we can take steps to merge this PR, if we believe that this test is valuable for the multus project, by doing either of these:

However, as I mentioned above, the intention of this PR is to get the DRA PR merged. There were concerns that the DRA implementation introduces bugs into the multus codebase, and I added this PR to prove that the flakiness in the e2e test does not come from the DRA PR.

I'm not sure why multus-cni needs this test, because our e2e suite should test multus-cni end-to-end, not Kubernetes end-to-end.

The error should be handled by the community that owns it, because it is outside of multus. We should take care of the component we own.

At the last maintainers' meeting, you mentioned that the flaky error only happens with the thick plugin, but I'm wondering whether this test passes in thin plugin mode. Did you check that your current CI (in this PR) passes in thin mode? In the GitHub Actions log, kind-e2e (thin) was terminated due to the thick plugin CI failure and did not complete, so we cannot see whether the thin plugin passed the CI or not.

In all of the tests I've done in my setup, I haven't seen this test failing with the thin plugin, just with the thick plugin. Therefore, I'm not sure if this is solely a kind issue. The best I can do is expose that flakiness to you so that I can build confidence for the other PR. I haven't dug deep into the setup to understand where this is coming from, and unfortunately I won't have time to dig into it.
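
For whoever does pick this up, a possible starting point is to compare routing to the Service ClusterIP between the control plane and worker nodes from inside the kind node containers. This is only a rough sketch: kind-control-plane and kind-worker are the default kind container names and 10.96.0.1 is the default kubernetes Service ClusterIP, so adjust as needed.

# Inspect routing to the Service ClusterIP from inside the kind control plane node.
docker exec kind-control-plane ip route get 10.96.0.1
# Check that kube-proxy programmed Service rules on this node (count of KUBE-SERVICES entries).
docker exec kind-control-plane iptables-save | grep -c KUBE-SERVICES
# Compare against a worker node where the probe succeeds.
docker exec kind-worker ip route get 10.96.0.1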