solo-io / gloo

The Feature-rich, Kubernetes-native, Next-Generation API Gateway Built on Envoy
https://docs.solo.io/
Apache License 2.0

Kube2e Suite Test Flakes (Gateway) #9793

Open inFocus7 opened 1 month ago

inFocus7 commented 1 month ago

Which tests failed?

Umbrella for flakes seen in kube2e gateway tests.


main regression tests (gateway, v1.25.16@sha256:5da [FAIL] Kube2e: gateway tests with virtual service with a mix of valid and invalid routes on a single virtual service route prefix is invalid (selector delegation) [It] invalid route delegated via selector does not prevent updates to valid routes

• [FAILED] [4.017 seconds]
Kube2e: gateway tests with virtual service with a mix of valid and invalid routes on a single virtual service route prefix is invalid (selector delegation) [It] invalid route delegated via selector does not prevent updates to valid routes
/home/runner/work/gloo/gloo/test/kube2e/gateway/gateway_test.go:645

  Timeline >>
  STEP: the valid route should return the expected direct response @ 07/19/24 05:15:45.977
  STEP: the RT should be updated to return a direct response @ 07/19/24 05:15:46.104
  No resources found in gloo-system namespace.
  No resources found in gloo-system namespace.
  [FAILED] in [It] - /home/runner/work/gloo/gloo/test/kube2e/gateway/gateway_test.go:688 @ 07/19/24 05:15:48.666
  << Timeline

  [FAILED] Unexpected error:
      <*errors.withStack | 0xc001459008>: 
      updating kube resource good-rt:2397 (want 2397): admission webhook "gloo.gloo-system.svc" denied the request: resource incompatible with current Gloo snapshot: [Validating *v1.RouteTable failed: 1 error occurred:
        * Validating *v1.RouteTable failed: validating *v1.RouteTable name:"good-rt"  namespace:"gloo-system": 1 error occurred:
        * failed gloo validation resource reports: 2 errors occurred:
        * invalid resource gloo-system.gateway-proxy
        * upstream group not found, (Name: test, Namespace: gloo-system)

      ]
      {
          error: <*errors.withMessage | 0xc000e7d380>{
              cause: <*errors.StatusError | 0xc0012a9360>{
                  ErrStatus: {
                      TypeMeta: {Kind: "", APIVersion: ""},
                      ListMeta: {
                          SelfLink: "",
                          ResourceVersion: "",
                          Continue: "",
                          RemainingItemCount: nil,
                      },
                      Status: "Failure",
                      Message: "admission webhook \"gloo.gloo-system.svc\" denied the request: resource incompatible with current Gloo snapshot: [Validating *v1.RouteTable failed: 1 error occurred:\n\t* Validating *v1.RouteTable failed: validating *v1.RouteTable name:\"good-rt\"  namespace:\"gloo-system\": 1 error occurred:\n\t* failed gloo validation resource reports: 2 errors occurred:\n\t* invalid resource gloo-system.gateway-proxy\n\t* upstream group not found, (Name: test, Namespace: gloo-system)\n\n\n\n\n\n]",
                      Reason: "",
                      Details: {
                          Name: "good-rt",
                          Group: "gateway.solo.io",
                          Kind: "RouteTable",
                          UID: "",
                          Causes: [
                              {
                                  Type: "",
                                  Message: "Error Validating *v1.RouteTable failed: 1 error occurred:\n\t* Validating *v1.RouteTable failed: validating *v1.RouteTable name:\"good-rt\"  namespace:\"gloo-system\": 1 error occurred:\n\t* failed gloo validation resource reports: 2 errors occurred:\n\t* invalid resource gloo-system.gateway-proxy\n\t* upstream group not found, (Name: test, Namespace: gloo-system)\n\n\n\n\n\n",
                                  Field: "",
                              },
                          ],
                          RetryAfterSeconds: 0,
                      },
                      Code: 400,
                  },
              },
              msg: "updating kube resource good-rt:2397 (want 2397)",
          },
          stack: [0x434526c, 0x4345105, 0x5d358f2, 0x1f8b39d, 0x1f8a376, 0x4ec0a6a, 0x4ec1c68, 0x4ebe085, 0x5d35554, 0x6381f99, 0x6381dde, 0x4f6b40f, 0x4f8a9cc, 0x1eb62a1],
      }
  occurred
  In [It] at: /home/runner/work/gloo/gloo/test/kube2e/gateway/gateway_test.go:688 @ 07/19/24 05:15:48.666

  Full Stack Trace
    github.com/solo-io/gloo/test/kube2e/gateway_test.init.func1.6.4.3.2()
        /home/runner/work/gloo/gloo/test/kube2e/gateway/gateway_test.go:688 +0x478

main regression tests (gateway, v1.29.2@sha256:51a [FAIL] Robustness tests Updates Envoy endpoints, even if proxy is invalid works, even with deleted services [It] works, even with deleted services

• [FAILED] [80.168 seconds]
Robustness tests Updates Envoy endpoints, even if proxy is invalid works, even with deleted services [It] works, even with deleted services
/home/runner/work/gloo/gloo/test/kube2e/gateway/robustness_test.go:406

  Timeline >>
  STEP: assert we can route to svc1 @ 07/19/24 05:16:33.609
  STEP: assert we can not route to svc2 @ 07/19/24 05:16:41.232
  STEP: bounce gloo and envoy @ 07/19/24 05:16:46.584
  STEP: assert we can route to svc1 @ 07/19/24 05:17:20.833
  No resources found in gloo-system namespace.
  No resources found in gloo-system namespace.
  [FAILED] in [It] - /home/runner/work/gloo/gloo/test/kube2e/gateway/robustness_test.go:393 @ 07/19/24 05:17:53.384
  << Timeline

  [FAILED] Timed out after 30.000s.
  The function passed to Eventually failed at /home/runner/work/gloo/gloo/test/kube2e/gateway/robustness_test.go:377 with:
  Timed out after 30.000s.
  The function passed to Eventually returned the following error:
      <*errors.errorString | 0xc0016ed5e0>: 
      *   Trying 10.96.152.62...
      * TCP_NODELAY set
      * connect to 10.96.152.62 port 80 failed: Connection refused
      * Failed to connect to gateway-proxy port 80: Connection refused
      * Closing connection 0
      command terminated with exit code 7
       (exit status 7)
      {
          s: "*   Trying 10.96.152.62...\n* TCP_NODELAY set\n* connect to 10.96.152.62 port 80 failed: Connection refused\n* Failed to connect to gateway-proxy port 80: Connection refused\n* Closing connection 0\ncommand terminated with exit code 7\n (exit status 7)",
      }
  In [It] at: /home/runner/work/gloo/gloo/test/kube2e/gateway/robustness_test.go:393 @ 07/19/24 05:17:53.384

  Full Stack Trace
    github.com/solo-io/gloo/test/kube2e/gateway_test.init.func2.6.3.7()
        /home/runner/work/gloo/gloo/test/kube2e/gateway/robustness_test.go:393 +0x1ee
    github.com/solo-io/gloo/test/kube2e/gateway_test.init.func2.6.3.9()
        /home/runner/work/gloo/gloo/test/kube2e/gateway/robustness_test.go:420 +0x526
------------------------------
SSSSSSSSSSSS

Summarizing 1 Failure:
  [FAIL] Robustness tests Updates Envoy endpoints, even if proxy is invalid works, even with deleted services [It] works, even with deleted services

Initial Investigation

No response

Additional Information

inFocus7 commented 1 month ago

The first failure here (the admission webhook rejecting the update to good-rt) may be handled through the validation epic in EE.

bewebi commented 1 month ago

Encountered a different flake in the Gateway action:

[FAIL] Robustness tests Updates Envoy endpoints, even if proxy is invalid [It] works

• [FAILED] [31.495 seconds]
Robustness tests Updates Envoy endpoints, even if proxy is invalid [It] works
/home/runner/work/gloo/gloo/test/kube2e/gateway/robustness_test.go:225

  Timeline >>
  STEP: Ensure we can route to the service @ 07/22/24 13:03:11.354
  STEP: force proxy into warning state @ 07/22/24 13:03:14.83
  STEP: force an update of the service endpoints @ 07/22/24 13:03:16.969
  No resources found in gloo-system namespace.
  No resources found in gloo-system namespace.
  [FAILED] in [It] - /home/runner/work/gloo/gloo/test/kube2e/gateway/robustness_test.go:253 @ 07/22/24 13:03:42.557
  << Timeline

  [FAILED] Timed out after 20.000s.
  Expected
      <[]string | len:1, cap:1>: ["10.244.0.197"]
  not to be equivalent to
      <[]string | len:1, cap:1>: ["10.244.0.197"]
  In [It] at: /home/runner/work/gloo/gloo/test/kube2e/gateway/robustness_test.go:253 @ 07/22/24 13:03:42.557

  Full Stack Trace
    github.com/solo-io/gloo/test/kube2e/gateway_test.init.func2.6.2()
        /home/runner/work/gloo/gloo/test/kube2e/gateway/robustness_test.go:253 +0x96e
------------------------------
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS

Summarizing 1 Failure:
  [FAIL] Robustness tests Updates Envoy endpoints, even if proxy is invalid [It] works
  /home/runner/work/gloo/gloo/test/kube2e/gateway/robustness_test.go:253

Ran 1 of 40 Specs in 96.300 seconds
FAIL! -- 0 Passed | 1 Failed | 0 Pending | 39 Skipped

Seen on #9799, which contains only CLI code changes and targets 1.18.

In the test code, the failing assertion is here:

            By("force an update of the service endpoints")
            scaleDeploymentTo(resourceClientset.KubeClients(), appDeployment, 0)
            scaleDeploymentTo(resourceClientset.KubeClients(), appDeployment, 1)

            Eventually(func() []string {
                return endpointIPsForKubeService(resourceClientset.KubeClients(), appService)
            }, 20*time.Second, 1*time.Second).Should(And(
                HaveLen(len(initialEndpointIPs)),
                Not(BeEquivalentTo(initialEndpointIPs)),
            ))

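It's possible the timeout needs to be bumped, but I'd guess a race condition or some other unexpected behavior led to the flake. Note that the failure output shows the endpoint list was still ["10.244.0.197"] after the full 20 seconds.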
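One possible explanation, which is an assumption and not confirmed anywhere in this thread, is that the pod created by scaling back to 1 was assigned the same IP as the pod it replaced. In that case the IP-based check could never pass, however long the timeout. A minimal sketch of a check that compares pod identity instead of IPs follows. The endpointRef type and the endpointsReplaced and podUIDs helpers are hypothetical names for illustration and do not exist in the gloo test suite; in a real test the UID would have to come from the endpoint's target reference.

```go
package main

import (
	"fmt"
	"sort"
)

// endpointRef is a hypothetical, simplified view of one endpoint address:
// the IP that traffic is sent to and the UID of the pod backing it.
type endpointRef struct {
	IP     string
	PodUID string
}

// podUIDs returns the pod UIDs of eps in sorted order.
func podUIDs(eps []endpointRef) []string {
	uids := make([]string, 0, len(eps))
	for _, ep := range eps {
		uids = append(uids, ep.PodUID)
	}
	sort.Strings(uids)
	return uids
}

// endpointsReplaced reports whether current has the same number of endpoints
// as initial but is backed by a different set of pods. IPs are ignored, so a
// replacement pod that reuses the old IP still counts as a change.
func endpointsReplaced(initial, current []endpointRef) bool {
	if len(initial) != len(current) {
		return false
	}
	a, b := podUIDs(initial), podUIDs(current)
	for i := range a {
		if a[i] != b[i] {
			return true
		}
	}
	return false
}

func main() {
	before := []endpointRef{{IP: "10.244.0.197", PodUID: "uid-a"}}
	after := []endpointRef{{IP: "10.244.0.197", PodUID: "uid-b"}}

	fmt.Println(endpointsReplaced(before, after))  // same IP, new pod
	fmt.Println(endpointsReplaced(before, before)) // nothing changed
}
```

This keeps the length check from the original assertion and only swaps what is compared. Whether IP reuse is actually what happened in the failing run would still need to be confirmed from the cluster state.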

bewebi commented 1 month ago

Another flake, this time on a 1.15 backport, so it's possible it has already been resolved. I found one other reference to it in a closed issue, but I'm not clear whether it has been resolved in more recent versions.

[FAIL] Kube2e: gateway with subsets and upstream groups [It] routes to subsets and upstream groups

• [FAILED] [9.063 seconds]
Kube2e: gateway with subsets and upstream groups [It] routes to subsets and upstream groups
/home/runner/work/gloo/gloo/test/kube2e/gateway/gateway_test.go:1609

  [FAILED] Expected
      <string>: "blue-pod"

  to contain substring
      <string>: red-pod
  In [It] at: /home/runner/work/gloo/gloo/test/kube2e/gateway/gateway_test.go:1679 @ 07/23/24 15:02:44.795

  Full Stack Trace
    github.com/solo-io/gloo/test/kube2e/gateway_test.glob..func1.12.3()
        /home/runner/work/gloo/gloo/test/kube2e/gateway/gateway_test.go:1679 +0x662
------------------------------
SSSSSSSS

Summarizing 1 Failure:
  [FAIL] Kube2e: gateway with subsets and upstream groups [It] routes to subsets and upstream groups
  /home/runner/work/gloo/gloo/test/kube2e/gateway/gateway_test.go:1679

Ran 31 of 39 Specs in 212.647 seconds
FAIL! -- 30 Passed | 1 Failed | 0 Pending | 8 Skipped

Seen on #9806, targeting 1.15

Found in 1.15 weeklies: https://github.com/solo-io/gloo/actions/runs/10556065888/job/29240870402

sheidkamp commented 4 weeks ago

Hit a robustness test flake in 1.17

 [FAILED] in [It] - /home/runner/work/gloo/gloo/test/kube2e/gateway/robustness_test.go:393 @ 08/09/24 21:08:03.237
  << Timeline

  [FAILED] Timed out after 30.000s.
  The function passed to Eventually failed at /home/runner/work/gloo/gloo/test/kube2e/gateway/robustness_test.go:377 with:
  Timed out after 30.000s.
  The function passed to Eventually returned the following error:
      <*errors.errorString | 0xc0003c4c80>: 
      *   Trying 10.96.36.156...
      * TCP_NODELAY set
      * connect to 10.96.36.156 port 80 failed: Connection refused
      * Failed to connect to gateway-proxy port 80: Connection refused
      * Closing connection 0
      command terminated with exit code 7
       (exit status 7)
      {
          s: "*   Trying 10.96.36.156...\n* TCP_NODELAY set\n* connect to 10.96.36.156 port 80 failed: Connection refused\n* Failed to connect to gateway-proxy port 80: Connection refused\n* Closing connection 0\ncommand terminated with exit code 7\n (exit status 7)",
      }
  In [It] at: /home/runner/work/gloo/gloo/test/kube2e/gateway/robustness_test.go:393 @ 08/09/24 21:08:03.237

  Full Stack Trace
    github.com/solo-io/gloo/test/kube2e/gateway_test.init.func2.6.3.7()
        /home/runner/work/gloo/gloo/test/kube2e/gateway/robustness_test.go:393 +0x1ee
    github.com/solo-io/gloo/test/kube2e/gateway_test.init.func2.6.3.9()
        /home/runner/work/gloo/gloo/test/kube2e/gateway/robustness_test.go:420 +0x526
------------------------------