kubernetes-sigs / gateway-api

Repository for the next iteration of composite service (e.g. Ingress) and load balancing APIs.
https://gateway-api.sigs.k8s.io
Apache License 2.0

Conformance tests can be flaky in some edge cases #3233

Open robscott opened 4 months ago

robscott commented 4 months ago

What would you like to be added: Changes to the conformance framework to reduce potential flakiness. Solutions might include:

  1. A configurable timeout between tests to account for this kind of flakiness (the right value depends on how long config takes to propagate in the underlying implementation)
  2. Reusing Gateways less across different tests
  3. Encouraging each test to have unique path matchers (or any other kind of matcher)

Also open to any other alternatives.

Why this is needed: As we're submitting a conformance report for GKE (#3230), we found that the simplest reproduction steps could be flaky. This is because the features we support result in a unique and somewhat problematic sequence of tests running. We go from simple-same-namespace:

https://github.com/kubernetes-sigs/gateway-api/blob/58d509872d5a58dc3ebee5980aa5b71239f46fc9/conformance/tests/httproute-simple-same-namespace.yaml#L7-L12

to weighted backends:

https://github.com/kubernetes-sigs/gateway-api/blob/58d509872d5a58dc3ebee5980aa5b71239f46fc9/conformance/tests/httproute-weight.yaml#L7-L19

Importantly, both tests use the same Gateway, matching criteria (any), and primary Service. This means that if the routing configuration for the weighted test hasn't propagated yet, it will just look like we're not traffic splitting, which results in a flaky failure.
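
The failure mode above can be simulated without a cluster. The following is a rough sketch, not code from the conformance suite: `requestFunc`, `waitForAllBackends`, and `staleThenSplit` are hypothetical helpers, and the backend names are placeholders. It models a data plane that keeps serving the previous route for a few requests, and shows one possible guard: wait until every expected backend has answered at least once before starting to measure the traffic split.

```go
package main

import (
	"errors"
	"fmt"
)

// requestFunc is a hypothetical stand-in for sending one request through
// the Gateway and returning the name of the backend that answered.
type requestFunc func() string

// waitForAllBackends sends requests until every expected backend has
// answered at least once, and returns how many requests that took.
// It gives up with an error after maxAttempts requests.
func waitForAllBackends(send requestFunc, expected []string, maxAttempts int) (int, error) {
	seen := make(map[string]bool, len(expected))
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		seen[send()] = true
		all := true
		for _, b := range expected {
			if !seen[b] {
				all = false
				break
			}
		}
		if all {
			return attempt, nil
		}
	}
	return maxAttempts, errors.New("not all backends answered; the new route may not have propagated")
}

// staleThenSplit simulates a data plane that serves the previous test's
// route (everything to backend-a) for staleFor requests, then rotates
// across all three backends.
func staleThenSplit(staleFor int) requestFunc {
	backends := []string{"backend-a", "backend-b", "backend-c"}
	n := 0
	return func() string {
		n++
		if n <= staleFor {
			return backends[0]
		}
		return backends[(n-staleFor-1)%len(backends)]
	}
}

func main() {
	expected := []string{"backend-a", "backend-b", "backend-c"}
	attempts, err := waitForAllBackends(staleThenSplit(5), expected, 50)
	fmt.Println("attempts:", attempts, "err:", err)
}
```

This guard only shows that the new route is live somewhere in the data plane. It would not help when one of the weights is 0, and it does not prove that every proxy instance has picked up the change.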

wstcliyu commented 3 months ago

It would be good to capture opinions on a long-term solution in this issue.

https://github.com/kubernetes-sigs/gateway-api/pull/3243#discussion_r1704590633

mlavacca commented 3 months ago

I think the long-term solution would be to improve isolation between tests. Every test should create its own resources and clean them up once the test finishes. A side effect of this approach is that we can control how many resources the suite requires: at the moment, there are 3-4 long-lived Gateways that are provisioned upfront, plus Gateways that are created by specific tests. This means that a cluster has to handle at least 4-5 Gateways at a time. Test isolation gives the implementation control over how many Gateways are created and provisioned concurrently.

youngnick commented 3 months ago

I don't think we ever got around to implementing it, but I remember discussing at some point that having a smaller number of created Gateways in the tests would make it possible to pre-create the Gateways and have the conformance tests check that each Gateway matched. That would allow implementations that need a long time to provision Gateways (or that want to limit the number of Gateways provisioned) to still pass conformance.

Just wanted to make sure we don't lose that context when we solve this issue - which I absolutely agree we should.

shaneutt commented 2 months ago

/triage needs-information