Open robscott opened 4 months ago
It would be good to capture the opinion on a long-term solution in this issue:
https://github.com/kubernetes-sigs/gateway-api/pull/3243#discussion_r1704590633
I think the long-term solution would be to improve isolation between tests. Every test should create its own resources and clean them up once the test finishes. A side effect of this approach would be that we can control how many resources the suite requires. At the moment, there are 3-4 long-lived gateways that are provisioned upfront, plus gateways that are created by specific tests. This means that a cluster has to handle at least 4-5 gateways at a time. Providing test isolation gives the implementation control over the number of Gateways that are created and provisioned concurrently.
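A minimal sketch of what per-test isolation could look like, to make the idea concrete. Everything here (`Cluster`, `Scope`, `CreateGateway`, `RunIsolated`) is hypothetical and is not part of the conformance suite; the `Cluster` type is an in-memory stand-in that only counts Gateways so the effect on concurrency is visible.

```go
package main

import "fmt"

// Cluster is an in-memory stand-in for a real cluster (an assumption for
// illustration). It tracks which Gateways exist and the peak count seen.
type Cluster struct {
	live map[string]bool
	peak int
}

func NewCluster() *Cluster { return &Cluster{live: map[string]bool{}} }

func (c *Cluster) create(name string) {
	c.live[name] = true
	if len(c.live) > c.peak {
		c.peak = len(c.live)
	}
}

func (c *Cluster) remove(name string) { delete(c.live, name) }

// Scope is a hypothetical per-test scope: every resource created through it
// is removed again when the scope is closed.
type Scope struct {
	cluster  *Cluster
	cleanups []func()
}

// CreateGateway creates a Gateway and registers its cleanup.
func (s *Scope) CreateGateway(name string) {
	s.cluster.create(name)
	s.cleanups = append(s.cleanups, func() { s.cluster.remove(name) })
}

// Close runs the registered cleanups in reverse creation order.
func (s *Scope) Close() {
	for i := len(s.cleanups) - 1; i >= 0; i-- {
		s.cleanups[i]()
	}
	s.cleanups = nil
}

// RunIsolated runs each test in its own scope, one after another.
func RunIsolated(c *Cluster, tests ...func(*Scope)) {
	for _, test := range tests {
		func() {
			s := &Scope{cluster: c}
			defer s.Close() // clean up even if the test panics
			test(s)
		}()
	}
}

func main() {
	c := NewCluster()
	RunIsolated(c,
		func(s *Scope) { s.CreateGateway("same-namespace") },
		func(s *Scope) { s.CreateGateway("weighted-a"); s.CreateGateway("weighted-b") },
	)
	fmt.Println("peak concurrent Gateways:", c.peak)
	fmt.Println("Gateways left behind:", len(c.live))
}
```

With this shape, the peak number of Gateways is bounded by the most demanding single test rather than by the sum of everything the suite provisions. In real Go tests, `t.Cleanup` from the standard `testing` package provides the same reverse-order cleanup behaviour.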
I don't think we ever got around to implementing it, but I remember discussing at some point that creating fewer Gateways in the tests would make it possible to pre-create the Gateways and have the conformance tests check that each Gateway matched. That would allow implementations that need a long time to provision Gateways (or that want to limit the number of Gateways provisioned) to still pass conformance.
Just wanted to make sure we don't lose that context when we solve this issue - which I absolutely agree we should.
/triage needs-information
What would you like to be added: Some changes to the conformance framework to reduce potential flakiness. Solutions might include:
Also open to any other alternatives.
Why this is needed: As we're submitting a conformance report for GKE (#3230), we found that the simplest reproduction steps could be flaky. This is because the features we support result in a unique and somewhat problematic sequence of tests running. We go from
simple-same-namespace:
https://github.com/kubernetes-sigs/gateway-api/blob/58d509872d5a58dc3ebee5980aa5b71239f46fc9/conformance/tests/httproute-simple-same-namespace.yaml#L7-L12
to weighted backends:
https://github.com/kubernetes-sigs/gateway-api/blob/58d509872d5a58dc3ebee5980aa5b71239f46fc9/conformance/tests/httproute-weight.yaml#L7-L19
Importantly, both tests use the same Gateway, matching criteria (any), and primary Service. This means that if the new routing configuration hasn't propagated yet, it will just look like we're not splitting traffic, which results in a flaky failure.
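To illustrate the failure mode, here is a rough sketch of one possible mitigation: re-sample the traffic split a few times before failing, so that a stale route from the previous test gets a chance to be replaced. This is an assumption for discussion, not something the conformance suite does today. `WaitForSplit`, `splitMatches`, and `staleThenSplit` are hypothetical names, and the 50/50 weights and backend names `v1`/`v2` are made up for the example.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// splitMatches reports whether the observed share of every expected backend
// is within tol of its expected share. Backends that are not listed in want
// are not checked.
func splitMatches(counts map[string]int, total int, want map[string]float64, tol float64) bool {
	for backend, share := range want {
		got := float64(counts[backend]) / float64(total)
		if math.Abs(got-share) > tol {
			return false
		}
	}
	return true
}

// WaitForSplit draws n samples per attempt (n must be positive) and succeeds
// as soon as one full batch matches the expected distribution.
func WaitForSplit(sample func() string, want map[string]float64, n int, tol float64, attempts int, interval time.Duration) error {
	for attempt := 1; attempt <= attempts; attempt++ {
		counts := map[string]int{}
		for i := 0; i < n; i++ {
			counts[sample()]++
		}
		if splitMatches(counts, n, want, tol) {
			return nil
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("traffic split did not converge after %d attempts", attempts)
}

// staleThenSplit simulates a data plane that still serves the previous
// route (everything to "v1") for the first stale requests, then alternates
// between "v1" and "v2".
func staleThenSplit(stale int) func() string {
	calls := 0
	return func() string {
		calls++
		if calls <= stale || calls%2 == 0 {
			return "v1"
		}
		return "v2"
	}
}

func main() {
	want := map[string]float64{"v1": 0.5, "v2": 0.5}
	err := WaitForSplit(staleThenSplit(150), want, 100, 0.1, 5, time.Millisecond)
	fmt.Println("converged:", err == nil)
}
```

A single batch taken while the old route is still active sees only `v1` and fails, which is the flake described above. Retrying whole batches tolerates that window, at the cost of a slower failure when the implementation really does not split traffic.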