dave-tucker opened this issue 4 years ago
As a follow-up, the following from GitHub may help illustrate the boundaries we're working within re: concurrency.
GitHub plan | Total concurrent jobs | Maximum concurrent macOS jobs |
---|---|---|
Free | 20 | 5 |
Pro | 40 | 5 |
Team | 60 | 5 |
Enterprise | 180 | 50 |
Also note that a single workflow can spawn up to 256 jobs maximum.
Perhaps a dumb question: Can we get rid of non-HA? What does that offer us?
Another dumb question? Do we need v6? What about just v4 and dual-stack?
> Another dumb question? Do we need v6? What about just v4 and dual-stack?
@squeed Per request from @trozet, I'm working on IPv6 only in CI so that we can verify that dual-stack doesn't break IPv6 only (which was actually already broken upstream and we didn't know because it is not tested).
> Perhaps a dumb question: Can we get rid of non-HA? What does that offer us?
It helps us differentiate if there is a failure due to HA functionality. If we see something consistently fail in HA, but not in noHA, we know the root cause is something to do with the HA path.
> Perhaps a dumb question: Can we get rid of non-HA? What does that offer us?
>
> It helps us differentiate if there is a failure due to HA functionality. If we see something consistently fail in HA, but not in noHA, we know the root cause is something to do with the HA path.
I see. That's a useful signal, but perhaps we can skip noHA by default?
> Perhaps a dumb question: Can we get rid of non-HA? What does that offer us?
>
> It helps us differentiate if there is a failure due to HA functionality. If we see something consistently fail in HA, but not in noHA, we know the root cause is something to do with the HA path.
>
> I see. That's a useful signal, but perhaps we can skip noHA by default?
Sure. Now that @dave-tucker has given us some comments we can use to trigger things, we could make that a trigger, e.g. `/run noha`. Is that possible @dave-tucker? Also, to cut down on retests, would it be possible to say something like `/retest "test case name"` and have a job fire that runs only that single test case?
We could also reduce matrix size by not doing a full cross join. For example, we probably don't need the full multiplication of (ha, noha) x (local, shared) x (v4, v6, dual-stack).

What if we did, by default,

- ha, local, (v4, v6, dual-stack)
- ha, shared, dual-stack

With the (possibly erroneous) assumption that if dual-stack works, v4 and v6 work for shared gateway.

Is it possible to run more tests if `/pkg/node` is touched?
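For illustration, a minimal sketch of what such a reduced matrix could look like using GitHub Actions' `include:` instead of a full cross join. The job name, option names, and setup script below are assumptions, not taken from the existing test.yml:

```yaml
# Illustrative only: enumerate the specific combinations instead of a full cross join.
jobs:
  e2e:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        include:
          - {ha: "HA", gateway: "local",  ipfamily: "ipv4"}
          - {ha: "HA", gateway: "local",  ipfamily: "ipv6"}
          - {ha: "HA", gateway: "local",  ipfamily: "dual"}
          - {ha: "HA", gateway: "shared", ipfamily: "dual"}
    steps:
      - uses: actions/checkout@v2
      # kind-setup.sh is a placeholder for whatever script brings up the KIND cluster;
      # the flags are hypothetical.
      - run: ./kind-setup.sh --ha "${{ matrix.ha }}" --gateway "${{ matrix.gateway }}" --ip-family "${{ matrix.ipfamily }}"
```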
> We could also reduce matrix size by not doing a full cross join. For example, we probably don't need the full multiplication of (ha, noha) x (local, shared) x (v4, v6, dual-stack)
>
> What if we did, by default,
>
> - ha, local, (v4, v6, dual-stack)
> - ha, shared, dual-stack
>
> With the (possibly erroneous) assumption that if dual-stack works, v4 and v6 work for shared gateway
>
> Is it possible to run more tests if `/pkg/node` is touched?
There's a possibility we may be moving to shared gw mode in the future, in which case we could drop all but some sanity on local. That would help some.
My IPv6 CI PR will add some `exclude:` examples. For example, I added a 'shard-ipv4' target and then exclude it in IPv6 mode. I'm not sure how to add more tests if a particular file is touched.
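For reference, the two mechanisms in play look roughly like the fragments below. These are illustrative only, not the real test.yml, and the shard/target names are made up; the `paths:` filter is one way GitHub Actions can "run more tests if /pkg/node is touched":

```yaml
# Fragment 1: a matrix exclude that drops an IPv4-only shard when running in IPv6 mode.
strategy:
  matrix:
    ipfamily: [ipv4, ipv6]
    target: [control-plane, shard-ipv4]
    exclude:
      - ipfamily: ipv6
        target: shard-ipv4
---
# Fragment 2: a separate workflow that only triggers when files under pkg/node change.
on:
  pull_request:
    paths:
      - 'pkg/node/**'
```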
> ha, shared, dual-stack
shared gateway doesn't even fully support single-stack IPv6 yet (#1141, though there have been drive-by fixes for some of that).
It will still be a while before dual-stack is even worth running automatically. And once dual-stack is working, we won't really need to run single-stack IPv6 regularly.
So I'd say for now:
and once dual-stack is ready, replace 'ha, local, v6' with 'ha, local, dual-stack'
@trozet there is no API to re-run jobs, just the entire workflow, so `/retest` can't take arguments. Adding `/test noHA` is possible. Someone would have to create `.github/workflows/test-no-ha.yml` from `test.yml`, include just the noHA stuff, and make a new `/test` action.
@dave-tucker couldn't we then simulate retesting a single e2e test by adding the same workflow as `/test noHA`, except `/test "some ginkgo focus"`, and make a workflow that takes that in as an arg?
The problem is that a workflow set up in that way won't replace the GitHub statuses on the PR :cry: It writes new ones, because the trigger is no longer `pull_request`, it's `repository_dispatch` or similar.
> The problem is that a workflow set up in that way won't replace the GitHub statuses on the PR :cry: It writes new ones, because the trigger is no longer `pull_request`, it's `repository_dispatch` or similar.
That is kind of lame, but this would still be useful to be able to run another test and see a result.
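A rough sketch of the kind of comment-driven job being discussed, assuming some bot or `issue_comment` workflow turns the comment into a `repository_dispatch` event carrying the ginkgo focus. The workflow name, event type, payload fields, and script are all hypothetical, and as noted above the resulting statuses show up as new checks rather than replacing the `pull_request` ones:

```yaml
# Hypothetical .github/workflows/test-focus.yml
on:
  repository_dispatch:
    types: [run-focused-e2e]   # fired when someone comments e.g. /test "some ginkgo focus"

jobs:
  focused-e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
        with:
          ref: ${{ github.event.client_payload.ref }}   # PR branch passed in the payload
      # run-e2e.sh is a placeholder; the real entry point would need a focus flag.
      - run: ./run-e2e.sh --ginkgo-focus "${{ github.event.client_payload.focus }}"
```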
Maybe it is not necessary to run all jobs pre-merge, just a subset of them, the most significant ones. Then run ALL of them nightly; the risk of regressions is higher, but I don't see a big number of PRs per day, so it would be easy to catch regressions with only one day of delay. https://jasonet.co/posts/scheduled-actions/
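A minimal scheduled-run sketch along those lines; the cron time, workflow name, and script are arbitrary placeholders:

```yaml
# Hypothetical .github/workflows/nightly.yml: run the full matrix once a day.
# Note that scheduled workflows run against the default branch, not open PRs.
on:
  schedule:
    - cron: '0 3 * * *'   # 03:00 UTC daily

jobs:
  full-matrix-e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: ./run-full-matrix.sh   # placeholder for the complete (ha/noha x gateway x ip-family) run
```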
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 5 days.
PR #1397 goes some way towards this, but I think we need to do more, especially given that #1343 is on the horizon.
Currently the test matrix includes these suites:
We then run these 5 jobs against the product of this matrix for every pull request (assuming v6/dual-stack, which is on the horizon).
Assuming all of these variants, that's 60 jobs per run :scream: Without v4/v6/dual-stack it's still 20.
Given how CI is set up now, we continue to run even if the initial build/lint fails. Usually authors are quick to respond and force-push a fix, but the old build isn't cancelled.
Proposal 1: Block running e2e tests until the initial Build has passed
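A sketch of what Proposal 1 could look like, assuming the build and e2e jobs live in the same workflow: `needs:` gates the e2e jobs on the build, and GitHub Actions' `concurrency` setting cancels a superseded run when the author force-pushes. Job names and commands are illustrative:

```yaml
# Illustrative: gate e2e on the build and cancel superseded runs on force-push.
concurrency:
  group: ci-${{ github.ref }}
  cancel-in-progress: true

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: make lint && make build   # placeholder build/lint commands

  e2e:
    needs: build        # e2e only starts once the build job has passed
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: ./run-e2e.sh   # placeholder
```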
As for the e2e tests, we need to simplify the matrix that we have.
Proposal 2: Move test orchestration outside of GitHub Actions

This would allow us to have one (or, at a push, two, if we can assume that dual-stack might prove out the v4/v6-only paths) top-level matrix item: the one that dictates the dimensions of the cluster required. Either `ginkgo`, a `bash` script or `make` could then be responsible for running the tests with the necessary features enabled on the cluster, assuming this can be done by configuration only.

If that's not possible, and all matrix elements require doing things at install time with the cluster, then we should...
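As a rough illustration of Proposal 2, assuming the per-feature toggles really can be applied to a running cluster by configuration, the matrix would only describe the cluster shape and a single script would own the rest. The script names and matrix values here are placeholders:

```yaml
# Illustrative: the matrix describes only what forces a different cluster;
# everything else is handled by a script on the same cluster.
jobs:
  e2e:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        ipfamily: [ipv4, dual]   # assume only IP family changes the cluster shape
    steps:
      - uses: actions/checkout@v2
      - run: ./kind-setup.sh --ip-family "${{ matrix.ipfamily }}"   # placeholder cluster bring-up
      # run-suites.sh stands in for a make target / bash / ginkgo wrapper that toggles
      # gateway mode, HA, etc. by configuration and runs each suite in turn.
      - run: ./run-suites.sh
```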
Proposal 3: Move cluster creation outside of GitHub Actions

This would require using something (e.g. terraform, or maybe a remote driver for KIND) to prepare cloud instances that can be used for testing. We might also need an image registry for the images built during the PR build. The benefit here is that we're not capped on how many parallel machines we can create for testing. The drawback is that we'd have to create additional rigging for log streaming, test reporting and result aggregation.
Proposal 4: Move some testing to post-merge
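Proposal 4 could be as simple as a second workflow keyed off pushes to master (or the nightly cron sketched earlier) that runs the combinations dropped from the PR matrix; the script here is a placeholder:

```yaml
# Hypothetical post-merge workflow: run the heavier combinations only after merge.
on:
  push:
    branches: [master]

jobs:
  extended-e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: ./run-extended-matrix.sh   # placeholder for the suites skipped pre-merge
```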
There may be other options here I haven't thought of, so I'd be interested in others' thoughts or opinions on how we could improve the current situation.