dave-tucker opened this issue 4 years ago
As a follow-up, the following from GitHub may help illustrate the boundaries we're working within re: concurrency.
GitHub plan | Total concurrent jobs | Maximum concurrent macOS jobs |
---|---|---|
Free | 20 | 5 |
Pro | 40 | 5 |
Team | 60 | 5 |
Enterprise | 180 | 50 |
Also note that a single workflow can spawn up to 256 jobs maximum.
Perhaps a dumb question: Can we get rid of non-HA? What does that offer us?
Another dumb question? Do we need v6? What about just v4 and dual-stack?
> Another dumb question? Do we need v6? What about just v4 and dual-stack?
@squeed Per request from @trozet, I'm working on IPv6 only in CI so that we can verify that dual-stack doesn't break IPv6 only (which was actually already broken upstream and we didn't know because it is not tested).
> Perhaps a dumb question: Can we get rid of non-HA? What does that offer us?
It helps us differentiate if there is a failure due to HA functionality. If we see something consistently fail in HA, but not in noHA, we know the root cause is something to do with the HA path.
> Perhaps a dumb question: Can we get rid of non-HA? What does that offer us?
>
> It helps us differentiate if there is a failure due to HA functionality. If we see something consistently fail in HA, but not in noHA, we know the root cause is something to do with the HA path.
I see. That's a useful signal, but perhaps we can skip noHA by default?
> Perhaps a dumb question: Can we get rid of non-HA? What does that offer us?
>
> It helps us differentiate if there is a failure due to HA functionality. If we see something consistently fail in HA, but not in noHA, we know the root cause is something to do with the HA path.
>
> I see. That's a useful signal, but perhaps we can skip noHA by default?
Sure. Now that @dave-tucker has given us some comments we can use to trigger things, we could make that a trigger, e.g. `/run noha`. Is that possible @dave-tucker? Also, to cut down on retests, would it be possible to say something like `/retest "test case name"` and have a job fire that runs only that single test case?
We could also reduce matrix size by not doing a full cross join. For example, we probably don't need the full multiplication of (ha, noha) x (local, shared) x (v4, v6, dual-stack).

What if we did, by default,

- ha, local, (v4, v6, dual-stack)
- ha, shared, dual-stack

With the (possibly erroneous) assumption that if dual-stack works, v4 and v6 work for shared gateway.

Is it possible to run more tests if `/pkg/node` is touched?
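For illustration, a minimal sketch of what such a reduced matrix could look like using GitHub Actions' `include:` instead of a full cross join. The job name, option names, and setup script below are assumptions, not taken from the existing test.yml:

```yaml
# Illustrative only: enumerate the specific combinations instead of a full cross join.
jobs:
  e2e:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        include:
          - {ha: "HA", gateway: "local",  ipfamily: "ipv4"}
          - {ha: "HA", gateway: "local",  ipfamily: "ipv6"}
          - {ha: "HA", gateway: "local",  ipfamily: "dual"}
          - {ha: "HA", gateway: "shared", ipfamily: "dual"}
    steps:
      - uses: actions/checkout@v2
      # kind-setup.sh is a placeholder for whatever script brings up the KIND cluster;
      # the flags are hypothetical.
      - run: ./kind-setup.sh --ha "${{ matrix.ha }}" --gateway "${{ matrix.gateway }}" --ip-family "${{ matrix.ipfamily }}"
```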
> We could also reduce matrix size by not doing a full cross join. For example, we probably don't need the full multiplication of (ha, noha) x (local, shared) x (v4, v6, dual-stack)
>
> What if we did, by default,
>
> - ha, local, (v4, v6, dual-stack)
> - ha, shared, dual-stack
>
> With the (possibly erroneous) assumption that if dual-stack works, v4 and v6 work for shared gateway
>
> Is it possible to run more tests if `/pkg/node` is touched?
There's a possibility we may be moving to shared gw mode in the future, in which case we could drop all but some sanity on local. That would help some.
My IPv6 CI PR will add some `exclude:` examples. For example, I added a 'shard-ipv4' target and then exclude it in IPv6 mode. I'm not sure how to add more tests if a particular file is touched.
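For reference, the two mechanisms in play look roughly like the fragments below. These are illustrative only, not the real test.yml, and the shard/target names are made up; the `paths:` filter is one way GitHub Actions can "run more tests if /pkg/node is touched":

```yaml
# Fragment 1: a matrix exclude that drops an IPv4-only shard when running in IPv6 mode.
strategy:
  matrix:
    ipfamily: [ipv4, ipv6]
    target: [control-plane, shard-ipv4]
    exclude:
      - ipfamily: ipv6
        target: shard-ipv4
---
# Fragment 2: a separate workflow that only triggers when files under pkg/node change.
on:
  pull_request:
    paths:
      - 'pkg/node/**'
```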
> ha, shared, dual-stack
shared gateway doesn't even fully support single-stack IPv6 yet (#1141, though there have been drive-by fixes for some of that).
It will still be a while before dual-stack is even worth running automatically. And once dual-stack is working, we won't really need to run single-stack IPv6 regularly.
So I'd say for now:
and once dual-stack is ready, replace 'ha, local, v6' with 'ha, local, dual-stack'
@trozet there is no API to re-run jobs, just the entire workflow, so `/retest` can't take arguments. Adding `/test noHA` is possible. Someone would have to create `.github/workflows/test-no-ha.yml` from `test.yml`, include just the noHA stuff, and make a new `/test` action.
@dave-tucker couldn't we then simulate retesting a single e2e test by adding the same workflow as `/test noHA`, except `/test "some ginkgo focus"`, and make a workflow that takes that in as an arg?
The problem is that a workflow set up in that way won't replace the GitHub statuses on the PR :cry: It writes new ones, because the trigger is no longer `pull_request`, it's `repository_dispatch` or similar.
> The problem is that a workflow set up in that way won't replace the GitHub statuses on the PR :cry: It writes new ones, because the trigger is no longer `pull_request`, it's `repository_dispatch` or similar.
That is kind of lame, but this would still be useful to be able to run another test and see a result.
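A rough sketch of the kind of comment-driven job being discussed, assuming some bot or `issue_comment` workflow turns the comment into a `repository_dispatch` event carrying the ginkgo focus. The workflow name, event type, payload fields, and script are all hypothetical, and as noted above the resulting statuses show up as new checks rather than replacing the `pull_request` ones:

```yaml
# Hypothetical .github/workflows/test-focus.yml
on:
  repository_dispatch:
    types: [run-focused-e2e]   # fired when someone comments e.g. /test "some ginkgo focus"

jobs:
  focused-e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
        with:
          ref: ${{ github.event.client_payload.ref }}   # PR branch passed in the payload
      # run-e2e.sh is a placeholder; the real entry point would need a focus flag.
      - run: ./run-e2e.sh --ginkgo-focus "${{ github.event.client_payload.focus }}"
```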
Maybe it is not necessary to run all jobs pre-merge, just a subset of them, the most significant ones. Then run ALL of them nightly; the risk of regressions is higher, but I don't see a big number of PRs per day, so it would be easy to catch regressions with only one day of delay. https://jasonet.co/posts/scheduled-actions/
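A minimal scheduled-run sketch along those lines; the cron time, workflow name, and script are arbitrary placeholders:

```yaml
# Hypothetical .github/workflows/nightly.yml: run the full matrix once a day.
# Note that scheduled workflows run against the default branch, not open PRs.
on:
  schedule:
    - cron: '0 3 * * *'   # 03:00 UTC daily

jobs:
  full-matrix-e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: ./run-full-matrix.sh   # placeholder for the complete (ha/noha x gateway x ip-family) run
```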
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 5 days.
PR #1397 goes some way towards this, but I think we need to do more, especially given that #1343 is on the horizon.
Currently the test matrix includes these suites:
We then run these 5 jobs against the product of this matrix for every pull request (assuming v6/dual-stack, which is on the horizon).
Assuming all of these variants, that's 60 jobs per run :scream: Without v4/v6/dual-stack it's still 20.
Given how CI is set up now, we continue to run even if the initial build/lint fails. Usually authors are quick to respond and force-push a fix, but the old build isn't cancelled.
Proposal 1: Block running e2e tests until the initial Build has passed
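A sketch of what Proposal 1 could look like, assuming the build and e2e jobs live in the same workflow: `needs:` gates the e2e jobs on the build, and GitHub Actions' `concurrency` setting cancels a superseded run when the author force-pushes. Job names and commands are illustrative:

```yaml
# Illustrative: gate e2e on the build and cancel superseded runs on force-push.
concurrency:
  group: ci-${{ github.ref }}
  cancel-in-progress: true

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: make lint && make build   # placeholder build/lint commands

  e2e:
    needs: build        # e2e only starts once the build job has passed
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: ./run-e2e.sh   # placeholder
```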
As for the e2e tests, we need to simplify the matrix that we have.
Proposal 2: Move test orchestration outside of GitHub Actions

This would allow us to have one (or, at a push, two, if we can assume that dual-stack might prove out the v4/v6-only paths) top-level matrix item: the one that dictates the dimensions of the cluster required. Either `ginkgo`, a `bash` script or `make` could then be responsible for running the tests with the necessary features enabled on the cluster, assuming this can be done by configuration only.

If that's not possible, and all matrix elements require doing things at install time with the cluster, then we should...
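As a rough illustration of Proposal 2, assuming the per-feature toggles really can be applied to a running cluster by configuration, the matrix would only describe the cluster shape and a single script would own the rest. The script names and matrix values here are placeholders:

```yaml
# Illustrative: the matrix describes only what forces a different cluster;
# everything else is handled by a script on the same cluster.
jobs:
  e2e:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        ipfamily: [ipv4, dual]   # assume only IP family changes the cluster shape
    steps:
      - uses: actions/checkout@v2
      - run: ./kind-setup.sh --ip-family "${{ matrix.ipfamily }}"   # placeholder cluster bring-up
      # run-suites.sh stands in for a make target / bash / ginkgo wrapper that toggles
      # gateway mode, HA, etc. by configuration and runs each suite in turn.
      - run: ./run-suites.sh
```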
Proposal 3: Move cluster creation outside of GitHub Actions

This would require using something (e.g. terraform, or maybe a remote driver for KIND) to prepare cloud instances that can be used for testing. We might also need an image registry for the images built during the PR build. The benefit here is that we're not capped on how many parallel machines we can create for testing. The drawback is that we'd have to create additional rigging for log streaming, test reporting and result aggregation.
Proposal 4: Move some testing to post-merge
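Proposal 4 could be as simple as a second workflow keyed off pushes to master (or the nightly cron sketched earlier) that runs the combinations dropped from the PR matrix; the script here is a placeholder:

```yaml
# Hypothetical post-merge workflow: run the heavier combinations only after merge.
on:
  push:
    branches: [master]

jobs:
  extended-e2e:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - run: ./run-extended-matrix.sh   # placeholder for the suites skipped pre-merge
```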
There may be other options here I haven't thought of, so I'd be interested in others' thoughts or opinions on how we could improve the current situation.