Some potentially useful info here that @burmanm mentioned earlier: https://github.blog/changelog/2021-09-20-github-actions-ephemeral-self-hosted-runners-new-webhooks-for-auto-scaling/
Looks like they are working towards making it easier to only run the runners as needed. For this first effort we can still take the easy path, but it might be worth a quick look to see how it works at least.
I changed the title of this issue from "integration tests" to "e2e tests" since there is a distinction to be made within the tests, and this issue is about e2e tests, not integration tests.
Integration tests are tests that use envtest. They run against an actual API server and etcd, but not in a full-blown cluster.
I also want to point out that the goal for this ticket is not to have e2e tests running against other k8s distros, but rather to keep the existing tests running against kind on a machine with more resources than what the GHA free runner offers.
We have #112 to add some initial support for running against different k8s distros.
Assuming that the test instability is due to lack of resources, we need to use a machine instance type with sufficient resources but not too much. I say that because we should still be able to run these tests locally as well. Let me know if this doesn't make sense :)
Lastly, we should limit the tests that run on self-hosted runners to those that absolutely need the additional resources above and beyond what the free runner provides.
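To make that split concrete, a job layout along these lines is roughly what I have in mind (job names, labels, and make targets are illustrative, not what we run today):

```yaml
# Sketch: keep most suites on the free GitHub-hosted runner and send only
# the heavy multi-cluster e2e job to the self-hosted machine via labels.
jobs:
  e2e-single-cluster:
    runs-on: ubuntu-latest                      # free-tier runner
    steps:
      - uses: actions/checkout@v2
      - name: Run single-cluster e2e tests
        run: make e2e-test E2E_TEST=TestOperator/SingleDatacenter   # illustrative target

  e2e-multi-cluster:
    runs-on: [self-hosted, linux, x64]          # only this job needs the bigger machine
    steps:
      - uses: actions/checkout@v2
      - name: Run multi-cluster e2e tests
        run: make e2e-test E2E_TEST=TestOperator/MultiDatacenter    # illustrative target
```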
In the short term, would there be benefits to running all tests on both a high-resource and a low-resource runner? This might help us detect tests which flake when resource-constrained. Once we've diagnosed them all, we could revert back to a low-resource runner for all tests.
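A runner matrix could cover that without duplicating the workflow (the bare `self-hosted` label and the make target are illustrative):

```yaml
# Sketch: run the same e2e suite on the free-tier runner and on the
# self-hosted runner so flakiness can be compared between the two.
jobs:
  e2e:
    strategy:
      fail-fast: false                 # let both legs finish even if one flakes
      matrix:
        runner: [ubuntu-latest, self-hosted]
    runs-on: ${{ matrix.runner }}
    steps:
      - uses: actions/checkout@v2
      - name: Run e2e tests
        run: make e2e-test             # illustrative target
```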
I think we already have the low-resource runner covered with the existing free tier runner :) I want to run as many tests as we can there since 1) we get the parallelization for free, 2) we don't have to maintain environments, 3) security is less of a concern.
Sure, I just mean to say that it would be good to continue running all tests there in the short term, as well as on whatever new infra we set up. Comparing results across both might be insightful.
Once we have a stable environment for flaky tests (assuming limited resources is the issue) and we have more reliable test runs, I could see the benefit of still running the bigger tests with the free runner. Trying to make it work has definitely helped us identify different issues.
@Miles-Garnsey can you give us an update on where we're at with the attempt at the self-hosted runners?
It looks like this issue got closed out by the test PR in your fork. I'm guessing we don't want this closed yet, right?
What we're hoping to do is be able to run this test:
On the new runner that you're getting attached to the org, is that now possible with what you've found to this point?
@jdonenine I was able to get tests running on cass-operator and they produced the same behaviour that I saw locally (which was still failures due to the cert-manager webhook not being ready, but at least it is a consistent failure).
I placed further work on moving this over to k8ssandra-operator on hold because it was chewing through time I wanted to use to close out other issues, and it sounded like it was no longer a priority - @jsanda is that correct, or do we still need self-hosted runners on k8ssandra-operator?
We still need it for k8ssandra-operator but I don't consider it an immediate priority since we have tests stabilized right now. I still consider it a high priority though because we are going to need it eventually. I would like to continue making progress with it, but I don't think it needs to consume all your time.
Did we leave the runner up and configured? I don't see any on the org.
No, I didn't create it in our org because I need to be an owner. I created it on my repos.
I've stopped the GCP instance too to save $$ but I can have it back up pretty fast if you want to take a look - I think I'll need to be an org owner though.
EDIT: I'll also want to switch our PR configs over to requiring manual approvals for test runs - let me know if that is OK too.
I think manual approvals is fine, at least for now. Do you know if it is possible to do manual approval for a specific workflow?
I think it is per-repository, so we can't set them for specific workflows unfortunately.
What is missing?
The capability to run more resource-intensive integration tests on hardware with greater resources than the free-tier GH runners offer.
Why do we need it?
We have continuously faced stability issues with our larger multi-cluster integration tests that we hope can be resolved by using a runner with greater resources available to it.
This first investigation will be to confirm that system resources are indeed the source of our instability. We can take the simplest route possible: a dedicated VM configured to work with GHA, targeted by our multi-cluster integration tests - all other tests should continue to run on the free-tier runners.
If we can establish success with this, we will next look for creative ways to create a pool of self-hosted runners sufficient to fit our parallel execution needs, while trying to account for the added cost this would bring.
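For reference, fanning the big suites out across such a pool could look roughly like the sketch below (the test names and the `e2e` runner label are illustrative); GitHub Actions queues each matrix job onto whichever labelled runner in the pool is free:

```yaml
# Sketch: one job per heavy e2e test, distributed across a pool of
# self-hosted runners that all carry a hypothetical "e2e" label.
jobs:
  multi-cluster-e2e:
    runs-on: [self-hosted, e2e]
    strategy:
      fail-fast: false
      matrix:
        test:                          # illustrative test names
          - CreateMultiDatacenterCluster
          - AddDcToCluster
          - RemoveDcFromCluster
    steps:
      - uses: actions/checkout@v2
      - name: Run ${{ matrix.test }}
        run: make e2e-test E2E_TEST=TestOperator/${{ matrix.test }}   # illustrative target
```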
Issue is synchronized with this Jira Task by Unito. Issue Number: K8SSAND-933. Priority: Medium