pangeo-data / pangeo-integration-tests

Integration testing for the Pangeo cloud ecosystem
Apache License 2.0

Develop detailed instructions for setting up kubernetes cluster for running these tests #4

Open rabernat opened 3 years ago

rabernat commented 3 years ago

As described by @nbren12 in https://github.com/pangeo-data/pangeo-integration-tests/issues/1#issuecomment-839063503, we will run these tests as a cron job in a kubernetes cluster. Something like this:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: pangeo-integration-test
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: account-with-needed-privs
          containers:
          - name: integration-test
            image: pangeo-integration-test:latest
            imagePullPolicy: IfNotPresent
            # all logic happens here
            # and publishes an integration report to a static webpage
            command: ["/bin/bash", "run_tests.sh"]

          restartPolicy: OnFailure

How should this kubernetes cluster be configured? How does one set up such a cluster from scratch on AWS, GCloud, and Azure? Could we use terraform? Can we run it in existing Pangeo clusters?

The easier we can make this step, the easier it will be to get different testing environments set up.

rabernat commented 3 years ago

A related question is: how do we want to handle Dask? Is a distributed cluster required? Can we get away with a LocalCluster, or do we need an actual Dask Kubernetes / Gateway cluster?

nbren12 commented 3 years ago

Once #3 is pushing images, I can start experimenting with our cluster, and try to figure out what if any specific configurations need to change.

nbren12 commented 3 years ago

how do we want to handle Dask.

Since the cronjob will be running in a pod on the cluster it should be possible to use https://kubernetes.dask.org/en/latest/kubecluster.html#dask_kubernetes.KubeCluster. I did a few K8s jobs like this a while ago, and can dig up the yamls. My recollection is that this worked pretty well.
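
For reference, the worker spec that KubeCluster.from_yaml() consumes is just a plain pod template. A minimal sketch could look like the following (the image name, labels, and resource sizes are placeholders, not tested values):

# worker-spec.yaml -- hypothetical pod template for KubeCluster.from_yaml()
kind: Pod
metadata:
  labels:
    app: pangeo-integration-test-worker
spec:
  restartPolicy: Never
  containers:
  - name: dask-worker
    image: pangeo-integration-test:latest   # placeholder; should match the test environment
    imagePullPolicy: IfNotPresent
    args: [dask-worker, --nthreads, "2", --memory-limit, 4GB, --death-timeout, "60"]
    resources:
      requests:
        cpu: "2"
        memory: 4G
      limits:
        cpu: "2"
        memory: 4G

Inside the cronjob pod, KubeCluster.from_yaml("worker-spec.yaml") followed by cluster.scale(n) would then launch workers under the same service account, with no extra auth layer.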

nbren12 commented 3 years ago

Whether the dask/k8s integration is something we need to test, though, I don't know. There are other ways (e.g. Dataflow) to put these libraries through some serious workloads, if that is what we want.

rabernat commented 3 years ago

We also may be able to reproduce the main failure modes through a local distributed cluster.

Edit: my feeling is that if we put this on a big machine (64 GB RAM / 16 cores) and use a LocalCluster with processes, we can get pretty far without having to mess with dask_kubernetes.
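
In Kubernetes terms that sizing is just resource requests/limits on the test container in the CronJob sketch above, along the lines of this illustrative fragment (the numbers are placeholders, not a recommendation):

containers:
- name: integration-test
  image: pangeo-integration-test:latest
  command: ["/bin/bash", "run_tests.sh"]
  resources:
    requests:
      cpu: "16"
      memory: 64Gi
    limits:
      cpu: "16"
      memory: 64Gi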

nbren12 commented 3 years ago

Does github actions have nodes of that magnitude?

rabernat commented 3 years ago

No, definitely not. That would have to be inside Kubernetes. Just saying we don't necessarily need dask_kubernetes.

nbren12 commented 3 years ago

Ok. It would be cool if this repo provided a simple helm install or kustomization based template that we could use to install the cronjob into an existing cluster. We can start with a single k8s manifest and scale up from there.
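
For example, a site overlay could be as small as a kustomization that pulls in the shared manifest and layers on local config. A hypothetical sketch (file names and values are placeholders):

# kustomization.yaml -- hypothetical site overlay
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - cronjob.yaml            # the shared CronJob manifest from this repo
configMapGenerator:
  - name: site-config       # site-specific settings mounted into the test pod
    literals:
      - STORAGE_BUCKET=gs://example-bucket   # placeholder value

Installing it would then just be kubectl apply -k . in that directory.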

rabernat commented 3 years ago

I don't have enough knowledge to contribute much here, and in general we are quite short-handed in Pangeo in terms of devops expertise right now.

Thus a helm chart or other "for dummies" level instructions will be very helpful. Thanks Noah for taking the lead here.

yuvipanda commented 3 years ago

How about we run a github self-hosted runner on a Kubernetes cluster that also has dask gateway configured? That way, we get all the goodness of GitHub actions without having to deal with nasty auth stuff.

@rabernat if you have credits, I can try setting up a 64 GB / 16 core self-hosted runner.

yuvipanda commented 3 years ago

or @nbren12 could too! doesn't have to be me :) But I think that's a neat way to get started here. And then we can just trigger action on a schedule so it keeps running.

rabernat commented 3 years ago

I do have credits! I would like to set up a new Google Cloud project for this, and to do that I need to go through my university to access the billing account linked to our NSF funding. May take a few days.

nbren12 commented 3 years ago

@yuvipanda That sounds like a good idea! I didn't know about github self hosted runners. Would this allow for multiple different clusters to all trigger based on PRs from this repo? The federated model is something I would like to pursue, without tying this repo to one specific cluster.

That way, we get all the goodness of GitHub actions without having to deal with nasty auth stuff.

Why is dask gateway required for this? This is why I proposed running these tests within the cluster, so that no additional authentication/networking patterns are needed beyond K8s RBACs, which the person installing needs to have anyway. Provided the runner is in a Pod with the correct service account, it should be able to create worker pods without needing any added authentication layers. We don't use dask gateway or jupyterhub so it would be nice to keep the footprint of these tests as small as possible.
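
For concreteness, the "correct service account" piece is roughly a ServiceAccount plus a namespaced Role/RoleBinding that lets the test pod create and clean up worker pods. A rough sketch (names are placeholders; the verbs follow what dask worker management generally needs):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: pangeo-integration-test
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pangeo-integration-test
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch", "create", "delete"]
- apiGroups: [""]
  resources: ["pods/log"]
  verbs: ["get", "list"]
- apiGroups: [""]
  resources: ["services"]
  verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pangeo-integration-test
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pangeo-integration-test
subjects:
- kind: ServiceAccount
  name: pangeo-integration-test

The serviceAccountName in the CronJob would then point at that service account.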

I'm envisioning that anyone could add the tests to their cluster like this:

kubectl apply -f https://github.com/pangeo-data/pangeo-integration-tests/ci-runner.yaml -f site-specific-configmap.yaml

and make a PR here sharing the link to the reports.

yuvipanda commented 3 years ago

Why is dask gateway required for this?

Oh yeah, dask gateway is definitely not necessary here. You can run things with just LocalCluster on a big machine, or just have the runner in a pod in a kubernetes cluster with appropriate service accounts. Whatever needs to work.

The federated model is something I would like to pursue, without tying this repo to one specific cluster.

Absolutely. You can have many different runners, and have different workflows trigger different runners. So we can have a script or helm or kubeconfig that people can apply, and it'll spin up a GitHub actions runner that can connect back to this repo. Then we can label which tests should run there, and they'll be run on the next round. It's fairly flexible.
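
As a sketch, a scheduled workflow in this repo targeting a labelled self-hosted runner might look like this (the runner label, schedule, and workflow path are placeholders):

# .github/workflows/integration.yaml -- hypothetical scheduled workflow
name: integration-tests
on:
  schedule:
    - cron: "0 6 * * *"      # placeholder: once a day
  workflow_dispatch: {}       # allow manual runs too
jobs:
  run-tests:
    # "pangeo-gcp" is a placeholder label identifying one site's runner
    runs-on: [self-hosted, pangeo-gcp]
    steps:
      - uses: actions/checkout@v2
      - name: Run integration tests
        run: bash run_tests.sh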

Agree on the simplest possible setup. In this case, dask gateway isn't what's being tested, so we don't have to use it unless we are specifically testing that.

nbren12 commented 3 years ago

Github does caution against using self hosted runners on public repos for security reasons:

https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners#self-hosted-runner-security-with-public-repositories

This is good food for thought in general about how to do this in a secure way. Should we always assume that the code on master is safe to run in a privileged environment?

Jenkins X and Argo CD are alternative CD systems we could look into: https://argoproj.github.io/argo-cd/, but I think the cronjob is a relatively simple way to get started that we could build on top of.

yuvipanda commented 3 years ago

GitHub very recently got a feature called environments that lets users manually approve runs if needed. However, I think if code is on master it should be considered trusted; code in PRs should not be. This is their rationale too: "This is because forks of your repository can potentially run dangerous code on your self-hosted runner machine by creating a pull request that executes the code in a workflow."
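
As a sketch, gating a job on a manually approved environment is just an environment key on the job (the environment name and runner label here are placeholders):

jobs:
  run-tests:
    runs-on: [self-hosted, pangeo-gcp]   # placeholder runner label
    # "integration-cluster" is a hypothetical environment configured with
    # required reviewers, so a maintainer approves before the job reaches the runner
    environment: integration-cluster
    steps:
      - name: Run integration tests
        run: bash run_tests.sh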

yuvipanda commented 3 years ago

I agree that something like Argo might also be nice if things get complex. I do want to try out GitHub environments, though; at least I think they give me the one missing piece that was a big draw for moving towards something like Argo...

nbren12 commented 3 years ago

@yuvipanda I don't have a ton of bandwidth to work on this now, so if you're stoked on this in the short term, then I'd be happy to test whatever you come up with in our cluster. I'm just mindful to keep the tech stack as thin as possible, since I have no idea how to configure/run any of these in-cluster CI/CD frameworks. I'm excited to see what we come up with, and whether we can emulate this pattern in our current test suites.

With github actions I assume we wouldn't need to host any web visible http service? This is workable with things like K8s ingress, but fairly complicated and maybe cloud provider specific. I really like the idea of github workflows just sending a message to github.com when it completes.

yuvipanda commented 3 years ago

Sorry if I seemed out of line, @nbren12. I did get pretty stoked and have a lot of short term energy to put into this, since I've been meaning to find a use case for github self hosted runners. I apologize if my communication style came out wrong.

I'm just mindful to keep the tech stack as thin as possible, since I have no idea how to configure/run any of these in-cluster CI/CD frameworks.

Me too!

With github actions I assume we wouldn't need to host any web visible http service? This is workable with things like K8s ingress, but fairly complicated and maybe cloud provider specific.

Indeed, no network traffic comes from GitHub to the runner - the runner polls GitHub for everything. This makes the networking setup easy across cloud providers, since no ingress needs to happen!

nbren12 commented 3 years ago

I apologize if my communication style came out wrong.

Not at all! I wasn't miffed at all! Just stoked that you are interested, and I am happy to help.

rabernat commented 3 years ago

Just a little update: @yuvipanda now has credentials for a new, Columbia-owned GCP project devoted to this. Yuvi, let us know what we can do to help with the next steps.