redhat-performance / scale-ci-ansible

The Scale CI automation Ansible playbooks

WIP POC of Setup tooling and running a workload #232

Open akrzos opened 5 years ago

akrzos commented 5 years ago

POC Setup Tooling in a single repo

The only external dependency is a container image for the controller/pbench agent.

DNM - Do not merge, just test/provide feedback. Thanks!

akrzos commented 5 years ago

FYI @chaitanyaenr

akrzos commented 5 years ago

@jmencak @sjug @mffiedler @chaitanyaenr Hey guys, can you provide some feedback on this approach to improving our tooling and potentially running workloads? (An example is the included configurable NodeVertical job, which can run with or without pbench agents and is launched simply with ansible-playbook.)
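For context, a minimal sketch of what an invocation could look like. The playbook path and inventory below are placeholders, not the exact files in this PR; enable_pbench_agents is the pbench toggle mentioned later in this thread:

```sh
# Sketch only: playbook path/inventory are placeholders, not the exact files
# in this PR; enable_pbench_agents is the pbench toggle discussed below.
ansible-playbook -i inventory workloads/nodevertical.yml \
  -e enable_pbench_agents=true     # run with pbench agents
ansible-playbook -i inventory workloads/nodevertical.yml \
  -e enable_pbench_agents=false    # run without pbench agents
```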

jmencak commented 5 years ago

Testing right now. The first thing I've noticed is that the openshift-install logs no longer go to the OPENSHIFT_INSTALL_LOG_DIR that the user defines. Could be unrelated to this PR, though.

sjug commented 5 years ago

Looks good to me, nothing else to add that we didn't already discuss. Once the CL fixes merge, they will fix the directory/path issues that you're working around in the respective shell script.

chaitanyaenr commented 5 years ago

@jmencak About the install log: we copy the log generated by the installer to OPENSHIFT_INSTALL_LOG_DIR at the end of the install instead of tee'ing stdout, since the installer's own log contains timestamps as well.
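A rough sketch of the difference, assuming the installer's own timestamped log file (.openshift_install.log) in the install directory; the exact variable names and paths in the playbooks may differ:

```sh
# Previously: tee stdout (lines carry no timestamps)
#   openshift-install create cluster --dir "$INSTALL_DIR" | tee "$OPENSHIFT_INSTALL_LOG_DIR/install.log"
# Now: run the install, then copy the installer's own timestamped log at the end
openshift-install create cluster --dir "$INSTALL_DIR"
cp "$INSTALL_DIR/.openshift_install.log" "$OPENSHIFT_INSTALL_LOG_DIR/"
```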

jmencak commented 5 years ago

Thanks for preparing this! In general, I like the idea of centralizing most of the tooling into one repo. However, is the intention to keep all files necessary for running a specific workload in this repo and this repo only? Some workloads carry quite a few files, and unless (1) the run.sh script clones the whole test repo or (2) a workload container image is used that already has the test repos baked in, this would require huge workload-<testname>-script-cm.yml.j2 files duplicating the content of existing test repos. While this is certainly doable if we want to follow this path, the *.j2 files simply do not look very elegant to me, but the other approaches I can currently think of also have their disadvantages:

(1) limits the "run-on-any-cluster" goal because it requires external access to GitHub; (2) requires frequent rebuilding, and probably tagging, of the workload container image.

Nits: to make this run, I had to:

  • use an old nightly 4.1.0-0.nightly-2019-05-16-090009
  • change the default(fales, true) -> default(false, true) for enable_pbench_agents
  • set workload_job_privileged to false

akrzos commented 5 years ago

Thanks for preparing this! In general, I like the idea of centralizing most of the tooling into one repo.

Thanks Jiri!

However, is the intention to keep all files necessary for running a specific workload in this repo and this repo only?

The main intention is to reduce the burden of running a workload. Right now, as we all know, it not only requires a cluster built by our install automation, but the tests are also spread across several repos with inter-dependencies. I also think defining some clear boundaries on what belongs where will make it easier for all of us to run anyone else's workload. (For example, our workload container shouldn't be a catch-all, but rather just the image that hosts the tools/binaries we need for a workload.)

The process here to set up pbench and run NodeVertical is greatly simplified, and by using Ansible we can easily orchestrate this in Jenkins (in the same fashion as the install jobs for a scale-ci cluster) and/or run it from your local machine or point it at a jump host / orchestration host. It provides great flexibility while remaining simple to run. Of course this is just one of the several workloads we have; ideally we can get all workloads into the same repo to reduce the repo sprawl that has occurred.

The other objective of this POC was to remove as many host-mounts as possible from the workload/pbench pods. This decouples the workloads from our install process; in fact, we can already eliminate the post-install steps that copy the kubeconfig and ssh keys to the nodes, since this implementation stores them in secrets.
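As a sketch of that approach (the secret names below are placeholders, not necessarily the ones used in this PR), creating the secrets and consuming them as pod volumes instead of hostPath mounts could look like:

```sh
# Placeholder secret names; the PR may name/namespace these differently.
oc create secret generic scale-ci-kubeconfig \
  --from-file=kubeconfig="$KUBECONFIG"
oc create secret generic scale-ci-ssh-keys \
  --from-file=id_rsa="$HOME/.ssh/id_rsa" \
  --from-file=id_rsa.pub="$HOME/.ssh/id_rsa.pub"
# The workload/pbench pod templates then mount these secrets as volumes,
# removing the need for hostPath mounts populated during install.
```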

Some workloads carry quite a few files, and unless (1) the run.sh script clones the whole test repo or (2) a workload container image is used that already has the test repos baked in, this would require huge workload-<testname>-script-cm.yml.j2 files duplicating the content of existing test repos. While this is certainly doable if we want to follow this path, the *.j2 files simply do not look very elegant to me, but the other approaches I can currently think of also have their disadvantages:

You bring up a good point; however, I do believe the j2 file is far more elegant than cross-repo inter-dependencies like today's setup tooling job. But we can do better than one large j2 file. I envision each workload generally following the same concepts laid out here, but without super strict compliance with them. We could lay out only the items that require some configuration/templating in the j2 and place the ones that don't in a separate file with the correct extension.

(1) limits the "run-on-any-cluster" goal because it requires external access to GitHub; (2) requires frequent rebuilding, and probably tagging, of the workload container image.

The workload container image is automatically built from a Dockerfile in this repo https://github.com/openshift-scale/images via quay - https://quay.io/repository/openshift-scale/scale-ci-workload
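So instead of baking test repos into locally built images, a run can simply pull the prebuilt image from Quay (the tag below is an assumption; a job would likely pin a specific tag):

```sh
# "latest" is an assumption; pin a specific tag for reproducible runs.
podman pull quay.io/openshift-scale/scale-ci-workload:latest
```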

Nits: to make this run, I had to:

  • use an old nightly 4.1.0-0.nightly-2019-05-16-090009

I ran this yesterday using 4.1.0-0.nightly-2019-05-18-050636; the install automation here might have still needed a fix to the machineset since the spec had changed, see #234.

  • change the default(fales, true) -> default(false, true) for enable_pbench_agents.

Good catch.

  • set workload_job_privileged to false

Again thanks for the feedback!

jmencak commented 5 years ago

/cc @ekuric for more eyeballs, as we're likely to live with this in the future

akrzos commented 5 years ago

For continuity, this work has been shifted over to this repo - https://github.com/openshift-scale/workloads to make it easier to git clone and run it.