Add workflow to run tests on real hardware

MaisenbacherD commented 3 weeks ago

This PR introduces a GitHub workflow that runs all test cases under the tests directory on real hardware.

In preparation, the existing test cases are fixed, such that the new run-tests workflow completes successfully (see actions on my fork's master branch).

The infrastructure that provides the self-hosted GitHub runner with real nvme devices attached to it are provided by Western Digital Corporation. On a high-level overview, this infrastructure contains multiple storage nodes that form a Kubernetes cluster. Within this Kubernetes cluster, KubeVirt virtual machines can be spawned on demand with different hardware configurations. Those VMs are owned by different users. One user group that is separated in its own namespace is self-hosted GitHub runners.

The overall goal of this infrastructure is to provide device access to open source projects for testing purposes. The first two pioneering candidates to make use of this service are nvme-cli and ZenFS (see https://github.com/westerndigitalcorporation/zenfs/pull/294).

The new self-hosted GitHub runner IaC is to be open-sourced.

If people want to see other nvme devices for testing in GitHub workflows, we are happy to discuss details on how to move forward with integrating new test devices.

Initial requirements for this self-hosted runner:

Share NVMe devices (both NVM and ZNS command sets. Western Digital Ultrastar® SN640 and Western Digital Ultrastar® ZN540 are provided)
Update the VM kernel nightly with Linus' master branch (through cron job)
Security:
- Running in a dedicated KubeVirt VM, one per repository.
- Restricted cluster network access: KubeVirt masquerade interface that only opens necessary ports.
- Privileged code execution within the VM ('gh-runner' user). The VM will be reset after each workflow run.
- Running in the gh-runner namespace with NetworkPolicies where ssh is just allowed from local IPs, https is just allowed to and from external addresses (no access to internal cluster services), access to VM image repository access just from local IPs.
- /etc/hosts.allow and /etc/hosts.deny rules for the gh-runner instances that only allow sshd access from cluster local IPs into the VM.

IMPORTANT: Configuration required by the GitHub repo that uses the self-hosted runner:

A single self-hosted runner instance (Kubevirt VM) is only allowed to serve one repository
Optionally: Prevent a group of people from allowing actions: Repo -> Settings -> Actions -> General -> Actions permissions
Prevent PR's from executing code on the self-hosted runner before getting approved: Repo -> Settings -> Actions -> General -> 'Require approval for all outside collaborators' -> Save
Restrict write access to the repository with the GITHUB_TOKEN: Repo -> Settings -> Actions -> General -> 'Read repository content and packages permissions' -> Save
Preventing GitHub Actions from creating or approving pull requests through the GITHUB_TOKEN: Repo -> Settings -> Actions -> General -> DISABLE 'Allow GitHub Actions to create and approve pull requests' -> Save
In reviews watch out for code injection and secret leaks within workflows (https://docs.github.com/en/actions/security-guides/security-hardening-for-github-actions)

In a GitHub workflow that uses a container on the self-hosted instance, the block device has to be passed into the container:

...
    container:
      image: ghcr.io/igaw/linux-nvme/debian.python:latest
      options: '--device=/dev/nvme0n1:/dev/nvme0n1'
...

Before this PR gets merged we should make sure that the self-hosted runner is added to the repository, which requires a token exchange on a side channel. :)

igaw commented 3 weeks ago

Looks really good already. I think we should start using the json output for the new code and transform the old code afterwards when the CI is up and running. But if you think you would like first to get it running as it is and then change the tests accordingly, that's also fine with me.

MaisenbacherD commented 3 weeks ago

Looks really good already. I think we should start using the json output for the new code and transform the old code afterwards when the CI is up and running. But if you think you would like first to get it running as it is and then change the tests accordingly, that's also fine with me.

Thanks for the review :) I changed the sections you commented on. Letting the tests run on my branch now. I think it makes sense to use a container within the VM instead of running directly in the VM. I will now experiment a bit with that before pushing this update here.

MaisenbacherD commented 3 weeks ago

The checkpatch is failing with the following:

-------------------------------------------------------------
[13/13] Check commit - b75fb594d323ca1f5facf47c9062037969037bb4
-------------------------------------------------------------
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#18:
new file mode 100644

I am not sure if this is applicable since I can't spot a MAINTAINERS file 🤔

igaw commented 3 weeks ago

You can ignore that checkpatch failure. We don't do the MAINTAINER file thing, so any new file added to project will trigger this message.

igaw commented 3 weeks ago

BTW, I've just remembered, that we also have a Python binding for the library which provides the constants. Maybe we could use these in future as well. Just as idea.

igaw commented 3 weeks ago

Now, I have to figure out how to enable it :)

MaisenbacherD commented 3 weeks ago

Thanks for the review and merge :) I will contact you to set up the VM and exchange the config token.

linux-nvme / nvme-cli

Add workflow to run tests on real hardware #2545