Closed yondonfu closed 2 years ago
@cyberj0g Could you provide some suggestions to @JamesWanglf for how to test a local GH action runner similar to what you did for livepeer-ml? We can look into setting up the runner in a hosted environment like CW separately, but I think having the workflow functional locally would be a good first step.
Sure:

```
docker build . -t github-nv-runner
docker run --runtime nvidia -e ORGANIZATION=cyberj0g -e REPOSITORY=livepeer-ml -e ACCESS_TOKEN=XXXXXXXX github-nv-runner:latest
```

You will need to edit the Dockerfile to configure the environment for LPMS (maybe reuse some steps from the Dockerfiles in go-livepeer), and port the steps from the CircleCI workflow to a GH workflow.
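For the porting step, a minimal sketch of what the GH workflow could look like once the runner is registered. The job name, checkout action version, and `make test` command are assumptions, not the actual CircleCI steps:

```yaml
name: GPU tests
on: [push, pull_request]
jobs:
  test:
    # Runs on the self-hosted GPU runner registered by the container above
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v3
      - name: Run tests
        run: make test   # placeholder for the ported CircleCI steps
```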
Thanks for your suggestion, @cyberj0g. I will try to go through those steps.
The workflow to implement the GitHub Actions for the GPU tests would be like this. There were several issues:
**Missing library**

I tried to use this docker image:

```
nvidia/cuda:10.2-cudnn8-devel-ubuntu18.04
```

But libnvcuvid.so is missing in that image, so I can only copy the library from the host. For this, I need to run the following command when I run the nvidia-docker container, to locate it:

```
$ sh -c 'ldconfig -p | grep cuvid'
libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
```
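Instead of copying the file into the image, another option is to bind-mount it from the host at run time. A small helper sketch; the `awk` parsing of the `ldconfig` output and the Ubuntu x86-64 library path are assumptions:

```shell
# Locate libnvcuvid.so.1 via the host's linker cache and print the
# bind-mount flag to pass to `docker run`. Warns if the library is absent.
lib="$(ldconfig -p 2>/dev/null | awk '/libnvcuvid\.so\.1/ {print $NF; exit}')"
if [ -n "$lib" ]; then
  # e.g. -v /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1:...:ro
  echo "-v $lib:$lib:ro"
else
  echo "libnvcuvid.so.1 not found on this host" >&2
fi
```

The printed flag can then be spliced into the `docker run` command above so the container sees the host's driver-matched library.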
**MIME type**

In the docker container, the `.ts` extension is recognized as a TypeScript file. We need to update the MIME database before we run the lpms test. For this, I have added these steps to the Dockerfile:

```
RUN sudo echo '<?xml version="1.0" encoding="UTF-8"?><mime-info xmlns="http://www.freedesktop.org/standards/shared-mime-info"><mime-type type="video/mp2t"><comment>ts</comment><glob pattern="*.ts"/></mime-type></mime-info>'>>/usr/share/mime/packages/custom_mime_type.xml
RUN sudo update-mime-database /usr/share/mime
```
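As a sanity check outside the image, the same XML can be written to a temp directory and validated before baking it into the Dockerfile. This is just a sketch using Python's stdlib XML parser; the temp path is arbitrary:

```shell
# Write the shared-mime-info override to a temp dir and confirm the XML is
# well-formed before adding it to the Dockerfile.
tmpdir="$(mktemp -d)"
cat > "$tmpdir/custom_mime_type.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<mime-info xmlns="http://www.freedesktop.org/standards/shared-mime-info">
  <mime-type type="video/mp2t">
    <comment>ts</comment>
    <glob pattern="*.ts"/>
  </mime-type>
</mime-info>
EOF
python3 -c "import sys, xml.etree.ElementTree as ET; ET.parse(sys.argv[1])" \
  "$tmpdir/custom_mime_type.xml" && echo "XML OK"
```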
**Use host network**

During the lpms test, it will use some ports to send and receive streaming data, i.e. ports 1936, 1937 and 1938. For this, I use the `--network host` option so that the docker container uses the host network when I run nvidia-docker. The whole command is:

```
$ nvidia-docker run -it --runtime=nvidia --gpus all --network host -e ORGANIZATION=livepeer -e REPOSITORY=lpms -e ACCESS_TOKEN=XXXXXXXXX lpms-linux-runner:latest sh -c 'ldconfig -p | grep cuvid'
libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
```
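Since `--network host` exposes the container's listeners directly on the host, a quick pre-flight sketch to confirm those ports are free before starting the runner. The `/dev/tcp` probe is a bash feature, and the port list is taken from the thread above:

```shell
# Probe each lpms test port on the loopback interface; a successful connect
# means something is already bound there.
status=""
for port in 1936 1937 1938; do
  if (exec 3<>"/dev/tcp/127.0.0.1/$port") 2>/dev/null; then
    status="$status $port:busy"
  else
    status="$status $port:free"
  fi
done
echo "$status"
```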
**Restriction on maximum number of video encoding sessions**

In some cases, there is a restriction on the maximum number of simultaneous NVENC video encoding sessions, imposed by Nvidia on consumer-grade GPUs. To remove this restriction, I am using the NVENC patch.
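Before applying the patch, it can help to confirm the ffmpeg build in the image actually exposes NVENC at all. A small sketch that degrades gracefully when ffmpeg is absent; the encoder name `h264_nvenc` is ffmpeg's standard NVENC H.264 encoder:

```shell
# Report whether this ffmpeg build was compiled with the h264_nvenc encoder.
if ! command -v ffmpeg >/dev/null 2>&1; then
  nvenc_status="ffmpeg not found"
elif ffmpeg -hide_banner -encoders 2>/dev/null | grep -q h264_nvenc; then
  nvenc_status="h264_nvenc available"
else
  nvenc_status="h264_nvenc not compiled into this ffmpeg build"
fi
echo "$nvenc_status"
```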
**TensorFlow compatibility**

TensorFlow 2.6.3 does not work with some older versions of the Nvidia driver and CUDA toolkit, so we need to install appropriate versions of both.
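A quick way to spot a driver/CUDA/TensorFlow mismatch is to print all three versions side by side. A sketch that falls back to a notice when a tool is absent (e.g. outside the GPU host); the `nvidia-smi` query flags are standard but the layout is mine:

```shell
# Print driver / CUDA / TensorFlow versions so incompatibilities are obvious.
driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null || echo 'nvidia-smi not found')"
cuda="$( (nvcc --version 2>/dev/null || echo 'nvcc not found') | tail -n 1)"
tf="$(python3 -c 'import tensorflow as tf; print(tf.__version__)' 2>/dev/null || echo 'tensorflow not importable')"
printf 'driver: %s\ncuda:   %s\ntf:     %s\n' "$driver" "$cuda" "$tf"
```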
Hi @yondonfu @JamesWanglf, I am the creator of Cirun.io, and "GPU Tests" caught my eye.
FWIW, I'll share my two cents. I created a service for problems like this, which is basically running custom machines (including GPUs) in GitHub Actions: https://cirun.io/
It is used in multiple open source projects needing GPU support, like the following:

- https://github.com/pystatgen/sgkit/
- https://github.com/qutip/qutip-cupy
- https://github.com/InsightSoftwareConsortium/ITKVkFFTBackend/blob/master/.cirun.yml

It is fairly simple to set up: all you need is a cloud account (say, AWS) and a simple yaml file describing what kind of machines you need, and Cirun will spin up ephemeral machines on your cloud for GitHub Actions to run on. It's native to the GitHub ecosystem, which means you can see logs and triggers in GitHub's interface itself, just like any GitHub Actions run.
Also, note that Cirun is free for open source projects. (You only pay your cloud provider for machine usage.)
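For a sense of scale, a rough sketch of what such a `.cirun.yml` could look like, modeled on Cirun's public examples. The runner name, instance type, AMI placeholder, and label here are illustrative assumptions, not a verified configuration:

```yaml
# .cirun.yml - hypothetical sketch; adjust cloud, instance type and image
runners:
  - name: gpu-runner
    cloud: aws
    instance_type: g4dn.xlarge   # assumed GPU instance type
    machine_image: ami-xxxxxxxx  # placeholder AMI id
    labels:
      - cirun-gpu                # referenced by `runs-on` in the workflow
```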
It's essential for the test runner to have access to an Nvidia GPU, as we rely on GPU functions more and more. We can follow a very similar approach to livepeer-ml, with a GH Actions self-hosted runner on CoreWeave.