Closed yondonfu closed 2 years ago
@cyberj0g Could you provide some suggestions to @JamesWanglf for how to test a local GH action runner similar to what you did for livepeer-ml? We can look into setting up the runner in a hosted environment like CW separately, but I think having the workflow functional locally would be a good first step.
Sure:

```
docker build . -t github-nv-runner
docker run --runtime nvidia -e ORGANIZATION=cyberj0g -e REPOSITORY=livepeer-ml -e ACCESS_TOKEN=XXXXXXXX github-nv-runner:latest
```

You will need to edit the Dockerfile to configure the environment for LPMS (maybe reuse some steps from the Dockerfiles in go-livepeer), and port the steps from the CircleCI workflow to a GH workflow.
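For the porting step, a minimal sketch of what the GH workflow could look like once the runner is registered. The job name, checkout action version, and `make test` command are assumptions, not the actual CircleCI steps:

```yaml
name: GPU tests
on: [push, pull_request]
jobs:
  test:
    # Runs on the self-hosted GPU runner registered by the container above
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v3
      - name: Run tests
        run: make test   # placeholder for the ported CircleCI steps
```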
Thanks for your suggestion, @cyberj0g. I will try to go through those steps.
The workflow to implement the GitHub Actions for the GPU tests would be like this. There were several issues:
**Missing library**

I tried to use this docker image:

```
nvidia/cuda:10.2-cudnn8-devel-ubuntu18.04
```

But libnvcuvid.so is missing in that image, so I can only copy the library from the host. For this, I need to run the following command when I run the nvidia-docker container, to locate it:

```
$ sh -c 'ldconfig -p | grep cuvid'
libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
```
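Instead of copying the file into the image, another option is to bind-mount it from the host at run time. A small helper sketch; the `awk` parsing of the `ldconfig` output and the Ubuntu x86-64 library path are assumptions:

```shell
# Locate libnvcuvid.so.1 via the host's linker cache and print the
# bind-mount flag to pass to `docker run`. Warns if the library is absent.
lib="$(ldconfig -p 2>/dev/null | awk '/libnvcuvid\.so\.1/ {print $NF; exit}')"
if [ -n "$lib" ]; then
  # e.g. -v /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1:...:ro
  echo "-v $lib:$lib:ro"
else
  echo "libnvcuvid.so.1 not found on this host" >&2
fi
```

The printed flag can then be spliced into the `docker run` command above so the container sees the host's driver-matched library.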
**MIME type**

In the docker container, the `.ts` extension is recognized as a TypeScript file. We need to update the MIME database before we run the lpms test. For this, I have added these steps to the Dockerfile:

```
RUN sudo echo '<?xml version="1.0" encoding="UTF-8"?><mime-info xmlns="http://www.freedesktop.org/standards/shared-mime-info"><mime-type type="video/mp2t"><comment>ts</comment><glob pattern="*.ts"/></mime-type></mime-info>'>>/usr/share/mime/packages/custom_mime_type.xml
RUN sudo update-mime-database /usr/share/mime
```
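As a sanity check outside the image, the same XML can be written to a temp directory and validated before baking it into the Dockerfile. This is just a sketch using Python's stdlib XML parser; the temp path is arbitrary:

```shell
# Write the shared-mime-info override to a temp dir and confirm the XML is
# well-formed before adding it to the Dockerfile.
tmpdir="$(mktemp -d)"
cat > "$tmpdir/custom_mime_type.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<mime-info xmlns="http://www.freedesktop.org/standards/shared-mime-info">
  <mime-type type="video/mp2t">
    <comment>ts</comment>
    <glob pattern="*.ts"/>
  </mime-type>
</mime-info>
EOF
python3 -c "import sys, xml.etree.ElementTree as ET; ET.parse(sys.argv[1])" \
  "$tmpdir/custom_mime_type.xml" && echo "XML OK"
```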
**Use host network**

During the lpms test, it will use some ports to send and receive streaming data, i.e. ports 1936, 1937 and 1938. For this, I use the `--network host` option so that the docker container uses the host network when I run nvidia-docker. The whole command is:

```
$ nvidia-docker run -it --runtime=nvidia --gpus all --network host -e ORGANIZATION=livepeer -e REPOSITORY=lpms -e ACCESS_TOKEN=XXXXXXXXX lpms-linux-runner:latest sh -c 'ldconfig -p | grep cuvid'
libnvcuvid.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
```
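Since `--network host` exposes the container's listeners directly on the host, a quick pre-flight sketch to confirm those ports are free before starting the runner. The `/dev/tcp` probe is a bash feature, and the port list is taken from the thread above:

```shell
# Probe each lpms test port on the loopback interface; a successful connect
# means something is already bound there.
status=""
for port in 1936 1937 1938; do
  if (exec 3<>"/dev/tcp/127.0.0.1/$port") 2>/dev/null; then
    status="$status $port:busy"
  else
    status="$status $port:free"
  fi
done
echo "$status"
```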
**Restriction on maximum number of video encoding sessions**

In some cases, there is a restriction on the maximum number of simultaneous NVENC video encoding sessions, imposed by Nvidia on consumer-grade GPUs. To remove this restriction, I am using the NVENC patch.
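Before applying the patch, it can help to confirm the ffmpeg build in the image actually exposes NVENC at all. A small sketch that degrades gracefully when ffmpeg is absent; the encoder name `h264_nvenc` is ffmpeg's standard NVENC H.264 encoder:

```shell
# Report whether this ffmpeg build was compiled with the h264_nvenc encoder.
if ! command -v ffmpeg >/dev/null 2>&1; then
  nvenc_status="ffmpeg not found"
elif ffmpeg -hide_banner -encoders 2>/dev/null | grep -q h264_nvenc; then
  nvenc_status="h264_nvenc available"
else
  nvenc_status="h264_nvenc not compiled into this ffmpeg build"
fi
echo "$nvenc_status"
```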
**TensorFlow compatibility**

TensorFlow 2.6.3 does not work with some older versions of the Nvidia driver and CUDA toolkit, so we need to install appropriate versions of both.
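A quick way to spot a driver/CUDA/TensorFlow mismatch is to print all three versions side by side. A sketch that falls back to a notice when a tool is absent (e.g. outside the GPU host); the `nvidia-smi` query flags are standard but the layout is mine:

```shell
# Print driver / CUDA / TensorFlow versions so incompatibilities are obvious.
driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null || echo 'nvidia-smi not found')"
cuda="$( (nvcc --version 2>/dev/null || echo 'nvcc not found') | tail -n 1)"
tf="$(python3 -c 'import tensorflow as tf; print(tf.__version__)' 2>/dev/null || echo 'tensorflow not importable')"
printf 'driver: %s\ncuda:   %s\ntf:     %s\n' "$driver" "$cuda" "$tf"
```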
Hi @yondonfu @JamesWanglf, I am the creator of Cirun.io, and "GPU Tests" caught my eye.
FWIW, I'll share my two cents. I created a service for problems like this, which is basically running custom machines (including GPUs) in GitHub Actions: https://cirun.io/
It is used in multiple open source projects needing GPU support, like the following:

- https://github.com/pystatgen/sgkit/
- https://github.com/qutip/qutip-cupy
- https://github.com/InsightSoftwareConsortium/ITKVkFFTBackend/blob/master/.cirun.yml

It is fairly simple to set up: all you need is a cloud account (say, AWS) and a simple yaml file describing what kind of machines you need, and Cirun will spin up ephemeral machines on your cloud for GitHub Actions to run on. It's native to the GitHub ecosystem, which means you can see logs and triggers in GitHub's interface itself, just like any GitHub Actions run.
Also, note that Cirun is free for open source projects. (You only pay your cloud provider for machine usage.)
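For a sense of scale, a rough sketch of what such a `.cirun.yml` could look like, modeled on Cirun's public examples. The runner name, instance type, AMI placeholder, and label here are illustrative assumptions, not a verified configuration:

```yaml
# .cirun.yml - hypothetical sketch; adjust cloud, instance type and image
runners:
  - name: gpu-runner
    cloud: aws
    instance_type: g4dn.xlarge   # assumed GPU instance type
    machine_image: ami-xxxxxxxx  # placeholder AMI id
    labels:
      - cirun-gpu                # referenced by `runs-on` in the workflow
```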
It's essential for the test runner to have access to an Nvidia GPU, as we rely on GPU functions more and more. We can follow a very similar approach to livepeer-ml, with a GH Actions self-hosted runner on CoreWeave.