microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

CI CUDA job #3402

Closed · StrikerRUS closed this 3 years ago

StrikerRUS commented 3 years ago

Opening a separate issue to discuss enabling the CUDA CI job on demand, since the original PR with initial CUDA support has 400+ comments. Refer to https://github.com/microsoft/LightGBM/pull/3160#issuecomment-659105695.

@guolinke Will linux-gpu-pool be used exclusively for LightGBM (CUDA) CI jobs? Or is this machine used for other purposes as well?

guolinke commented 3 years ago

It is exclusive, feel free to use it.

StrikerRUS commented 3 years ago

I think we can proceed as follows.

  1. Create a separate pipeline for the CUDA job (https://sethreid.co.nz/using-multiple-yaml-build-definitions-azure-devops/).
  2. Mark it as non-required and disable auto-builds (https://docs.microsoft.com/en-us/azure/devops/pipelines/repos/github?view=azure-devops&tabs=yaml#run-pull-request-validation-only-when-authorized-by-your-team).
  3. Set up comment triggers (https://docs.microsoft.com/en-us/azure/devops/pipelines/repos/github?view=azure-devops&tabs=yaml#comment-triggers). Then collaborators will be able to run CUDA builds only when really needed, by commenting something like /azp run cuda-builds.
  4. Use NVIDIA Docker containers, similar to how we are using the Ubuntu 14.04 container for compatibility purposes right now (see the sketch below).
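A minimal sketch of what such a separate pipeline definition might look like, assuming the linux-gpu-pool agent pool mentioned above; the job name, container image tag, and step layout are illustrative, and the non-required/comment-trigger behaviour itself is configured through the settings pages linked above rather than in YAML:

```yaml
# Hypothetical cuda-builds pipeline sketch (job name and image tag are assumptions).
# Automatic CI triggers are disabled; the intent is that the pipeline runs only
# when a collaborator comments "/azp run cuda-builds" on a pull request.
trigger: none

jobs:
  - job: cuda_test
    pool: linux-gpu-pool                            # self-hosted GPU agent pool
    container:
      image: nvidia/cuda:11.0-devel-ubuntu18.04     # assumed NVIDIA CUDA image tag
      options: --gpus all
    steps:
      - script: nvidia-smi                          # sanity check that the GPU is visible
        displayName: Check GPU visibility
```

Running everything inside an NVIDIA container would keep the agent itself minimal, mirroring how the Ubuntu 14.04 compatibility container is used today.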

I've made some progress with this in .vsts-ci.yml of the test branch: https://github.com/microsoft/LightGBM/blob/test/.vsts-ci.yml, but it looks like there are some issues with NVIDIA drivers:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded\\\\n\\\"\"": unknown.

@guolinke Could you please help install NVIDIA drivers on the machine? I'm not sure, but this might help automate the process: https://docs.microsoft.com/en-us/azure/virtual-machines/extensions/hpccompute-gpu-linux.
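In case it helps with automation, here is a hedged sketch of installing that driver extension from a pipeline step via the Azure CLI; the service connection, resource group, and VM name are made-up placeholders, and the extension name/publisher come from the linked article:

```yaml
# Hypothetical one-off step to add the NVIDIA GPU driver extension to the agent VM.
- task: AzureCLI@2
  inputs:
    azureSubscription: 'lightgbm-ci'       # placeholder service connection name
    scriptType: bash
    scriptLocation: inlineScript
    inlineScript: |
      az vm extension set \
        --resource-group lightgbm-ci-rg \
        --vm-name linux-gpu-agent \
        --name NvidiaGpuDriverLinux \
        --publisher Microsoft.HpcCompute
```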

StrikerRUS commented 3 years ago

@guolinke

I think it will be enough to have 1 machine.


guolinke commented 3 years ago

@StrikerRUS These VMs are allocated on the fly; I am not sure whether we can install the driver on them or not.

guolinke commented 3 years ago

I just installed the GPU driver extension.


guolinke commented 3 years ago

Let's set the max workers to 2, in case there are some concurrent jobs.

StrikerRUS commented 3 years ago

Looks like the driver extension didn't help: there is no nvidia-smi utility, which is normally installed with NVIDIA drivers.

Also, I found an experimental option that allows skipping driver installation on the host machine by using driver containers instead.

Alternatively, and as a technology preview, the NVIDIA driver can be deployed through a container. https://github.com/NVIDIA/nvidia-docker/wiki/Frequently-Asked-Questions#how-do-i-install-the-nvidia-driver

Alternatively, the NVIDIA driver can be deployed through a container. https://github.com/NVIDIA/nvidia-docker/wiki#how-do-i-install-the-nvidia-driver

https://github.com/NVIDIA/nvidia-docker/wiki/Driver-containers#ubuntu-1804

Unfortunately, driver containers also require rebooting:

sudo reboot

So I have no idea how to configure CUDA jobs other than renting a normal permanent GPU Azure machine.

guolinke commented 3 years ago

Thanks @StrikerRUS. Maybe we can use self-hosted GitHub Actions runners. I have used them before; they can use a permanent VM for CI jobs. I will try to build one next week.

guolinke commented 3 years ago

Just created a runner.

You can have a try. The driver and Docker are installed. Also, I fixed setup-python according to https://github.com/actions/setup-python#linux.

StrikerRUS commented 3 years ago

Amazing! Just got it to work!

Will read more about GitHub Actions self-hosted runners and get back with new proposals in a few days.

https://github.com/microsoft/LightGBM/runs/1173237869?check_suite_focus=true
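Something along these lines might be a starting point for the workflow on the self-hosted runner; the runner labels, container image tag, and entry script are assumptions rather than the final configuration:

```yaml
# Hypothetical GitHub Actions workflow for the self-hosted GPU runner.
name: CUDA
on:
  push:
    branches: [master]
  pull_request:
jobs:
  cuda-test:
    runs-on: [self-hosted, linux]                  # assumed runner labels
    steps:
      - uses: actions/checkout@v2
      - name: Run CUDA tests inside an NVIDIA container
        run: |
          # assumed image tag and entry script path
          docker run --gpus all --rm -v "$GITHUB_WORKSPACE":/LightGBM \
            nvidia/cuda:11.0-devel-ubuntu18.04 \
            bash /LightGBM/docker-script.sh
```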

StrikerRUS commented 3 years ago

Hmm, it seems that it is possible to use on-demand allocation of new VMs with each trigger: https://github.com/AcademySoftwareFoundation/tac/issues/156 https://github.com/jfpanisset/cloud_gpu_build_agent

Or it would probably be good (at least easier) to have one permanent VM with drivers and Docker installed, and turn it on and off automatically with new builds.

StrikerRUS commented 3 years ago

@guolinke I'm afraid we cannot run the tests on an NVIDIA Tesla M60.

[LightGBM] [Fatal] [CUDA] invalid device function /LightGBM/src/treelearner/cuda_tree_learner.cpp 49

terminate called after throwing an instance of 'std::runtime_error'
  what():  [CUDA] invalid device function /LightGBM/src/treelearner/cuda_tree_learner.cpp 49

/LightGBM/docker-script.sh: line 12:  1861 Aborted                 (core dumped) python /LightGBM/examples/python-guide/simple_example.py

https://en.wikipedia.org/wiki/CUDA#GPUs_supported

(screenshot of the Wikipedia table of CUDA-supported GPUs: the Tesla M60 has compute capability 5.2, below the 6.0 minimum targeted by the build)

https://github.com/microsoft/LightGBM/blob/79d288a32db3b124c39cbe40c1ab0c18647595d1/CMakeLists.txt#L159

I'm adding all architectures from 6.0 onward. 6.0 is needed because of the way atomics are handled. https://github.com/microsoft/LightGBM/pull/3160#discussion_r470572587
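As a possible safeguard, a CI step could check the agent's GPU before running the tests and fail fast on an unsupported card. This is only a sketch; the compute_cap query field assumes a reasonably recent NVIDIA driver:

```yaml
# Hypothetical guard step: fail if the GPU's compute capability is below the
# 6.0 minimum that the CUDA build targets (a Tesla M60 reports 5.2, which is
# what produces the "invalid device function" error above).
- script: |
    cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
    echo "GPU compute capability: $cap"
    awk -v c="$cap" 'BEGIN { exit !(c >= 6.0) }'
  displayName: Verify GPU compute capability >= 6.0
```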

guolinke commented 3 years ago

I see. I will change it to p100 or p40

guolinke commented 3 years ago

Now it is p100

StrikerRUS commented 3 years ago

@guolinke

Now it is p100

Thank you!

Is there anything similar to AWS G4 machines in Azure? It would probably cost less: https://github.com/dmlc/xgboost/issues/4881#issuecomment-534322162 https://github.com/dmlc/xgboost/issues/4921#issuecomment-540244581

guolinke commented 3 years ago

The only other option is the P40, which provides more GPU memory but is slightly slower. The cost is the same, so I chose the P100.

github-actions[bot] commented 1 year ago

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.