Closed StrikerRUS closed 3 years ago
It is exclusive, feel free to use it.
I think we can go the following way.
/azp run cuda-builds
I've made some progress with this in the .vsts-ci.yml of the test branch: https://github.com/microsoft/LightGBM/blob/test/.vsts-ci.yml, but it looks like there are some issues with NVIDIA drivers:
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded\\\\n\\\"\"": unknown.
@guolinke Could you please help to install NVIDIA drivers on the machine? I'm not sure, but this might help automate the process: https://docs.microsoft.com/en-us/azure/virtual-machines/extensions/hpccompute-gpu-linux.
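For reference, the extension from that link can be attached with the Azure CLI; this is a sketch following the linked docs, where the resource group, VM name, and extension version are placeholders to fill in:

```shell
# Attach the NVIDIA GPU driver extension to an existing Linux VM
# (names in angle brackets are placeholders, not real resources).
az vm extension set \
  --resource-group <resource-group> \
  --vm-name <vm-name> \
  --name NvidiaGpuDriverLinux \
  --publisher Microsoft.HpcCompute \
  --version 1.3
```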
@guolinke
I think it will be enough to have 1 machine.
@StrikerRUS these VMs are allocated on the fly; I am not sure whether we can install the driver on them or not.
I just installed the GPU driver extension.
Let us set the max workers to 2, in case of concurrent jobs.
Looks like the driver extension didn't help: there is no nvidia-smi utility, which is normally installed with NVIDIA drivers.
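The check above can be scripted as a quick sanity step before launching GPU containers; nvidia-smi ships with the driver, so its absence means the install did not take effect (or the machine still needs a reboot):

```shell
# Verify that the NVIDIA driver is actually present on the host.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi
else
  echo "NVIDIA driver not found; GPU jobs will fail"
fi
```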
Also, I found an experimental option that allows using driver containers instead of installing the driver on the host machine:
Alternatively, and as a technology preview, the NVIDIA driver can be deployed through a container. https://github.com/NVIDIA/nvidia-docker/wiki/Frequently-Asked-Questions#how-do-i-install-the-nvidia-driver
https://github.com/NVIDIA/nvidia-docker/wiki/Driver-containers#ubuntu-1804
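The driver-container approach from that page boils down to running a privileged container that loads the kernel modules on the host; a sketch, where the exact image tag must match the host's Ubuntu release and a supported driver version:

```shell
# Run the NVIDIA driver as a container instead of installing it on the host
# (the image tag here is illustrative; pick one published for your distro).
sudo docker run --name nvidia-driver -d --privileged --pid=host \
  -v /run/nvidia:/run/nvidia:shared \
  nvidia/driver:450.80.02-ubuntu18.04
```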
Unfortunately, driver containers also require rebooting:
sudo reboot
So I have no idea how to configure CUDA jobs other than renting a normal permanent GPU Azure machine.
Thanks @StrikerRUS! Maybe we can use self-hosted GitHub Actions runners. I have used them before; they let a permanent VM run CI jobs. I will try to set one up next week.
just created a runner
You can have a try. The driver and Docker are installed. Also, I fixed the setup-python step according to https://github.com/actions/setup-python#linux
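Registering such a runner follows GitHub's documented flow; a sketch, where the release version and <TOKEN> are placeholders taken from the repository's Settings → Actions → Runners page:

```shell
# Download, configure, and start a self-hosted GitHub Actions runner
# (version and token below are placeholders, not real values).
mkdir actions-runner && cd actions-runner
curl -o actions-runner-linux-x64.tar.gz -L \
  https://github.com/actions/runner/releases/download/v2.273.0/actions-runner-linux-x64-2.273.0.tar.gz
tar xzf actions-runner-linux-x64.tar.gz
./config.sh --url https://github.com/microsoft/LightGBM --token <TOKEN>
./run.sh
# Workflows then target it with:  runs-on: [self-hosted, linux]
```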
Amazing! Just got it to work!
Will read more about GitHub Actions self-hosted runners and get back with new proposals in a few days.
https://github.com/microsoft/LightGBM/runs/1173237869?check_suite_focus=true
Hmm, it seems that it is possible to allocate new VMs on demand with each trigger: https://github.com/AcademySoftwareFoundation/tac/issues/156 https://github.com/jfpanisset/cloud_gpu_build_agent
Or probably it will be better (at least easier) to have one permanent VM with drivers and Docker installed, but turn it on and off automatically with new builds.
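The on/off part can be done with plain Azure CLI calls wrapped around the build; a sketch with placeholder resource names, using deallocation so that compute is not billed between runs:

```shell
# Power the permanent GPU VM up for a build, then release it afterwards.
az vm start --resource-group <resource-group> --name <gpu-vm>
# ... run the CUDA CI job against the runner on that VM ...
az vm deallocate --resource-group <resource-group> --name <gpu-vm>  # stops compute billing
```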
@guolinke I'm afraid we cannot run tests with NVIDIA Tesla M60.
[LightGBM] [Fatal] [CUDA] invalid device function /LightGBM/src/treelearner/cuda_tree_learner.cpp 49
terminate called after throwing an instance of 'std::runtime_error'
what(): [CUDA] invalid device function /LightGBM/src/treelearner/cuda_tree_learner.cpp 49
/LightGBM/docker-script.sh: line 12: 1861 Aborted (core dumped) python /LightGBM/examples/python-guide/simple_example.py
https://en.wikipedia.org/wiki/CUDA#GPUs_supported
I'm adding all architectures from 6.0 onward. 6.0 is needed because of the way atomics are handled. https://github.com/microsoft/LightGBM/pull/3160#discussion_r470572587
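The compilation flags implied by "6.0 onward" look roughly like the following sketch (the exact list in the LightGBM build scripts may differ). A Tesla M60 is compute capability 5.2 (Maxwell), below every architecture compiled here, which is why it fails with "invalid device function":

```shell
# Illustrative nvcc invocation targeting only compute capability 6.0+.
nvcc -gencode arch=compute_60,code=sm_60 \
     -gencode arch=compute_61,code=sm_61 \
     -gencode arch=compute_70,code=sm_70 \
     -gencode arch=compute_75,code=sm_75 \
     -c cuda_tree_learner.cu
```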
I see. I will change it to p100 or p40
Now it is p100
@guolinke

> Now it is p100

Thank you!
Is there anything in Azure similar to AWS G4 machines? They would probably cost less: https://github.com/dmlc/xgboost/issues/4881#issuecomment-534322162 https://github.com/dmlc/xgboost/issues/4921#issuecomment-540244581
The only other option is the P40, which provides more GPU memory but is slightly slower. The cost is the same, so I chose the P100.
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
Opening a separate issue to discuss enabling the CUDA CI job on demand, as the original PR with initial CUDA support has 400+ comments. Refer to https://github.com/microsoft/LightGBM/pull/3160#issuecomment-659105695.
@guolinke Will linux-gpu-pool be used exclusively for LightGBM (CUDA) CI jobs? Or is this machine used for other purposes as well?