Closed Finebouche closed 1 year ago
If you want to use multiple GPUs, please refer to the example trainer script and the WarpDrive-managed multi-GPU training pipeline, which is a single function call here. The script itself supports multiple GPUs, so you can try it first: https://github.com/salesforce/warp-drive/blob/master/warp_drive/training/example_training_script_numba.py#L213
Sorry for the misleading information. In fact, PyTorch Lightning can only manage multiple GPUs for the PyTorch model part, not for the environment step part, and the latter is the key part of WarpDrive. We have multi-GPU managers for both the environment steps and the PyTorch models. In particular, models are synchronized with PyTorch DDP, and a GPU context manager handles the step-function executables running on multiple GPUs. The problem you see arises because, by default, the CPU host always works with the default GPU; some advanced settings are required to split the host into multiple processes, with each process controlling one GPU. In our benchmarks on V100 and A100 GPUs, the speed scales almost linearly with N_GPUs, at least for N_GPUs <= 4.
BTW, since PyTorch Lightning updates its backend very frequently, and those updates sometimes break existing pipelines, I do not really recommend using Lightning for now.
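The "one host process per GPU" arrangement described above can be illustrated with a minimal standard-library sketch. This is not WarpDrive's implementation: the names `worker_env`, `_worker`, and `launch` are hypothetical, and the sketch only shows the generic technique of pinning each child process to one device through the `CUDA_VISIBLE_DEVICES` environment variable before any CUDA library is loaded.

```python
import multiprocessing as mp
import os


def worker_env(rank, base_env=None):
    """Return a copy of the environment in which only GPU `rank` is visible."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(rank)
    return env


def _worker(rank, world_size, queue):
    # Restrict this process to one GPU before any CUDA library is imported.
    # Inside the process, that GPU then appears as device 0.
    os.environ.update(worker_env(rank))
    # A real worker would build its environment and model here (assumption).
    queue.put((rank, world_size, os.environ["CUDA_VISIBLE_DEVICES"]))


def launch(world_size):
    """Start one process per GPU and collect what each worker reports."""
    ctx = mp.get_context("spawn")  # fresh interpreters, no inherited CUDA state
    queue = ctx.Queue()
    procs = [
        ctx.Process(target=_worker, args=(rank, world_size, queue))
        for rank in range(world_size)
    ]
    for proc in procs:
        proc.start()
    results = sorted(queue.get() for _ in procs)
    for proc in procs:
        proc.join()
    return results
```

Because the sketch uses the "spawn" start method, `launch(2)` should be called from under an `if __name__ == "__main__":` guard when run as a script.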
Hi, thanks for the explanation.
I actually tried reusing that distributed function, but without success. First, the scripts
python warp_drive/training/example_training_script_pycuda.py --env tag_continuous
and
python warp_drive/training/example_training_script_numba.py --env tag_continuous
were giving me this error :
RuntimeError: CUDA out of memory. Tried to allocate 2.38 GiB (GPU 0; 10.92 GiB total capacity; 6.73 GiB already allocated; 1.33 GiB free; 8.34 GiB reserved in total by PyTorch)
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
This is odd to me, because I would have expected the scripts to work the same way as calling the Environment Wrapper and Trainer from the notebook.
When using perform_distributed_training from the notebook, I also encountered some other problems. I will try to reproduce the error I had and see if I can fix it, but I am still puzzled by the error I get from the script.
The error you saw is simply an OOM (out of memory) error. You can reduce the number of agents set in the YAML config: https://github.com/salesforce/warp-drive/blob/master/warp_drive/training/run_configs/tag_continuous.yaml
I just tested the continuous demo script and it runs well on 2 A100 GPUs, and you can see that both GPUs have an equal share.
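As a rough sketch of the advice above, the agent counts in a loaded run config can be scaled down before training. The helper `scale_down` is hypothetical, and the key names and values in `run_config` are placeholders for illustration; they are not copied from tag_continuous.yaml, so check the actual file for the real keys.

```python
import copy


def scale_down(config, factor, keys):
    """Return a copy of `config` with the listed (section, key) counts divided by `factor`."""
    scaled = copy.deepcopy(config)
    for section, key in keys:
        # Integer division, but never drop below one agent or environment.
        scaled[section][key] = max(1, scaled[section][key] // factor)
    return scaled


# Placeholder structure and numbers (assumption), standing in for the parsed YAML.
run_config = {
    "env": {"num_taggers": 10, "num_runners": 200},
    "trainer": {"num_envs": 400},
}

smaller = scale_down(
    run_config, 2, [("env", "num_taggers"), ("env", "num_runners")]
)
print(smaller["env"])
```

Halving the counts step by step until the allocation fits is a simple way to find a working size on a GPU with less memory than the one the config was tuned for.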
Hi, sorry for the late reply, and thank you for your previous help! I successfully ran the script on a single GPU using your advice to reduce the memory usage, but I still get a different error when trying to use two GPUs. I get the following stack trace:
We have successfully found 2 GPUs!
Training with 2 GPU(s).
Starting worker process: 1
Starting worker process: 0
ERROR:root:Address already in use
Process NumbaDeviceContextProcessWrapper-1:
Traceback (most recent call last):
File "/project/MARL_env/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/project/warp-drive/warp_drive/training/utils/device_child_process/child_process_base.py", line 81, in run
self._clear_torch_process_group()
File "/project/warp-drive/warp_drive/training/utils/device_child_process/child_process_base.py", line 56, in _clear_torch_process_group
process_group_torch.clear_torch_process_group()
File "/project/warp-drive/warp_drive/training/utils/device_child_process/process_group_torch.py", line 20, in clear_torch_process_group
dist.destroy_process_group()
File "/project/MARL_env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 775, in destroy_process_group
assert pg is not None
AssertionError
Address already in use
PyCUDA gives an additional error:
ERROR:root:Address already in use
Process PyCUDADeviceContextProcessWrapper-1:
Traceback (most recent call last):
File "/project/MARL_env/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/project/warp-drive/warp_drive/training/utils/device_child_process/child_process_base.py", line 81, in run
self._clear_torch_process_group()
File "/project/warp-drive/warp_drive/training/utils/device_child_process/child_process_base.py", line 56, in _clear_torch_process_group
process_group_torch.clear_torch_process_group()
File "/project/warp-drive/warp_drive/training/utils/device_child_process/process_group_torch.py", line 20, in clear_torch_process_group
dist.destroy_process_group()
File "/project/MARL_env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 775, in destroy_process_group
assert pg is not None
AssertionError
-------------------------------------------------------------------
PyCUDA ERROR: The context stack was not empty upon module cleanup.
-------------------------------------------------------------------
A context was still active when the context stack was being
cleaned up. At this point in our execution, CUDA may already
have been deinitialized, so there is no way we can finish
cleanly. The program will be aborted now.
Use Context.pop() to avoid this problem.
-------------------------------------------------------------------
No other process is being launched at the same time, so I am not sure what causes this.
I am trying to investigate the Context.pop() advice by looking at /child_process_base.py,
but I am not sure if that is the right idea.
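The tracebacks above end in `assert pg is not None` inside `destroy_process_group`, which suggests that the cleanup step ran although no process group existed, presumably because initialization had already failed with "Address already in use". A minimal sketch of a guarded cleanup follows. The helper `clear_process_group` is hypothetical and is not WarpDrive's code; it takes the distributed module as an argument, so in practice one would pass `torch.distributed`, whose `is_available()`, `is_initialized()`, and `destroy_process_group()` functions it relies on.

```python
def clear_process_group(dist):
    """Destroy the default process group only if one was actually created.

    `dist` is expected to behave like `torch.distributed`.
    Returns True if a group was destroyed, False if there was nothing to do.
    """
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()
        return True
    # Nothing was initialized, so the original startup error stays visible
    # instead of being followed by an AssertionError from the cleanup.
    return False
```

This does not fix the underlying startup failure; it only keeps the cleanup from raising a second, less informative error.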
I am not quite sure about this error, as it says "ERROR:root:Address already in use". It sounds to me like the previous GPU context has not been cleared; you may want to check whether a zombie process from a previous task is still occupying it. Since I cannot reproduce your error on my end, I cannot give clearer advice here. What does your multi-GPU setup look like?
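One generic way to narrow down an "Address already in use" message is to check whether a TCP port is already bound on the host. With `torch.distributed`'s environment-variable initialization, the rendezvous address comes from `MASTER_ADDR` and `MASTER_PORT`. Which port WarpDrive uses, and whether a port conflict is the cause here, is not confirmed by this thread. The helpers `port_is_free` and `find_free_port` below are hypothetical names for a standard-library sketch.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: another process already holds the port.
            return False
        return True


def find_free_port(host="127.0.0.1"):
    """Ask the OS for an unused port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]
```

If the port used for the rendezvous turns out to be taken, for example by a notebook server or a leftover worker from an earlier run, stopping that process or choosing another port are the usual options.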
Hi, it took me some time. I am using two NVIDIA GeForce GTX 1080 Ti GPUs.
I am running my code inside a Docker image instantiated from https://gitlab.ilabt.imec.be/ilabt/gpu-docker-stacks/-/tree/master/pytorch-notebook, which is a fork of the public Jupyter Docker Stacks with GPU-processing capability, as explained here.
I simply use Docker with the NVIDIA-maintained base image, which is the most robust image I know of so far. It handles every tricky dependency between the drivers and the CUDA libraries. You may need to downgrade the base image depending on your hardware.
FROM nvcr.io/nvidia/pytorch:21.10-py3
RUN pip install pycuda==2022.1
RUN conda install numba==0.54.0
Trying to solve this by the end of the week
Hi, I haven't been able to solve my issue. I did try your recommended Docker image but got the same error, so I am not sure what is wrong here.
Can you show your error messages and environment setup?
Hi, unfortunately I haven't changed much in the configuration, and the errors are exactly the same as last time :-1: so I am not sure what to try from here.
I am using this Docker image, https://gitlab.ilabt.imec.be/ilabt/gpu-docker-stacks/-/blob/master/pytorch-notebook/Dockerfile, and accessing my cluster instance through JupyterHub/JupyterLab. I have tried the code on GeForce GTX 1080 Ti GPUs.
I also tried the image built from the Dockerfile you provided, ran the test example, and got the same result.
Hi, I am not sure what is not working here. I followed the implementation in the PyTorch Lightning tutorial, and I am trying to use this code to run my training on 2 GPUs (NVIDIA GeForce GTX 1080 Ti). My configuration is unchanged from the single-GPU one, except for what I found in the PyTorch Lightning tutorial, and I get the following warnings:
and
I am not sure which parameters I can change to fix these. They are only warnings and the computation runs, but I am not sure whether it is done correctly (in terms of results and speed). The speed does not seem to increase much, so I am afraid the second GPU is not being used, which seems to be confirmed by my GPU utilization metrics.
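To check whether the second GPU is doing any work, per-device utilization can be read from `nvidia-smi` while training runs. The sketch below is a hedged illustration: `parse_gpu_stats`, `idle_gpus`, and `query_gpu_stats` are hypothetical helper names, the 5% idle threshold is an arbitrary assumption, and the sample text in the usage lines is made up for illustration rather than measured on any machine.

```python
import subprocess

# Query index, utilization (%), and used memory (MiB) as bare CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(text):
    """Parse lines such as '0, 97, 8123' into a list of dicts."""
    stats = []
    for line in text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        stats.append({"index": int(index), "util_pct": int(util), "mem_mib": int(mem)})
    return stats


def idle_gpus(stats, util_threshold=5):
    """Return the indices of GPUs whose utilization is below the threshold."""
    return [gpu["index"] for gpu in stats if gpu["util_pct"] < util_threshold]


def query_gpu_stats():
    """Run nvidia-smi once and return the parsed statistics (needs an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)


# Made-up sample output, used only to show the parsing step.
sample = "0, 97, 8123\n1, 0, 3\n"
print(idle_gpus(parse_gpu_stats(sample)))
```

A single reading can be misleading, so sampling `query_gpu_stats()` a few times during a training run gives a more reliable picture of whether both devices are busy.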