salesforce / warp-drive

Extremely Fast End-to-End Deep Multi-Agent Reinforcement Learning Framework on a GPU (JMLR 2022)
BSD 3-Clause "New" or "Revised" License

Error trying to use PyTorch with two GPUs #78

Closed Finebouche closed 1 year ago

Finebouche commented 1 year ago

Hi, I'm not sure what is going wrong here. I followed the implementation from the PyTorch Lightning tutorial and am trying to use this code to run my training on 2 GPUs (NVIDIA GeForce GTX 1080 Ti). My configuration is unchanged from the single-GPU setup except for what I found in the PyTorch Lightning tutorial, and I get the following warnings:

/project/MARL_env/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: 
PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. 
Consider increasing the value of the `num_workers` argument` (try 32 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
/project/MARL_env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1609: PossibleUserWarning: 
The number of training batches (1) is smaller than the logging interval Trainer(log_every_n_steps=10). 
Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  rank_zero_warn(

and

/project_ghent/MARL_env/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:232: 
UserWarning: You called `self.log('VF loss coefficient_prey', ...)` in your `training_step` but the value needs to be floating point. Converting it to torch.float32.

I'm not sure which parameters I can change to address these. They are only warnings and the computation does run, but I am not sure whether it runs correctly (results-wise and speed-wise). The speed does not seem to increase much, so I am afraid the second GPU is not being used (which my GPU utilization metrics seem to confirm).
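For reference, the two warnings point at standard Lightning/PyTorch arguments rather than anything WarpDrive-specific; a minimal sketch of the knobs they mention (this is illustrative, not my actual training code):

import torch
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning import Trainer

# Dummy dataset, only to illustrate the two settings the warnings mention.
dataset = TensorDataset(torch.randn(64, 8))

# First warning: give the DataLoader more worker processes.
train_dataloader = DataLoader(dataset, batch_size=16, num_workers=4)

# Second warning: log at least as often as there are batches per epoch.
trainer = Trainer(accelerator="gpu", devices=2, log_every_n_steps=1)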

Emerald01 commented 1 year ago

If you want to spin up multiple GPUs, please refer to the example trainer script and the WarpDrive-managed multi-GPU training pipeline, which comes down to a single function call here: https://github.com/salesforce/warp-drive/blob/master/warp_drive/training/example_training_script_numba.py#L213. This script itself supports multiple GPUs, so you can try it first.

Sorry for the confusion. PyTorch Lightning can only manage multiple GPUs for the PyTorch model part, not for the environment-step part, and the latter is the key to WarpDrive. We have multi-GPU managers for both the environment steps and the PyTorch models: the models are kept in sync with PyTorch DDP, and a GPU context manager runs the step-function executables on multiple GPUs. The problem you see is simply that, by default, the CPU host always works with the default GPU; some additional setup is required to split the host into multiple processes, each controlling one GPU. In our benchmarks on V100 and A100 GPUs, the speed scales almost linearly with the number of GPUs, at least up to 4.
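Conceptually, the pattern is the generic one-process-per-GPU setup sketched below (plain PyTorch for illustration, not WarpDrive's actual implementation):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Each spawned process pins itself to one GPU ...
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    torch.cuda.set_device(rank)
    # ... and joins a process group so DDP can keep the model replicas in sync.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # In WarpDrive, both the environment step kernels and the policy model
    # for this process would live on GPU `rank` at this point.

    dist.destroy_process_group()

if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    mp.spawn(worker, args=(n_gpus,), nprocs=n_gpus)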

By the way, since PyTorch Lightning updates its backend very frequently and sometimes breaks existing pipelines, I would not recommend using Lightning for now.

Finebouche commented 1 year ago

Hi, thanks for the explanation.

I actually tried reusing that distributed function, but without success. First, the scripts

python warp_drive/training/example_training_script_pycuda.py --env tag_continuous

and

python warp_drive/training/example_training_script_numba.py --env tag_continuous

were giving me this error:

RuntimeError: CUDA out of memory. Tried to allocate 2.38 GiB (GPU 0; 10.92 GiB total capacity; 6.73 GiB already allocated; 1.33 GiB free; 8.34 GiB reserved in total by PyTorch) 
If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

This is strange to me, because I would have expected it to work the same way as calling the environment wrapper and Trainer from the notebook.

When using perform_distributed_training from the notebook I also ran into some other problems. I will try to reproduce those errors and see if I can fix them, but I am still puzzled by the error I get from the script.
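I also noticed that the OOM message suggests tuning max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF. If I understand it correctly, that would be something like the snippet below, set before any CUDA allocation happens, though I doubt it helps if the memory is genuinely too small:

import os

# Allocator hint from the OOM message; must be set before the first
# tensor is placed on the GPU.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after the variable is set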

Emerald01 commented 1 year ago

The error you saw is simply an OOM. What you can do is reduce the number of agents set in the YAML config: https://github.com/salesforce/warp-drive/blob/master/warp_drive/training/run_configs/tag_continuous.yaml
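For example, something along these lines before launching training; the key names below are assumptions based on the tag_continuous config, so check the YAML itself:

import yaml

config_path = "warp_drive/training/run_configs/tag_continuous.yaml"
with open(config_path) as f:
    run_config = yaml.safe_load(f)

# Key names are assumptions -- verify them against the actual YAML.
# The point is simply to lower the agent count (and/or the number of
# parallel envs) so everything fits in the 11 GB of a GTX 1080 Ti.
run_config["env"]["num_taggers"] = 5
run_config["env"]["num_runners"] = 50
run_config["trainer"]["num_envs"] = 100

with open(config_path, "w") as f:
    yaml.safe_dump(run_config, f)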

I just tested the tag_continuous demo script and it runs well on 2 A100 GPUs; you can see both GPUs take an equal share of the load.

[Screenshot: GPU utilization showing both A100s equally loaded]

Finebouche commented 1 year ago

Hi, sorry for the late reply, and thank you for your previous help! I successfully ran the script on a single GPU by following your advice to reduce the memory usage, but I still get a different error when trying to use two GPUs. I get the following stack trace:

We have successfully found 2 GPUs!
Training with 2 GPU(s).
Starting worker process: 1 
Starting worker process: 0 
ERROR:root:Address already in use
Process NumbaDeviceContextProcessWrapper-1:
Traceback (most recent call last):
  File "/project/MARL_env/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/project/warp-drive/warp_drive/training/utils/device_child_process/child_process_base.py", line 81, in run
    self._clear_torch_process_group()
  File "/project/warp-drive/warp_drive/training/utils/device_child_process/child_process_base.py", line 56, in _clear_torch_process_group
    process_group_torch.clear_torch_process_group()
  File "/project/warp-drive/warp_drive/training/utils/device_child_process/process_group_torch.py", line 20, in clear_torch_process_group
    dist.destroy_process_group()
  File "/project/MARL_env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 775, in destroy_process_group
    assert pg is not None
AssertionError
Address already in use

PyCUDA gives an additional error:

ERROR:root:Address already in use
Process PyCUDADeviceContextProcessWrapper-1:
Traceback (most recent call last):
  File "/project/MARL_env/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/project/warp-drive/warp_drive/training/utils/device_child_process/child_process_base.py", line 81, in run
    self._clear_torch_process_group()
  File "/project/warp-drive/warp_drive/training/utils/device_child_process/child_process_base.py", line 56, in _clear_torch_process_group
    process_group_torch.clear_torch_process_group()
  File "/project/warp-drive/warp_drive/training/utils/device_child_process/process_group_torch.py", line 20, in clear_torch_process_group
    dist.destroy_process_group()
  File "/project/MARL_env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 775, in destroy_process_group
    assert pg is not None
AssertionError
-------------------------------------------------------------------
PyCUDA ERROR: The context stack was not empty upon module cleanup.
-------------------------------------------------------------------
A context was still active when the context stack was being
cleaned up. At this point in our execution, CUDA may already
have been deinitialized, so there is no way we can finish
cleanly. The program will be aborted now.
Use Context.pop() to avoid this problem.
-------------------------------------------------------------------

No other process is being launched at the same time, so I am not sure where the "Address already in use" comes from.

I am trying to investigate the Context.pop() advice by looking at child_process_base.py, but I am not sure whether that is the right approach.

Emerald01 commented 1 year ago

I am not quite sure about this error, since it says "ERROR:root:Address already in use". It sounds to me like the previous GPU context has not been cleared; you may want to check whether a previous task left any zombie processes behind. Since I cannot reproduce the error on my end, I cannot give clearer advice here. What does your multi-GPU setup look like?
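One generic thing to check (not WarpDrive-specific): "Address already in use" from torch.distributed usually means the rendezvous port (MASTER_PORT, 29500 by default) is still held by a leftover process. A rough sketch of how to detect that, assuming the spawned workers respect MASTER_PORT:

import os
import socket

def port_is_free(port, host="127.0.0.1"):
    # connect_ex returns 0 only if something is already listening on the port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) != 0

default_port = 29500  # torch.distributed's usual default
if not port_is_free(default_port):
    print(f"Port {default_port} is busy; find the stale process "
          f"(e.g. `lsof -i :{default_port}`) and kill it, or pick another port.")
    # Assumption: the spawned workers honor MASTER_PORT; if they do not,
    # killing the stale process is the safer fix.
    os.environ["MASTER_PORT"] = "29501"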

Finebouche commented 1 year ago

Hi, it took me some time. I am using two NVIDIA GeForce GTX 1080 Ti GPUs (screenshot attached).

I am running my code inside a Docker image instantiated from https://gitlab.ilabt.imec.be/ilabt/gpu-docker-stacks/-/tree/master/pytorch-notebook, which is a fork of the public Jupyter Docker Stacks with GPU-processing capability, as explained here.

Emerald01 commented 1 year ago

I simply use Docker with the NVIDIA-maintained base image; it is the most robust image I know of so far. It handles every tricky dependency between drivers and CUDA libraries, and you may downgrade the base image depending on your hardware.

FROM nvcr.io/nvidia/pytorch:21.10-py3

RUN pip install pycuda==2022.1
RUN conda install numba==0.54.0
Finebouche commented 1 year ago

Trying to solve this by the end of the week

Finebouche commented 1 year ago

Hi, I haven't been able to solve my issue, and I am not sure what is going wrong. I did try your recommended Docker image but got the same error.

Emerald01 commented 1 year ago

Can you show your error messages and environment setup?

Finebouche commented 1 year ago

Hi, unfortunately I haven't changed much in the configuration and the errors are exactly the same as last time :-1: So I am not sure what to try from here.

I am using this Docker image, https://gitlab.ilabt.imec.be/ilabt/gpu-docker-stacks/-/blob/master/pytorch-notebook/Dockerfile, and accessing my cluster instance through JupyterHub/JupyterLab. I have tried the code on a GeForce GTX 1080 Ti.

I also tried the image built from the Dockerfile you provided, ran the test example, and got the same result.
