DeastinY opened this issue 3 years ago
Could you add accelerator="ddp" to the trainer_kwargs? It runs, but does not use both GPUs.
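(For reference, the change in question presumably boils down to trainer_kwargs along these lines. This is a sketch based on the Stallion tutorial, not the exact code from this report; a fuller reconstruction is sketched under "Code to reproduce the problem" at the bottom.)

```python
# Presumed trainer_kwargs under discussion: forward multi-GPU/DDP flags to the
# PyTorch Lightning (1.x) Trainer that optimize_hyperparameters builds for each trial.
trainer_kwargs = dict(
    limit_train_batches=30,  # as in the Stallion tutorial
    gpus=[0, 1],             # both V100s shown in the nvidia-smi output below
    accelerator="ddp",       # PL 1.x flag; newer Lightning versions use strategy="ddp"
)
```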
[I 2021-04-20 16:14:18,058] A new study created in memory with name: no-name-e6dcc64e-75aa-4f8b-8e26-b632835e3df1
GPU available: True, used: True
INFO:lightning:GPU available: True, used: True
TPU available: None, using: 0 TPU cores
INFO:lightning:TPU available: None, using: 0 TPU cores
GPU available: True, used: True
INFO:lightning:GPU available: True, used: True
TPU available: None, using: 0 TPU cores
INFO:lightning:TPU available: None, using: 0 TPU cores
INFO:pytorch_lightning.accelerators.gpu:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Set SLURM handle signals.
INFO:lightning:Set SLURM handle signals.
Finding best initial lr: 100%|██████████| 100/100 [01:04<00:00, 1.55it/s]
[I 2021-04-20 16:15:46,888] Using learning rate of 0.0224
INFO:pytorch_lightning.accelerators.gpu:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
INFO:lightning:initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
INFO:root:Added key: store_based_barrier_key:1 to store for rank: 0
Set SLURM handle signals.
INFO:lightning:Set SLURM handle signals.
[... model info removed to declutter ...]
Epoch 0: 0%| | 1/1520 [00:01<30:08, 1.19s/it, loss=24.1, v_num=0, val_loss=29.60]
INFO:root:Reducer buckets have been rebuilt in this iteration.
Epoch 0: 11%|█▏ | 173/1520 [01:55<15:01, 1.49it/s, loss=11.9, v_num=0, val_loss=29.60, train_loss_step=11.70]
This is the output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:62:00.0 Off | 0 |
| N/A 49C P0 79W / 300W | 2870MiB / 16160MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:89:00.0 Off | 0 |
| N/A 44C P0 41W / 300W | 3MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Strange, does it work when training directly (no hyperparameter tuning) or is PyTorch Lightning also only using one GPU?
I might have the same problem. optimize_hyperparameters() is extremely slow, and the two "threads" (one per GPU) run one after the other instead of in parallel. Something that seems really wrong is that two subprocesses are spawned with the same command line as the original process. I wonder which of the modules does that.
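For what it's worth, PyTorch Lightning's 1.x "ddp" accelerator launches one extra process per additional GPU by re-executing the running script with the same command line and marking the children via the LOCAL_RANK environment variable, which would explain the duplicated subprocesses; that is an assumption about this setup, not a confirmed diagnosis. A minimal sketch for checking which process you are in:

```python
import os
import sys

# Lightning's script-relaunch style of DDP sets LOCAL_RANK for child processes;
# the parent (rank 0) typically has no LOCAL_RANK set. Sketch for debugging only.
local_rank = int(os.environ.get("LOCAL_RANK", 0))

if local_rank == 0:
    print(f"main process, argv: {sys.argv}")
else:
    print(f"DDP child process (local rank {local_rank}), argv: {sys.argv}")
```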
> Strange, does it work when training directly (no hyperparameter tuning) or is PyTorch Lightning also only using one GPU?
Sorry for the delayed response. When training directly, it seems to load data onto one GPU and then do nothing.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:62:00.0 Off | 0 |
| N/A 50C P0 60W / 300W | 1300MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:89:00.0 Off | 0 |
| N/A 46C P0 42W / 300W | 3MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
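For comparison, "training directly" here means something like the following, assuming the tft model and dataloaders from the Stallion tutorial (a sketch using PyTorch Lightning 1.x flags, matching the versions in this thread):

```python
import pytorch_lightning as pl

# Sketch only: tft, train_dataloader and val_dataloader are assumed to come from
# the Stallion tutorial (TimeSeriesDataSet + TemporalFusionTransformer.from_dataset).
trainer = pl.Trainer(
    max_epochs=30,
    gpus=[0, 1],         # both V100s
    accelerator="ddp",   # PL 1.x; newer versions use strategy="ddp"
    gradient_clip_val=0.1,
)
trainer.fit(tft, train_dataloader, val_dataloader)
```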
Do you have the same issue with this example? https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_lightning_simple.py
I wonder if this is a third-party bug. If not, maybe you can spot the difference between the implementations.
Running the examples leads to this issue: https://github.com/optuna/optuna-examples/issues/14
Hi, I'm Kento Nozawa from the Optuna community. The latest Optuna PyTorch Lightning callback can handle distributed training! A minimal example is https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_lightning_ddp.py.
Best,
> Hi, I'm Kento Nozawa from the Optuna community. The latest Optuna PyTorch Lightning callback can handle distributed training! A minimal example is https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_lightning_ddp.py.
In the linked example, DDP spawn is used instead of the typical DDP strategy. Is that absolutely required?
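For anyone comparing, the spawn-based setup roughly follows this pattern (a sketch, not a copy of the linked example; MyLightningModule and the dataloaders are placeholders, and how metrics propagate back from spawned processes varies across Lightning versions):

```python
import optuna
import pytorch_lightning as pl
from optuna.integration import PyTorchLightningPruningCallback


def objective(trial: optuna.Trial) -> float:
    # Each trial samples its own hyperparameters and trains with spawn-based DDP.
    lr = trial.suggest_float("learning_rate", 1e-4, 1e-1, log=True)
    dropout = trial.suggest_float("dropout", 0.1, 0.3)
    model = MyLightningModule(learning_rate=lr, dropout=dropout)  # placeholder module

    trainer = pl.Trainer(
        max_epochs=10,
        gpus=2,
        accelerator="ddp_spawn",  # spawn-based DDP (PL 1.x flag)
        callbacks=[PyTorchLightningPruningCallback(trial, monitor="val_loss")],
    )
    trainer.fit(model, train_dataloader, val_dataloader)  # placeholder dataloaders
    return trainer.callback_metrics["val_loss"].item()


study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
```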
> I might have the same problem. optimize_hyperparameters() is extremely slow, and the two "threads" (one per GPU) run one after the other instead of in parallel. Something that seems really wrong is that two subprocesses are spawned with the same command line as the original process. I wonder which of the modules does that.
Did you manage to solve this issue? I am trying to use this function with DDP over 2 GPUs, but it is very slow and only uses 1 GPU. When I set "ddp" in the trainer_kwargs, it says that each model has different parameters. I tried setting seeds, but this did not help. Any help would be greatly appreciated!
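One possible explanation for the "each model has different parameters" error (an assumption, not verified against this code path) is that under "ddp" every rank executes the Optuna objective independently and samples different hyperparameters, so the replicas are built with different shapes. A rough sketch of forcing all ranks to use rank 0's values follows; newer Optuna releases also provide optuna.integration.TorchDistributedTrial for this kind of synchronization.

```python
import torch.distributed as dist


def synced_params(trial):
    """Sample hyperparameters on rank 0 and broadcast them to the other ranks.

    Rough sketch: assumes torch.distributed has already been initialised by the
    DDP launcher. Without synchronization, each rank may sample different values
    and the resulting model replicas disagree on their parameters.
    """
    if not dist.is_initialized() or dist.get_rank() == 0:
        params = {
            "learning_rate": trial.suggest_float("learning_rate", 1e-4, 1e-1, log=True),
            "hidden_size": trial.suggest_int("hidden_size", 8, 128),
        }
    else:
        params = None
    if dist.is_initialized():
        holder = [params]
        dist.broadcast_object_list(holder, src=0)  # available in torch >= 1.8
        params = holder[0]
    return params
```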
Expected behavior
I'm working through the "Demand forecasting with the Temporal Fusion Transformer" tutorial and trying to run the optimize_hyperparameters part on two GPUs.
Actual behavior
I get some output, but it never finishes. With only a single GPU utilized it finishes within minutes without any issues.
[I 2021-04-13 15:40:26,906] A new study created in memory with name: no-name-e455a085-bb8c-4052-a225-ef363fb68e4c initializing ddp: GLOBAL_RANK: 1, MEMBER: 1/2
Code to reproduce the problem
https://pytorch-forecasting.readthedocs.io/en/latest/tutorials/stallion.html
This works:
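(A hedged reconstruction based on the Stallion tutorial; the exact code from the report may differ. train_dataloader and val_dataloader are the tutorial's dataloaders.)

```python
from pytorch_forecasting.models.temporal_fusion_transformer.tuning import optimize_hyperparameters

# Tutorial-style call; by default only a single GPU is used per trial.
study = optimize_hyperparameters(
    train_dataloader,
    val_dataloader,
    model_path="optuna_test",
    n_trials=200,
    max_epochs=50,
    trainer_kwargs=dict(limit_train_batches=30),
)
```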
Changing this, it doesn't anymore:
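(Again a hedged reconstruction: presumably the same call, but with both GPUs and the DDP accelerator requested through trainer_kwargs.)

```python
# Same call as above, but asking the per-trial Trainer to use both GPUs with DDP.
study = optimize_hyperparameters(
    train_dataloader,
    val_dataloader,
    model_path="optuna_test",
    n_trials=200,
    max_epochs=50,
    trainer_kwargs=dict(limit_train_batches=30, gpus=[0, 1], accelerator="ddp"),
)
```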