jdb78 / pytorch-forecasting

Time series forecasting with PyTorch
https://pytorch-forecasting.readthedocs.io/
MIT License

Optuna seemingly stuck with multiple GPUs #441

Open DeastinY opened 3 years ago

DeastinY commented 3 years ago

Expected behavior

I'm working through the "Demand forecasting with the Temporal Fusion Transformer" tutorial and am trying to run the optimize_hyperparameters part on two GPUs.

Actual behavior

I get some output, but it never finishes. With only a single GPU, it finishes within minutes without any issues.

[I 2021-04-13 15:40:26,906] A new study created in memory with name: no-name-e455a085-bb8c-4052-a225-ef363fb68e4c
initializing ddp: GLOBAL_RANK: 1, MEMBER: 1/2

Code to reproduce the problem

https://pytorch-forecasting.readthedocs.io/en/latest/tutorials/stallion.html

This works:

study = optimize_hyperparameters(
    train_dataloader,
    val_dataloader,
    model_path="optuna_test",
    n_trials=200,
    max_epochs=50,
    gradient_clip_val_range=(0.01, 1.0),
    hidden_size_range=(8, 128),
    hidden_continuous_size_range=(8, 128),
    attention_head_size_range=(1, 4),
    learning_rate_range=(0.001, 0.1),
    dropout_range=(0.1, 0.3),
    trainer_kwargs=dict(limit_train_batches=30),
    reduce_on_plateau_patience=4,
    use_learning_rate_finder=False,  # use Optuna to find ideal learning rate or use in-built learning rate finder
)

Changing this line, it doesn't work anymore:

    trainer_kwargs=dict(limit_train_batches=30, gpus=2),
jdb78 commented 3 years ago

Could you add accelerator="ddp" to the trainer_kwargs?
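
That is, something along these lines (a sketch of the suggested change; the rest of the call stays as in the snippet above):

    trainer_kwargs=dict(limit_train_batches=30, gpus=2, accelerator="ddp"),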

DeastinY commented 3 years ago

It runs, but does not use both GPUs.

[I 2021-04-20 16:14:18,058] A new study created in memory with name: no-name-e6dcc64e-75aa-4f8b-8e26-b632835e3df1
GPU available: True, used: True
INFO:lightning:GPU available: True, used: True
TPU available: None, using: 0 TPU cores
INFO:lightning:TPU available: None, using: 0 TPU cores
GPU available: True, used: True
INFO:lightning:GPU available: True, used: True
TPU available: None, using: 0 TPU cores
INFO:lightning:TPU available: None, using: 0 TPU cores
INFO:pytorch_lightning.accelerators.gpu:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Set SLURM handle signals.
INFO:lightning:Set SLURM handle signals.
Finding best initial lr: 100%|██████████| 100/100 [01:04<00:00,  1.55it/s]
[I 2021-04-20 16:15:46,888] Using learning rate of 0.0224
INFO:pytorch_lightning.accelerators.gpu:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
INFO:lightning:initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
INFO:root:Added key: store_based_barrier_key:1 to store for rank: 0
Set SLURM handle signals.
INFO:lightning:Set SLURM handle signals.

[... model info removed to declutter ...]

Epoch 0:   0%|          | 1/1520 [00:01<30:08,  1.19s/it, loss=24.1, v_num=0, val_loss=29.60]

INFO:root:Reducer buckets have been rebuilt in this iteration.

Epoch 0:  11%|█▏        | 173/1520 [01:55<15:01,  1.49it/s, loss=11.9, v_num=0, val_loss=29.60, train_loss_step=11.70]

This is the output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:62:00.0 Off |                    0 |
| N/A   49C    P0    79W / 300W |   2870MiB / 16160MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   44C    P0    41W / 300W |      3MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
jdb78 commented 3 years ago

Strange, does it work when training directly (no hyperparameter tuning) or is PyTorch Lightning also only using one GPU?
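
For reference, training directly would look roughly like this (a minimal sketch, assuming the dataset and dataloaders from the Stallion tutorial are already defined; the hyperparameter values are only illustrative):

import pytorch_lightning as pl
from pytorch_forecasting import TemporalFusionTransformer

# build the model from the tutorial's TimeSeriesDataSet
tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate=0.03,
    hidden_size=16,
    attention_head_size=1,
    dropout=0.1,
    hidden_continuous_size=8,
)

# plain Lightning training on both GPUs, no Optuna involved
trainer = pl.Trainer(
    max_epochs=5,
    gpus=2,
    accelerator="ddp",
    limit_train_batches=30,
)
trainer.fit(tft, train_dataloader, val_dataloader)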

jwezel commented 3 years ago

I might have the same problem. optimize_hyperparameters() is extremely slow, and the two "threads" (one per GPU) run one after the other instead of in parallel.

Something that I think is really wrong is that two subprocesses are spawned with the same command line as the original process. I wonder which of the modules does that.

DeastinY commented 3 years ago

Strange, does it work when training directly (no hyperparameter tuning) or is PyTorch Lightning also only using one GPU?

Sorry for the delayed response. When training directly, it seems to load data onto one GPU and then do nothing.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:62:00.0 Off |                    0 |
| N/A   50C    P0    60W / 300W |   1300MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:89:00.0 Off |                    0 |
| N/A   46C    P0    42W / 300W |      3MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
jdb78 commented 3 years ago

Do you have the same issue with the example here? https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_lightning_simple.py I wonder if this is a third-party bug. If not, maybe you can spot the difference between the implementations.

DeastinY commented 3 years ago

Running the examples leads to this issue: https://github.com/optuna/optuna-examples/issues/14

nzw0301 commented 2 years ago

Hi, I'm Kento Nozawa from the Optuna community. The latest version of Optuna's PyTorch Lightning callback can handle distributed training! A minimal example is https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_lightning_ddp.py.
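
For context, a condensed sketch of what the linked example does (assuming optuna with the PyTorch Lightning integration is installed; LightningNet and datamodule below are placeholders for your own LightningModule and LightningDataModule):

import optuna
from optuna.integration import PyTorchLightningPruningCallback
import pytorch_lightning as pl


def objective(trial: optuna.Trial) -> float:
    # sample a hyperparameter per trial
    dropout = trial.suggest_float("dropout", 0.1, 0.5)
    model = LightningNet(dropout)  # placeholder LightningModule

    trainer = pl.Trainer(
        max_epochs=10,
        gpus=2,
        accelerator="ddp_spawn",  # the example uses the spawn variant of DDP; newer Lightning versions spell this strategy="ddp_spawn"
        callbacks=[PyTorchLightningPruningCallback(trial, monitor="val_acc")],
    )
    trainer.fit(model, datamodule=datamodule)  # placeholder LightningDataModule
    return trainer.callback_metrics["val_acc"].item()


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)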

Best,

cody-mar10 commented 1 year ago

Hi, I'm Kento Nozawa from the Optuna community. The latest version of Optuna's PyTorch Lightning callback can handle distributed training! A minimal example is https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_lightning_ddp.py.

In the linked example, DDP spawn is used instead of the typical DDP strategy. Is that absolutely required?

aman1b commented 2 months ago

I might have the same problem. optimize_hyperparameters() is extremely slow, and the two "threads" (one per GPU) run one after the other instead of in parallel.

Something that I think is really wrong is that two subprocesses are spawned with the same command line as the original process. I wonder which of the modules does that.

Did you manage to solve this issue? I am trying to use this function with DDP over 2 GPUs, but it is very slow and only uses 1 GPU. When I use "ddp" in the trainer_kwargs, it says that each model has different parameters. I tried setting seeds but this did not help. Any help would be greatly appreciated!