kunibald413 opened 7 months ago
Having the same issue with multiple GPUs
There is a DDP config in the configs folder, which sets the DDP strategy flag so that multiple GPUs are used. As for the consequences of unused parameters, I am not well versed in that at the moment.
In /configs/trainer/ddp.yaml, set `strategy: ddp_find_unused_parameters_true`:
```yaml
defaults:
  - default

strategy: ddp_find_unused_parameters_true

accelerator: gpu
devices: [0, 1, 2]
num_nodes: 1
sync_batchnorm: True
```
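For reference, a minimal sketch of roughly what this config amounts to once the Trainer is instantiated (an assumption: Lightning 2.x and the usual lightning-hydra-template pattern of `hydra.utils.instantiate(cfg.trainer)`; this is illustrative, not the repo's actual code):

```python
from lightning.pytorch import Trainer  # or: from pytorch_lightning import Trainer

# Illustrative equivalent of configs/trainer/ddp.yaml above.
trainer = Trainer(
    strategy="ddp_find_unused_parameters_true",  # DDP variant that tolerates parameters without gradients
    accelerator="gpu",
    devices=[0, 1, 2],    # GPU indices, as in the YAML
    num_nodes=1,
    sync_batchnorm=True,  # convert BatchNorm layers to synced BatchNorm across processes
)
```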
In /configs/train.yaml, set `trainer: ddp` in the defaults list:
```yaml
# @package _global_

# specify here default configuration
# order of defaults determines the order in which configs override each other
defaults:
  - _self_
  - data: ljspeech
  - model: pflow
  - callbacks: default
  - logger: tensorboard # set logger here or use command line (e.g. `python train.py logger=tensorboard`)
  - trainer: ddp
  - paths: default
  - extras: default
  - hydra: default
```
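Alternatively, following the pattern already shown in the logger comment above, the same selection should work as a command-line override without editing the file: `python train.py trainer=ddp`.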
Added `devices="auto"` in train.py to utilize multiple GPUs. Training terminates shortly after start with this error:
Adding `strategy='ddp_find_unused_parameters_true'` to the Trainer instantiation fixes it (all GPUs used). The `batch_idx` argument is not used when returning the loss in `def training_step(self, batch: Any, batch_idx: int):` in baselightningmodule.py, but I'm not sure about the consequences/effects on training/quality. Making this issue for visibility.
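For context, the string `"ddp_find_unused_parameters_true"` is shorthand for a DDP strategy with `find_unused_parameters=True`. A minimal sketch of the equivalent explicit form, assuming Lightning 2.x (a standalone illustration, not the repo's code):

```python
from lightning.pytorch import Trainer
from lightning.pytorch.strategies import DDPStrategy

# Equivalent to strategy="ddp_find_unused_parameters_true".
# find_unused_parameters=True lets torch's DistributedDataParallel tolerate
# model parameters that receive no gradient in a given step, at some extra
# cost: DDP traverses the autograd graph each iteration to find unused ones.
trainer = Trainer(
    accelerator="gpu",
    devices="auto",
    strategy=DDPStrategy(find_unused_parameters=True),
)
```

As for the consequences: the PyTorch DDP docs describe `find_unused_parameters=True` as adding per-iteration overhead without changing gradient values, so the expected effect is on speed rather than training quality. Note also that an unused `batch_idx` argument would not by itself trigger the error; DDP only tracks module parameters that receive no gradients.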