kunibald413 opened 7 months ago
Having the same issue with multiple GPUs
There is a DDP config in the configs folder, which sets the DDP strategy flag so that multiple GPUs are used. As for the consequences of unused parameters, I am not well versed in that at the moment.
In /configs/trainer/ddp.yaml, set `strategy: ddp_find_unused_parameters_true`:
```yaml
defaults:
  - default

strategy: ddp_find_unused_parameters_true

accelerator: gpu
devices: [0, 1, 2]
num_nodes: 1
sync_batchnorm: True
```
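For reference, a minimal sketch of roughly what this config amounts to once the Trainer is instantiated (an assumption: Lightning 2.x and the usual lightning-hydra-template pattern of `hydra.utils.instantiate(cfg.trainer)`; this is illustrative, not the repo's actual code):

```python
from lightning.pytorch import Trainer  # or: from pytorch_lightning import Trainer

# Illustrative equivalent of configs/trainer/ddp.yaml above.
trainer = Trainer(
    strategy="ddp_find_unused_parameters_true",  # DDP variant that tolerates parameters without gradients
    accelerator="gpu",
    devices=[0, 1, 2],    # GPU indices, as in the YAML
    num_nodes=1,
    sync_batchnorm=True,  # convert BatchNorm layers to synced BatchNorm across processes
)
```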
In /configs/train.yaml, set `trainer: ddp` in the defaults list:
```yaml
# @package _global_

# specify here default configuration
# order of defaults determines the order in which configs override each other
defaults:
  - _self_
  - data: ljspeech
  - model: pflow
  - callbacks: default
  - logger: tensorboard # set logger here or use command line (e.g. `python train.py logger=tensorboard`)
  - trainer: ddp
  - paths: default
  - extras: default
  - hydra: default
```
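Alternatively, following the pattern already shown in the logger comment above, the same selection should work as a command-line override without editing the file: `python train.py trainer=ddp`.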
Added `devices="auto"` in train.py to utilize multiple GPUs. Training terminates shortly after start with this error:
Adding `strategy='ddp_find_unused_parameters_true'` to the Trainer instantiation fixes it (all GPUs used). The `batch_idx` argument is not used when returning the loss in `def training_step(self, batch: Any, batch_idx: int):` in baselightningmodule.py, but I'm not sure about the consequences/effects on training/quality. Making this issue for visibility.
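For context, the string `"ddp_find_unused_parameters_true"` is shorthand for a DDP strategy with `find_unused_parameters=True`. A minimal sketch of the equivalent explicit form, assuming Lightning 2.x (a standalone illustration, not the repo's code):

```python
from lightning.pytorch import Trainer
from lightning.pytorch.strategies import DDPStrategy

# Equivalent to strategy="ddp_find_unused_parameters_true".
# find_unused_parameters=True lets torch's DistributedDataParallel tolerate
# model parameters that receive no gradient in a given step, at some extra
# cost: DDP traverses the autograd graph each iteration to find unused ones.
trainer = Trainer(
    accelerator="gpu",
    devices="auto",
    strategy=DDPStrategy(find_unused_parameters=True),
)
```

As for the consequences: the PyTorch DDP docs describe `find_unused_parameters=True` as adding per-iteration overhead without changing gradient values, so the expected effect is on speed rather than training quality. Note also that an unused `batch_idx` argument would not by itself trigger the error; DDP only tracks module parameters that receive no gradients.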