tianweiy / CenterPoint


Error in multi-machine one-GPU training #268

Open TianhaoFu opened 2 years ago

TianhaoFu commented 2 years ago

Hi, when I run torch DDP on 8 A10 GPU machines (each machine has one GPU), I come across the following error:

Traceback (most recent call last):
  File "./tools/train.py", line 172, in <module>
    main()
  File "./tools/train.py", line 167, in main
    logger=logger,
  File "/centerpoint/centerpoint/det3d/torchie/apis/train.py", line 364, in train_detector
    trainer.run(data_loaders, cfg.workflow, cfg.total_epochs, local_rank=cfg.local_rank)
  File "/centerpoint/centerpoint/det3d/torchie/trainer/trainer.py", line 542, in run
    epoch_runner(data_loaders[i], self.epoch, **kwargs)
  File "/centerpoint/centerpoint/det3d/torchie/trainer/trainer.py", line 409, in train
    self.model, data_batch, train_mode=True, **kwargs
  File "/centerpoint/centerpoint/det3d/torchie/trainer/trainer.py", line 367, in batch_processor_inline
    losses = model(example, return_loss=True)
  File "/home/pai/lib/python3.6/site-packages/torch/nn/modules/module.py", line 903, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/pai/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 735, in forward
    assert(self.find_unused_parameters == False)
AssertionError
Killing subprocess 1230

My running command is:

python -m torch.distributed.launch --nnodes=$WORLD_SIZE --node_rank=$RANK --nproc_per_node=1 --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT ./tools/train.py ./configs/nusc/pp/nusc_centerpoint_pp_02voxel_two_pfn_10sweep.py

I have not changed any code, only the running command.

Why does this bug happen? Could you tell me which module's output does not participate in the loss calculation, so I can check the cause of the error? Thanks!
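
One way to narrow this down yourself: run a single training iteration without the DDP wrapper and list every parameter that received no gradient. A minimal sketch with a hypothetical toy model (not the CenterPoint detector):

import torch

# Hypothetical toy model: "unused" stands in for whatever branch of the
# real detector never reaches the loss.
class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.used = torch.nn.Linear(4, 1)
        self.unused = torch.nn.Linear(4, 1)  # never called in forward

    def forward(self, x):
        return self.used(x)

model = Toy()
model(torch.randn(2, 4)).sum().backward()

# Any parameter whose .grad is still None after backward never contributed
# to the loss; these are the parameters DDP complains about.
for name, p in model.named_parameters():
    if p.grad is None:
        print("unused parameter:", name)

Applied to the real model, the printed names point at the module(s) whose outputs never reach the loss.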

tianweiy commented 2 years ago

You can set find_unused_parameters to False.

tianweiy commented 2 years ago

https://github.com/tianweiy/CenterPoint/blob/47d61adacb75ba6ccb49ae666dc257d04e323a2c/det3d/torchie/apis/train.py#L289
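
For context, that line wraps the model in torch.nn.parallel.DistributedDataParallel. A minimal self-contained sketch of the flag being discussed (toy model and a single-process "gloo" group so it runs without a launcher; not the exact CenterPoint code):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# Single-process "gloo" group so the sketch runs standalone on CPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(4, 1)
ddp_model = DistributedDataParallel(
    model,
    find_unused_parameters=False,  # the setting suggested above
)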

TianhaoFu commented 2 years ago

When I set it to False, I came across this error:

Traceback (most recent call last):
  File "./tools/train.py", line 181, in <module>
    main()
  File "./tools/train.py", line 176, in main
    logger=logger,
  File "/centerpoint/centerpoint/det3d/torchie/apis/train.py", line 326, in train_detector
    trainer.run(data_loaders, cfg.workflow, cfg.total_epochs, local_rank=cfg.local_rank)
  File "/centerpoint/centerpoint/det3d/torchie/trainer/trainer.py", line 545, in run
    epoch_runner(data_loaders[i], self.epoch, **kwargs)
  File "/centerpoint/centerpoint//det3d/torchie/trainer/trainer.py", line 412, in train
    self.model, data_batch, train_mode=True, **kwargs
  File "/centerpoint/centerpoint/det3d/torchie/trainer/trainer.py", line 367, in batch_processor_inline
    losses = model(example, return_loss=True)
  File "/home/pai/lib/python3.6/site-packages/torch/nn/modules/module.py", line 903, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/pai/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 714, in forward
    if self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).

@tianweiy
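
This RuntimeError is what DDP raises when the wrapped module really does have parameters that never receive a gradient while find_unused_parameters=False: the reducer keeps waiting for their buckets. A minimal repro sketch (toy module and a single-process "gloo" group, assumed to fail the same way as the multi-node run):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.used = torch.nn.Linear(4, 1)
        self.unused = torch.nn.Linear(4, 1)  # no gradient ever reaches this

    def forward(self, x):
        return self.used(x)

model = DistributedDataParallel(Toy(), find_unused_parameters=False)
for _ in range(2):
    model(torch.randn(2, 4)).sum().backward()
# The second iteration raises "Expected to have finished reduction in the
# prior iteration before starting a new one."

If that is what is happening here, the usual fixes are the ones the error message names: keep find_unused_parameters=True, or make sure every forward output participates in the loss.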

tianweiy commented 2 years ago

Which torch version?

tianweiy commented 2 years ago

I can't find this line in my torch:

File "/home/pai/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 735, in forward assert(self.find_unused_parameters == False)

TianhaoFu commented 2 years ago

My torch version is 1.8 and my CUDA version is 11.0.

Thanks :)

tianweiy commented 2 years ago

I have multiple PhD interviews this week and won't be able to check until next Monday. Maybe you can debug it a bit first.

tianweiy commented 2 years ago

Maybe you need to use a Slurm runner? Multi-machine training is a bit complicated, and the original command may not work. Sorry, I don't have any experience with this. I can ask around if you haven't fixed the issue.