gouthamvgk / SuperGlue_training

This repository contains training code for SuperGlue. Training uses the COCO dataset with randomly generated homographies.

Error when training superglue #1

Closed zhanghua7099 closed 3 years ago

zhanghua7099 commented 3 years ago

Hi!

I have four 2080 Ti GPUs and want to use them to train the SuperGlue model. I tried to run the following command:

python3 -m torch.distributed.launch --nproc_per_node=4 train_superglue.py --config_path configs/coco_config.yaml

But I get the following error:

Traceback (most recent call last):
  File "train_superglue.py", line 252, in <module>
    train(config, opt.local_rank)
  File "train_superglue.py", line 165, in train
    total_loss, pos_loss, neg_loss = superglue_model.forward_train(superglue_input)
  File "/home/zhy/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 772, in __getattr__
    type(self).__name__, name))
torch.nn.modules.module.ModuleAttributeError: 'DistributedDataParallel' object has no attribute 'forward_train'

How can I fix this problem?
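For context, torch.nn.parallel.DistributedDataParallel only proxies the wrapped module's forward(); custom methods such as forward_train are not exposed on the wrapper, which is exactly what the traceback reports. Below is a minimal, self-contained sketch (a toy model and a single-process gloo group, purely illustrative, not the repository's code) that reproduces the situation and shows that the underlying model is still reachable through the wrapper's .module attribute:

import os
import torch
import torch.distributed as dist
import torch.nn as nn

# Single-process "world" just so a DistributedDataParallel wrapper can be built.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

class ToyModel(nn.Module):
    """Stand-in for the SuperGlue model with a custom training entry point."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)

    def forward(self, x):
        return self.linear(x)

    def forward_train(self, x):
        # Custom method; the DDP wrapper does not proxy it.
        return self.linear(x).sum()

ddp_model = nn.parallel.DistributedDataParallel(ToyModel())
x = torch.randn(2, 4)

# ddp_model.forward_train(x)  # raises: 'DistributedDataParallel' object has no attribute 'forward_train'
loss = ddp_model.module.forward_train(x)  # the wrapped model is reachable via .module
print(loss.item())

dist.destroy_process_group()

Note that calling through .module bypasses DDP's gradient-synchronization hooks, so it is only a quick way to see where the attribute lives, not a drop-in fix for multi-GPU training.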

zhanghua7099 commented 3 years ago

The single-GPU version runs fine. This error only occurs in the multi-GPU version.

gouthamvgk commented 3 years ago

I wrote the code for distributed training but never tested it. I'll fix it today.
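The actual change committed to the repo isn't shown in this thread, but one common pattern for this class of error (an assumption, not necessarily the fix that was made) is to dispatch the training path inside forward(), so the DDP wrapper can be called directly and its gradient synchronization still fires. The wrapper class and the mode argument below are hypothetical names for illustration:

import torch.nn as nn

class SuperGlueWrapper(nn.Module):
    """Hypothetical wrapper: routes training and inference through forward(),
    so the model works unchanged under DistributedDataParallel."""
    def __init__(self, superglue):
        super().__init__()
        self.superglue = superglue

    def forward(self, data, mode="train"):
        if mode == "train":
            # e.g. returns total_loss, pos_loss, neg_loss
            return self.superglue.forward_train(data)
        return self.superglue(data)

# Usage sketch: wrap first, then always call the DDP object itself.
# model = nn.parallel.DistributedDataParallel(SuperGlueWrapper(superglue_model))
# total_loss, pos_loss, neg_loss = model(superglue_input, mode="train")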

gouthamvgk commented 3 years ago

@zhanghua7099 I have updated the repo. You can pull it and try. Please reply if there are any other errors; since I don't have a multi-GPU system, I can't test it myself.

zhanghua7099 commented 3 years ago

The updated version runs well with multiple GPUs.

Thank you for your excellent work!