A Multi GPU Error. - Githubissues

caryleo commented 4 years ago

Hi! Thank you for your amazing code base. I'm reading the code and trying to retrain the model, in particular on multiple GPUs. Since I can't find any documentation on the multi-GPU setup, I used the usual approach of setting the CUDA_VISIBLE_DEVICES environment variable:

    $ CUDA_VISIBLE_DEVICES="6,7" python train.py --cfg configs/updown.yml --id test_updown

I have already executed the preprocessing scripts and downloaded the updown features. However, this raised an error (reproducible):

    Hugginface transformers not installed; please visit https://github.com/huggingface/transformers
    meshed-memory-transformer not installed; please run `pip install git+https://github.com/ruotianluo/meshed-memory-transformer.git`
    Warning: coco-caption not available
    DataLoader loading json file: data/cocotalk.json
    vocab size is 9487
    DataLoader loading h5 file: data/cocotalk_fc data/cocotalk_att data/cocotalk_box data/cocotalk_label.h5
    max sequence length in data is 16
    read 123287 image features
    assigned 113287 images to split train
    assigned 5000 images to split val
    assigned 5000 images to split test
    Read data: 0.0038635730743408203
    Traceback (most recent call last):
      File "train.py", line 285, in <module>
        train(opt)
      File "train.py", line 178, in train
        model_out = dp_lw_model(fc_feats, att_feats, labels, masks, att_masks, data['gts'], torch.arange(0, len(data['gts'])), sc_flag, struc_flag)
      File "/home/gary_liu/anaconda3/envs/lazarus/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/gary_liu/anaconda3/envs/lazarus/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
        outputs = self.parallel_apply(replicas, inputs, kwargs)
      File "/home/gary_liu/anaconda3/envs/lazarus/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
        return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
      File "/home/gary_liu/anaconda3/envs/lazarus/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
        output.reraise()
      File "/home/gary_liu/anaconda3/envs/lazarus/lib/python3.8/site-packages/torch/_utils.py", line 395, in reraise
        raise self.exc_type(msg)
    StopIteration: Caught StopIteration in replica 0 on device 0.
    Original Traceback (most recent call last):
      File "/home/gary_liu/anaconda3/envs/lazarus/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
        output = module(*input, **kwargs)
      File "/home/gary_liu/anaconda3/envs/lazarus/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/gary_liu/Documents/CODE/self-critical.pytorch/misc/loss_wrapper.py", line 45, in forward
        loss = self.crit(self.model(fc_feats, att_feats, labels, att_masks), labels[..., 1:], masks[..., 1:])
      File "/home/gary_liu/anaconda3/envs/lazarus/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/gary_liu/Documents/CODE/self-critical.pytorch/models/CaptionModel.py", line 32, in forward
        return getattr(self, '_'+mode)(*args, **kwargs)
      File "/home/gary_liu/Documents/CODE/self-critical.pytorch/models/AttModel.py", line 124, in _forward
        state = self.init_hidden(batch_size*seq_per_img)
      File "/home/gary_liu/Documents/CODE/self-critical.pytorch/models/AttModel.py", line 95, in init_hidden
        weight = next(self.parameters())
    StopIteration

This error is only raised when I use CUDA_VISIBLE_DEVICES to expose multiple GPUs; with a single GPU (CUDA_VISIBLE_DEVICES=6) training runs stably. I noticed the warnings at the top of the output and wonder whether the missing modules cause the problem.

Thank you very much!

ruotianluo commented 4 years ago

Don't worry about the warning.

The error is caused by weight = next(self.parameters()). For some reason, self.parameters() gives you an empty iterator; I'm not sure why. For now, you may replace init_hidden with:

    def init_hidden(self, bsz):
        # Build the zero state directly instead of reading self.parameters().
        return (torch.zeros(self.num_layers, bsz, self.rnn_size).cuda(),
                torch.zeros(self.num_layers, bsz, self.rnn_size).cuda())

Not 100% sure if it would work or not.
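
The StopIteration at the bottom of the traceback is plain Python behaviour: next() raises it when an iterator has nothing left to yield. Here is a minimal sketch with no PyTorch involved. `first_or_none` is a hypothetical helper, and the empty list only stands in for a `self.parameters()` call that yields nothing.

```python
def first_or_none(iterable):
    """Return the first item of `iterable`, or None if it is empty."""
    # The second argument to next() is a default that is returned
    # instead of raising StopIteration.
    return next(iter(iterable), None)


# Stand-in (assumption) for a parameters() iterator that yields nothing.
empty_params = iter([])

try:
    next(empty_params)
except StopIteration:
    print("next() on an empty iterator raises StopIteration")

print(first_or_none([]))            # empty case: falls back to None
print(first_or_none(["w0", "w1"]))  # non-empty case: first element
```

Passing a default to next() lets the caller detect the empty case and fall back, rather than failing inside the DataParallel replica.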

caryleo commented 4 years ago

Thank you for your help! After the modification, the code runs stably. I then created a new conda environment with pytorch=1.3.0 (I had been using pytorch=1.5.0), and there the code also runs stably without the modification. So I think the problem comes from the newer PyTorch version.
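
The version dependence is consistent with a change in PyTorch 1.5, where, as far as I know, the replicas created by DataParallel no longer expose their weights through parameters(). One way to avoid depending on either behaviour is to take the device and dtype from an input tensor. The sketch below assumes that approach. `TinyDecoder` and the extra `ref` argument are hypothetical and not part of the repository.

```python
import torch
import torch.nn as nn


class TinyDecoder(nn.Module):
    """Hypothetical stand-in for the captioning model, for illustration only."""

    def __init__(self, num_layers=1, rnn_size=8):
        super().__init__()
        self.num_layers = num_layers
        self.rnn_size = rnn_size
        self.lstm = nn.LSTM(rnn_size, rnn_size, num_layers)

    def init_hidden(self, bsz, ref):
        # Allocate the zero state on the same device and with the same dtype
        # as `ref` (an input tensor), so nothing depends on self.parameters().
        shape = (self.num_layers, bsz, self.rnn_size)
        return ref.new_zeros(shape), ref.new_zeros(shape)


model = TinyDecoder()
feats = torch.randn(4, 8)  # dummy input features (assumption)
h, c = model.init_hidden(feats.size(0), feats)
print(h.shape, h.device == feats.device)
```

Because each replica receives its inputs already placed on its own GPU, the state follows the input to the right device without a hard-coded .cuda() call, and the same code runs on CPU.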

ruotianluo / self-critical.pytorch

A Multi GPU Error. #193