Got error when training with transformer(no sc) : _sample() missing 2 required positional arguments: 'fc_feats' and 'att_feats'

PineappleWill commented 3 years ago

Hi @ruotianluo, thanks for your amazing work.

using python tools/train.py --cfg configs/transformer/transformer.yml --id transformer

I got this error while training the transformer model, and it really confused me.

iter 2998 (epoch 0), train_loss = 3.325, time/batch = 0.344
Read data: 0.00010228157043457031
iter 2999 (epoch 0), train_loss = 3.298, time/batch = 0.346
Traceback (most recent call last):
  File "tools/train.py", line 289, in <module>
    train(opt)
  File "tools/train.py", line 242, in train
    val_loss, predictions, lang_stats = eval_utils.eval_split(
  File "/home/ordinary/self-critical.pytorch/captioning/utils/eval_utils.py", line 171, in eval_split
    seq, seq_logprobs = model(fc_feats, att_feats, att_masks, opt=tmp_eval_kwargs, mode='sample')
  File "/home/ordinary/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ordinary/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/ordinary/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/ordinary/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/ordinary/miniconda3/lib/python3.8/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in replica 7 on device 7.
Original Traceback (most recent call last):
  File "/home/ordinary/miniconda3/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/ordinary/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ordinary/self-critical.pytorch/captioning/models/CaptionModel.py", line 33, in forward
    return getattr(self, '_'+mode)(*args, **kwargs)
TypeError: _sample() missing 2 required positional arguments: 'fc_feats' and 'att_feats'

ruotianluo commented 3 years ago

It looks super weird to me. Can you run it under single-gpu setting?

PineappleWill commented 3 years ago

> It looks super weird to me. Can you run it under single-gpu setting?

Solved by running on a single GPU, thank you!
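
For reference, the thread does not say how the single-GPU run was set up. One common way, sketched here as an assumption rather than as the project's documented method, is to hide all but one GPU from the process through the `CUDA_VISIBLE_DEVICES` environment variable before CUDA is initialized. The helper name `restrict_to_single_gpu` is hypothetical.

```python
import os


def restrict_to_single_gpu(device_index: int = 0) -> None:
    """Expose only one GPU to this process.

    Hypothetical helper: it must run before CUDA is initialized
    (in practice, before the training code first touches the GPU).
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device_index)


restrict_to_single_gpu(0)
# Import torch and start training only after this point.
```

The same effect can usually be had from the shell by prefixing the training command with `CUDA_VISIBLE_DEVICES=0`.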

HongkuanZhang commented 2 years ago

Hi @ruotianluo, thanks for your great work.

I had the same issue and worked around it by using a single GPU, but I would like to know how to solve it in a multi-GPU setting. Are there any hints about what causes this problem?

Sorry to bother you when you are busy, and thanks again.

ruotianluo commented 2 years ago

Try the PyTorch Lightning training script, train_pl.py?
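
On the question of the cause: one plausible explanation, not confirmed anywhere in this thread, is the way `torch.nn.DataParallel` distributes a batch. Tensor arguments are split along the batch dimension, which can yield fewer chunks than there are GPUs, while non-tensor keyword arguments such as `opt` and `mode` are copied once per GPU. The shorter list is then padded with empty argument tuples, so the surplus replicas are called with no positional arguments at all. The pure-Python sketch below only imitates that behavior. It does not call PyTorch, and `_sample` here is a hypothetical stand-in for the model method, with a signature assumed from the error message.

```python
import math


def chunk_sizes(batch_size, num_gpus):
    """Sizes of the pieces a torch.chunk-style split yields along the batch dim."""
    per_chunk = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        sizes.append(min(per_chunk, remaining))
        remaining -= per_chunk
    return sizes


def simulate_scatter(batch_size, kwargs, num_gpus):
    """Simplified imitation of how DataParallel pairs args and kwargs per replica."""
    # Tensor positional arguments are split along the batch dimension.
    inputs = [
        (f"fc_feats[{n}]", f"att_feats[{n}]")
        for n in chunk_sizes(batch_size, num_gpus)
    ]
    # Non-tensor keyword arguments are copied once per GPU.
    per_gpu_kwargs = [dict(kwargs) for _ in range(num_gpus)]
    # The shorter list is padded, so surplus replicas get an empty args tuple.
    inputs.extend(() for _ in range(len(per_gpu_kwargs) - len(inputs)))
    return list(zip(inputs, per_gpu_kwargs))


def _sample(fc_feats, att_feats, att_masks=None, opt=None):
    """Hypothetical stand-in for the model's _sample method."""
    return f"sampled from {fc_feats}"


def failing_replicas(batch_size, num_gpus):
    """Return (replica index, error message) for every replica whose call fails."""
    failures = []
    calls = simulate_scatter(batch_size, {"opt": {}}, num_gpus)
    for replica, (args, kwargs) in enumerate(calls):
        try:
            _sample(*args, **kwargs)
        except TypeError as err:
            failures.append((replica, str(err)))
    return failures


# A batch of 10 on 8 GPUs splits into only 5 chunks of 2 samples each.
for replica, message in failing_replicas(batch_size=10, num_gpus=8):
    print(f"replica {replica}: {message}")
```

If this is indeed what happens, the error would show up only for evaluation batches that do not split into one chunk per GPU, for example a short final batch, which would also fit with it appearing at the first validation pass rather than at the first training iteration.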

ruotianluo / self-critical.pytorch, issue #244