nlpyang / PreSumm

code for EMNLP 2019 paper Text Summarization with Pretrained Encoders
MIT License

Debugging hint for Index Error encountered after supplying the network with Japanese data #185

Open wailoktam opened 4 years ago

wailoktam commented 4 years ago

I am new to PyTorch, although I have some experience debugging TensorFlow and Keras. I wonder how I should start with the following error message:

IndexError                                Traceback (most recent call last)

<ipython-input> in <module>()
    151     elif (args.task == 'ext'):
    152         if (args.mode == 'train'):
--> 153             train_ext(args, device_id)
    154         elif (args.mode == 'validate'):
    155             validate_ext(args, device_id)

/content/train_extractive.py in train_ext(args, device_id)
    225         train_multi_ext(args)
    226     else:
--> 227         train_single_ext(args, device_id)
    228
    229

/content/train_extractive.py in train_single_ext(args, device_id)
    267
    268     trainer = build_trainer(args, device_id, model, optim)
--> 269     trainer.train(train_iter_fct, args.train_steps)

/content/trainer_ext.py in train(self, train_iter_fct, train_steps, valid_iter_fct, valid_steps)
    150                 self._gradient_accumulation(
    151                     true_batchs, normalization, total_stats,
--> 152                     report_stats)
    153
    154                 report_stats = self._maybe_report_training(

/content/trainer_ext.py in _gradient_accumulation(self, true_batchs, normalization, total_stats, report_stats)
    393             mask_cls = batch.mask_cls
    394
--> 395             sent_scores, mask = self.model(src, segs, clss, mask, mask_cls)
    396
    397             loss = self.loss(sent_scores, labels.float())

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

/content/model_builder.py in forward(self, src, segs, clss, mask_src, mask_cls)
    169
    170     def forward(self, src, segs, clss, mask_src, mask_cls):
--> 171         top_vec = self.bert(src, segs, mask_src)
    172         sents_vec = top_vec[torch.arange(top_vec.size(0)).unsqueeze(1), clss]
    173         sents_vec = sents_vec * mask_cls[:, :, None].float()

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

/content/model_builder.py in forward(self, x, segs, mask)
    125     def forward(self, x, segs, mask):
    126         if(self.finetune):
--> 127             top_vec, _ = self.model(x, segs, attention_mask=mask)
    128         else:
    129             self.eval()

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

/usr/local/lib/python3.6/dist-packages/pytorch_transformers/modeling_bert.py in forward(self, input_ids, token_type_ids, attention_mask, position_ids, head_mask)
    705             head_mask = [None] * self.config.num_hidden_layers
    706
--> 707         embedding_output = self.embeddings(input_ids, position_ids=position_ids, token_type_ids=token_type_ids)
    708         encoder_outputs = self.encoder(embedding_output,
    709                                        extended_attention_mask,

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

/usr/local/lib/python3.6/dist-packages/pytorch_transformers/modeling_bert.py in forward(self, input_ids, token_type_ids, position_ids)
    249             token_type_ids = torch.zeros_like(input_ids)
    250
--> 251         words_embeddings = self.word_embeddings(input_ids)
    252         position_embeddings = self.position_embeddings(position_ids)
    253         token_type_embeddings = self.token_type_embeddings(token_type_ids)

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    548             result = self._slow_forward(*input, **kwargs)
    549         else:
--> 550             result = self.forward(*input, **kwargs)
    551         for hook in self._forward_hooks.values():
    552             hook_result = hook(self, input, result)

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/sparse.py in forward(self, input)
    112         return F.embedding(
    113             input, self.weight, self.padding_idx, self.max_norm,
--> 114             self.norm_type, self.scale_grad_by_freq, self.sparse)
    115
    116     def extra_repr(self):

/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
    1722     # remove once script supports set_grad_enabled
    1723     _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 1724     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
    1725
    1726

IndexError: index out of range in self

Many thanks
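
[Editor's note] The last frame is the informative one: torch.nn.Embedding raises exactly this IndexError whenever an input id is greater than or equal to its num_embeddings. A minimal sketch reproducing it (the sizes are illustrative assumptions, matching bert-base-uncased's 30,522-token vocabulary):

import torch
import torch.nn as nn

# An embedding table the size of bert-base-uncased's vocabulary.
emb = nn.Embedding(num_embeddings=30522, embedding_dim=768)
# An id that only exists in a larger vocabulary (e.g. multilingual BERT's).
ids = torch.tensor([[101, 110000, 102]])
try:
    emb(ids)
except IndexError as e:
    print(e)  # index out of range in self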
SebastianVeile commented 4 years ago

Are you using a different bert model? I got an error like this when I used a smaller bert model to train on a dataset that was preprocessed using a larger bert model. (I forgot to change the model before preprocessing)
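
[Editor's note] The size gap described here can be seen directly on the two checkpoints (a sketch assuming the pytorch_transformers package used by PreSumm; both calls download weights on first use):

from pytorch_transformers import BertModel

small = BertModel.from_pretrained('bert-base-uncased')
large = BertModel.from_pretrained('bert-base-multilingual-cased')
# Rows in each word-embedding table:
print(small.embeddings.word_embeddings.num_embeddings)  # 30522
print(large.embeddings.word_embeddings.num_embeddings)  # 119547
# Any id produced by the multilingual tokenizer above 30521 overflows the
# smaller table, giving "index out of range in self".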

wailoktam commented 4 years ago

Thanks a lot for getting back to me promptly. This is what I have in the model builder used during training, following the pull request for Japanese you suggested:

class Bert(nn.Module):
    def __init__(self, large, temp_dir, finetune=False):
        super(Bert, self).__init__()
        if (large):
            self.model = BertModel.from_pretrained('bert-base-multilingual-cased', cache_dir=temp_dir)
        else:
            self.model = BertModel.from_pretrained('bert-base-uncased', cache_dir=temp_dir)
        self.finetune = finetune

This is what I have in the data builder used during preprocessing, again following the pull request you suggested:

class BertData():
    def __init__(self, args):
        self.args = args
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', do_lower_case=True)
        self.sep_token = '[SEP]'
        self.cls_token = '[CLS]'
        self.pad_token = '[PAD]'
        self.tgt_bos = '[unused7]'
        self.tgt_eos = '[unused1]'
        self.tgt_sent_split = '[unused2]'
        self.sep_vid = self.tokenizer.vocab[self.sep_token]
        self.cls_vid = self.tokenizer.vocab[self.cls_token]
        self.pad_vid = self.tokenizer.vocab[self.pad_token]
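
[Editor's note] A quick way to see that this preprocessing produces ids the uncased model cannot index (a sketch; the sample sentence is arbitrary):

from pytorch_transformers import BertTokenizer

# The same tokenizer BertData constructs above.
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', do_lower_case=True)
ids = tokenizer.convert_tokens_to_ids(tokenizer.tokenize('こんにちは、世界。'))
print(max(ids))  # well above 30521, the largest valid id for bert-base-uncased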

This line, printed during training, looks suspicious, as it indicates that the model being used is not the multilingual BERT:

[2020-07-03 10:13:37,788 INFO] loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-pytorch_model.bin from cache at /content/drive/My Drive/PresummJaTemp/aa1ef1aede4482d0dbcd4d52baad8ae300e60902e88fcb0bebdec09afd232066.36ca03ab34a1a5d5fa7bc3d03d55c4fa650fed07220e2eeebc06ce58d0e9a157

Any suggestion would be welcome. I am completely clueless about what to do, apart from reading PyTorch textbooks.
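
[Editor's note] The suspicion can be confirmed programmatically: whichever checkpoint was actually loaded reveals itself through its configured vocabulary size. A one-line sketch, assuming `model` is the BertModel instance built in model_builder.py:

# 30522 -> bert-base-uncased was loaded; 119547 -> bert-base-multilingual-cased
print(model.config.vocab_size)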

SebastianVeile commented 4 years ago

I believe your issue might be that you have preprocessed the data using bert-base-multilingual-cased, but you are trying to train on bert-base-uncased. It all depends on whether you pass the parameter -large true when training. If not, you should change this part of the code:

if (large):
    self.model = BertModel.from_pretrained('bert-base-multilingual-cased', cache_dir=temp_dir)
else:
    self.model = BertModel.from_pretrained('bert-base-uncased', cache_dir=temp_dir)

To this:

if (large):
    self.model = BertModel.from_pretrained('bert-base-multilingual-cased', cache_dir=temp_dir)
else:
    self.model = BertModel.from_pretrained('bert-base-multilingual-cased', cache_dir=temp_dir)

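[Editor's note] More generally, this class of mismatch can be caught before the forward pass with a guard like the following (a hedged sketch; the helper name and call site are assumptions, not part of PreSumm):

def check_vocab_compat(bert_model, src_ids):
    # Fail fast with a clear message if any token id in the batch falls
    # outside the loaded model's embedding table.
    vocab_size = bert_model.embeddings.word_embeddings.num_embeddings
    max_id = int(src_ids.max())
    assert max_id < vocab_size, (
        'token id %d >= vocab size %d: the data was preprocessed with a '
        'different (larger) tokenizer than the model being trained'
        % (max_id, vocab_size))

Calling it on `src` just before the `self.model(...)` call in trainer_ext.py's `_gradient_accumulation` would turn the opaque embedding IndexError into a direct diagnosis.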
wailoktam commented 4 years ago

Thanks a lot. It works like a charm!