lonePatient / BERT-NER-Pytorch

Chinese NER (Named Entity Recognition) using BERT (Softmax, CRF, Span)
MIT License

StopIteration error? #22

Closed Vincent131499 closed 4 years ago

Vincent131499 commented 4 years ago

First of all, thanks for this excellent open-source work; it matches my needs exactly. However, when I run it I get the following error and I can't figure out what is going on. Could you please advise? Looking forward to your reply!

07/10/2020 16:14:08 - INFO - root - Running training
07/10/2020 16:14:08 - INFO - root - Num examples = 10748
07/10/2020 16:14:08 - INFO - root - Num Epochs = 4
07/10/2020 16:14:08 - INFO - root - Instantaneous batch size per GPU = 24
07/10/2020 16:14:08 - INFO - root - Total train batch size (w. parallel, distributed & accumulation) = 48
07/10/2020 16:14:08 - INFO - root - Gradient Accumulation steps = 1
07/10/2020 16:14:08 - INFO - root - Total optimization steps = 896
Traceback (most recent call last):
  File "run_ner_crf.py", line 497, in <module>
    main()
  File "run_ner_crf.py", line 438, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "run_ner_crf.py", line 132, in train
    outputs = model(**inputs)
  File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/stephen-lib/stephen的个人文件夹/my_code/NLP组件研发/细粒度实体识别/BERT-NER-Pytorch/models/bert_for_ner.py", line 58, in forward
    outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
  File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/stephen-lib/stephen的个人文件夹/my_code/NLP组件研发/细粒度实体识别/BERT-NER-Pytorch/models/transformers/modeling_bert.py", line 606, in forward
    extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility
StopIteration

lonePatient commented 4 years ago

@Vincent131499 This doesn't look like a problem with the script; it's probably something in your torch environment.

LaVineChan commented 4 years ago


I ran into this problem too. Downgrading my torch version from 1.5 to 1.2 made it go away.
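
For what it's worth, the failing line in the traceback, `next(self.parameters()).dtype`, raises StopIteration because under torch 1.5 a DataParallel replica no longer exposes its parameters through `parameters()`, so the iterator is empty. A minimal, torch-free sketch of the failure mode and a guarded alternative (the helper name `safe_first` is hypothetical, not part of this repo):

```python
def safe_first(iterable, default=None):
    # A bare next() on an exhausted iterator raises StopIteration --
    # this is exactly the crash in the traceback above. Passing a
    # default to next() returns it instead of raising.
    return next(iter(iterable), default)

# An empty parameter list is what a torch 1.5 DataParallel replica
# effectively presents to next(self.parameters()).
params = []
dtype_name = safe_first(params, default="float32")
print(dtype_name)  # -> float32
```

As far as I know, newer versions of huggingface/transformers apply essentially this guard (a try/except StopIteration fallback when looking up the model dtype), which is why upgrading transformers or downgrading torch below 1.5, as suggested above, both avoid the crash.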

lonePatient commented 4 years ago

@LaVineChan Thanks. I'll test with torch 1.5+ later; I mainly use torch 1.4 myself.