Closed · Vincent131499 closed this 4 years ago
@Vincent131499 It doesn't look like a script problem; it's most likely related to your torch environment.
First of all, thanks for this excellent open-source work; it matches my needs exactly. However, when I actually run it I get the error below and I can't figure out why. Any advice would be appreciated. Looking forward to your reply!

```
07/10/2020 16:14:08 - INFO - root - Running training
07/10/2020 16:14:08 - INFO - root -   Num examples = 10748
07/10/2020 16:14:08 - INFO - root -   Num Epochs = 4
07/10/2020 16:14:08 - INFO - root -   Instantaneous batch size per GPU = 24
07/10/2020 16:14:08 - INFO - root -   Total train batch size (w. parallel, distributed & accumulation) = 48
07/10/2020 16:14:08 - INFO - root -   Gradient Accumulation steps = 1
07/10/2020 16:14:08 - INFO - root -   Total optimization steps = 896
Traceback (most recent call last):
  File "run_ner_crf.py", line 497, in <module>
    main()
  File "run_ner_crf.py", line 438, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "run_ner_crf.py", line 132, in train
    outputs = model(**inputs)
  File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
StopIteration: Caught StopIteration in replica 0 on device 0.

Original Traceback (most recent call last):
  File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/stephen-lib/stephen的个人文件夹/my_code/NLP组件研发/细粒度实体识别/BERT-NER-Pytorch/models/bert_for_ner.py", line 58, in forward
    outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
  File "/home/user/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/mnt/stephen-lib/stephen的个人文件夹/my_code/NLP组件研发/细粒度实体识别/BERT-NER-Pytorch/models/transformers/modeling_bert.py", line 606, in forward
    extended_attention_mask = extended_attention_mask.to(dtype=next(self.parameters()).dtype)  # fp16 compatibility
StopIteration
```
I ran into this problem too. After I downgraded my torch version from 1.5 to 1.2 it went away.
@LaVineChan Thanks, I'll test against torch 1.5+ when I get a chance; I mainly use torch 1.4 myself.
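For reference, the `StopIteration` comes from the `next(self.parameters())` call in `modeling_bert.py` line 606: under `nn.DataParallel` on torch 1.5, the replicated modules no longer expose their parameters, so the iterator is empty inside `forward()`. Besides downgrading torch, a workaround is to guard that call with a fallback dtype. A minimal sketch (the helper name `first_param_dtype` is my own, not part of the repo):

```python
def first_param_dtype(module, fallback=None):
    """Return next(module.parameters()).dtype, but tolerate the empty
    parameter iterator that DataParallel replicas expose on torch 1.5,
    which otherwise raises StopIteration inside forward()."""
    try:
        return next(module.parameters()).dtype
    except StopIteration:
        return fallback

# Inside BertModel.forward() in modeling_bert.py, line 606 would become
# something like (torch.float32 as the fallback is an assumption; use
# torch.float16 if you train with fp16):
#
#   dtype = first_param_dtype(self, fallback=torch.float32)
#   extended_attention_mask = extended_attention_mask.to(dtype=dtype)
```

Another option, if you don't want to patch the library code, is to run on a single GPU (`CUDA_VISIBLE_DEVICES=0`) so that `DataParallel` never replicates the module.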