macanv / BERT-BiLSTM-CRF-NER

Tensorflow solution of NER task Using BiLSTM-CRF model with Google BERT Fine-tuning And private Server services
https://github.com/macanv/BERT-BiLSMT-CRF-NER
4.69k stars 1.26k forks

Found Inf or NaN global norm #100

Open hanyaqian opened 5 years ago

hanyaqian commented 5 years ago

I keep running into "Found Inf or NaN global norm". What should I do?

INFO:tensorflow:Saving checkpoints for 0 into ./output/result_dir/model.ckpt.
2019-04-01 11:26:15.232850: E tensorflow/core/kernels/check_numerics_op.cc:185] abnormal_detected_host @0x7f1ba9624600 = {1, 0} Found Inf or NaN global norm.
INFO:tensorflow:Error recorded from training_loop: Found Inf or NaN global norm. : Tensor had NaN values
   [[node VerifyFinite/CheckNumerics (defined at /disk1/hanyaqian/code/work15_bert_cpr/youdao_cpr/bert/optimization.py:74)  = CheckNumerics[T=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"](global_norm/global_norm)]]

Caused by op u'VerifyFinite/CheckNumerics', defined at:
  File "run_classifier_cpr.py", line 785, in <module>
    tf.app.run()
  File "/disk1/hanyaqian/code/work15_bert_cpr/venv/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "run_classifier_cpr.py", line 712, in main
    estimator.train(input_fn=train_input_fn, max_steps=next_checkpoint)
  File "/disk1/hanyaqian/code/work15_bert_cpr/venv/lib/python2.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2403, in train
    saving_listeners=saving_listeners
  File "/disk1/hanyaqian/code/work15_bert_cpr/venv/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 354, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/disk1/hanyaqian/code/work15_bert_cpr/venv/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/disk1/hanyaqian/code/work15_bert_cpr/venv/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1237, in _train_model_default
    features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
  File "/disk1/hanyaqian/code/work15_bert_cpr/venv/lib/python2.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2195, in _call_model_fn
    features, labels, mode, config)
  File "/disk1/hanyaqian/code/work15_bert_cpr/venv/lib/python2.7/site-packages/tensorflow/python/estimator/estimator.py", line 1195, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)

macanv commented 5 years ago

It looks like you are not running my code directly, and I don't know what you changed, so I can't tell what the problem is.

macanv commented 5 years ago

One more thing: this has not been tested in a Python 2.7 environment.

njusq commented 5 years ago

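I ran into the same problem (though not in the same program). I debugged it with tfdbg, which helps locate NaN values in a program, and eventually found a spot I had overlooked where 0 divided by 0 produced the NaN. Hope this helps.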
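
A minimal sketch of the kind of 0/0 bug described above, assuming it comes from averaging a per-token loss over a padding mask (the comment does not say where the division actually was). The function `masked_mean` and its `eps` parameter are hypothetical names used only for illustration, and the code is plain Python 3 rather than TensorFlow so it runs without any dependencies:

```python
def masked_mean(values, mask, eps=1e-12):
    """Mean of `values` over positions where mask == 1.

    Hypothetical helper: returns 0.0 instead of NaN when the mask is all zeros.
    """
    total = sum(v * m for v, m in zip(values, mask))
    count = sum(mask)
    # Unguarded, total / count is 0/0 for an all-padding example. In
    # TensorFlow float math that silently yields NaN, which then spreads
    # into the loss and every gradient (plain Python would raise instead).
    return total / max(count, eps)


print(masked_mean([0.5, 1.5, 9.0], [1, 1, 0]))  # 1.0
print(masked_mean([0.5, 1.5], [0, 0]))          # 0.0
```

In TensorFlow itself, the same guard is usually written by clamping or offsetting the denominator, or with `tf.math.divide_no_nan` in versions that provide it.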

geibeile commented 4 years ago

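It runs fine for me on TensorFlow 1.9, but on TensorFlow 1.13 it keeps reporting "Found Inf or NaN global norm". Apart from changing file paths I made no other changes to the code, which is very strange. The error points at the line (grads, _) = tf.clip_by_global_norm(grads, clip_norm=1.0). I adjusted learning_rate and it still fails.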
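
To make the failing step concrete, here is a rough pure-Python sketch of what global-norm clipping computes and why a single NaN or Inf gradient trips the check. It follows the documented formula for `tf.clip_by_global_norm` (each tensor is scaled by `clip_norm / max(global_norm, clip_norm)`), but it is not TensorFlow's implementation: gradients are modelled as flat lists of floats, and the choice of `FloatingPointError` and the index report are assumptions added for illustration.

```python
import math


def global_norm(grads):
    """Square root of the sum of squares over every element of every gradient."""
    return math.sqrt(sum(g * g for grad in grads for g in grad))


def clip_by_global_norm(grads, clip_norm=1.0):
    """Scale all gradients by clip_norm / max(global_norm, clip_norm)."""
    norm = global_norm(grads)
    if not math.isfinite(norm):
        # One NaN/Inf element anywhere makes the whole norm non-finite.
        # Reporting which gradients are affected narrows down the search.
        bad = [i for i, grad in enumerate(grads)
               if not all(math.isfinite(g) for g in grad)]
        raise FloatingPointError(
            "Found Inf or NaN global norm; non-finite gradients at %s" % bad)
    scale = clip_norm / max(norm, clip_norm)
    return [[g * scale for g in grad] for grad in grads], norm


clipped, norm = clip_by_global_norm([[3.0], [4.0]], clip_norm=1.0)
print(norm)  # 5.0
```

Because the norm is computed over all gradients at once, the error message only says that something upstream is non-finite. Lowering learning_rate does not help when the NaN is already present in the loss or gradients at step 0, as in the log at the top of this issue.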

HqWu-HITCS commented 4 years ago

Have you solved this problem? While working on a different task, I also got NaN at this step after switching TensorFlow versions.

geibeile commented 4 years ago

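I haven't solved it either. It is caused by the TensorFlow version. I suggest switching to a PyTorch implementation, which has better compatibility. Hope this helps.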

HqWu-HITCS commented 4 years ago

Got it, thanks a lot.