kyzhouhzau / BERT-NER

Use Google's BERT for named entity recognition (CoNLL-2003 as the dataset).
MIT License

CUDA_ERROR_OUT_OF_MEMORY: out of memory; total memory reported: #32

Open ghost opened 5 years ago

ghost commented 5 years ago

I am getting an out-of-memory error when running in prediction mode. I even tried reducing max_seq_length, but I still get the same error.

python BERT_NER.py --task_name="NER" --do_train=False --do_eval=False --do_predict=True --data_dir=NERdata --vocab_file=/home/local/BSSTEST/sandeep.a.bhutani/bucket_copy/uncased_L-12_H-768_A-12/vocab.txt --bert_config_file=/home/local/BSSTEST/sandeep.a.bhutani/bucket_copy/uncased_L-12_H-768_A-12/bert_config.json --init_checkpoint=/home/local/BSSTEST/sandeep.a.bhutani/bucket_copy/uncased_L-12_H-768_A-12/bert_model.ckpt --max_seq_length=16 --train_batch_size=32 --learning_rate=2e-5 --num_train_epochs=3.0 --output_dir=./output/result_dir/


INFO:tensorflow:prediction_loop marked as finished
WARNING:tensorflow:Reraising captured error
Traceback (most recent call last):
  File "BERT_NER.py", line 613, in <module>
    tf.app.run()
  File "/anaconda3/envs/bertenv/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "BERT_NER.py", line 603, in main
    for prediction in result:
  File "/anaconda3/envs/bertenv/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2446, in predict
    rendezvous.raise_errors()
  File "/anaconda3/envs/bertenv/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/error_handling.py", line 128, in raise_errors
    six.reraise(typ, value, traceback)
  File "/anaconda3/envs/bertenv/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/anaconda3/envs/bertenv/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2440, in predict
    yield_single_examples=yield_single_examples):
  File "/anaconda3/envs/bertenv/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 593, in predict
    hooks=all_hooks) as mon_sess:
  File "/anaconda3/envs/bertenv/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 921, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/anaconda3/envs/bertenv/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 643, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/anaconda3/envs/bertenv/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1107, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/anaconda3/envs/bertenv/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1112, in _create_session
    return self._sess_creator.create_session()
  File "/anaconda3/envs/bertenv/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 800, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/anaconda3/envs/bertenv/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 566, in create_session
    init_fn=self._scaffold.init_fn)
  File "/anaconda3/envs/bertenv/lib/python3.6/site-packages/tensorflow/python/training/session_manager.py", line 288, in prepare_session
    config=config)
  File "/anaconda3/envs/bertenv/lib/python3.6/site-packages/tensorflow/python/training/session_manager.py", line 185, in _restore_checkpoint
    sess = session.Session(self._target, graph=self._graph, config=config)
  File "/anaconda3/envs/bertenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1551, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/anaconda3/envs/bertenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 676, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY: out of memory; total memory reported: 16914055168
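
One possible reading of the traceback above, offered as an assumption rather than a confirmed diagnosis: the error is raised from `cuDevicePrimaryCtxRetain` while the session is being created, which is before any model tensors are allocated. That would be consistent with max_seq_length making no difference, and it suggests checking whether another process already holds most of the GPU's memory. The sketch below pulls the reported total out of the error message (roughly 15.75 GiB here) and, if `nvidia-smi` is on the PATH, asks it for used and total memory per GPU. The helper names are hypothetical and not part of BERT_NER.py.

```python
import re
import shutil
import subprocess


def parse_total_memory_bytes(error_text):
    """Return the 'total memory reported' value in bytes, or None if absent."""
    match = re.search(r"total memory reported:\s*(\d+)", error_text)
    return int(match.group(1)) if match else None


def parse_nvidia_smi_csv(csv_text):
    """Parse 'used, total' rows (MiB) into a list of (used_mib, total_mib)."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Ask nvidia-smi for per-GPU memory use; return [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            stdout=subprocess.PIPE,
            universal_newlines=True,
            check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        return []
    return parse_nvidia_smi_csv(result.stdout)


# Last line of the traceback in this issue, shortened to the relevant part.
message = ("CUDA_ERROR_OUT_OF_MEMORY: out of memory; "
           "total memory reported: 16914055168")
total_bytes = parse_total_memory_bytes(message)
print("Device total: %.2f GiB" % (total_bytes / 2 ** 30))
```

Calling `query_gpu_memory()` on the affected machine just before launching the prediction run would show whether the used figure is already close to the total. If it is, the memory is being held by something other than this script.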