rokid / ELMo-chinese

Deep contextualized word representations for Chinese

OOM still occurs despite sufficient GPU memory. #5

Closed fooSynaptic closed 5 years ago

fooSynaptic commented 5 years ago


76, in <module>
    main(args)
  File "train_elmo.py", line 66, in main
    train(options, data, n_gpus, tf_save_dir, tf_log_dir)
  File "/data/sde/jiaxin_hu/git_project/ELMo-chinese/bilm/training.py", line 766, in train
    allow_soft_placement=True)) as sess:
  File "/data/sde/jiaxin_hu/git_project/ELMo-chinese/bin/testenv/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1494, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/data/sde/jiaxin_hu/git_project/ELMo-chinese/bin/testenv/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 626, in __init__
    self._session = tf_session.TF_NewSession(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

fooSynaptic commented 5 years ago

Environment: python3, tensorflow-gpu==1.10

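I haven't read the source code closely. The program runs and occupies the memory of all four RTX2080 cards (12G each), but only one card is actually doing any computation.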

guotong1988 commented 5 years ago

Change n_gpus in train_elmo.py to 4.
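
As a rough illustration of the suggestion above, the sketch below derives a GPU count from the CUDA_VISIBLE_DEVICES environment variable, so that the value used for n_gpus matches the number of cards exposed to the process. The helper name count_visible_gpus and its default are hypothetical and not part of this repository; in train_elmo.py itself the fix is simply to set n_gpus to 4.

```python
import os


def count_visible_gpus(env=None, default=1):
    """Count the GPU ids listed in CUDA_VISIBLE_DEVICES (hypothetical helper).

    Falls back to `default` when the variable is unset, because the real
    number of cards cannot be known without querying the driver.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return default
    # An empty string hides every GPU from CUDA, so it counts as zero.
    ids = [part.strip() for part in value.split(",") if part.strip()]
    return len(ids)


if __name__ == "__main__":
    # Assumption: the four cards from this issue are exposed as ids 0-3.
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1,2,3")
    n_gpus = count_visible_gpus()
    print("n_gpus =", n_gpus)
```

Deriving the count this way is only one option; hard-coding the value, as suggested, works just as well when the machine layout is fixed.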

guotong1988 commented 5 years ago

"Failed to create session" seems to be caused by another process having already claimed the same GPU.
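
One way to check for such a process before launching training is to ask nvidia-smi which compute processes currently hold GPU memory. The sketch below is an assumption about how one might do that, not something from this repository: the function names are hypothetical, and it relies on the nvidia-smi command-line tool being installed and on its --query-compute-apps CSV output.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse `pid, process_name, used_memory` CSV rows into dicts."""
    apps = []
    for line in csv_text.splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skip blank or unexpected lines
        apps.append({
            "pid": int(fields[0]),
            # Rejoin the middle fields in case the process name has a comma.
            "name": ",".join(fields[1:-1]),
            "used_memory": fields[-1],
        })
    return apps


def list_gpu_processes():
    """Return the compute processes that nvidia-smi reports as using a GPU."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader"],
        universal_newlines=True,
    )
    return parse_compute_apps(out)


if __name__ == "__main__":
    try:
        for app in list_gpu_processes():
            print(app)
    except (OSError, subprocess.CalledProcessError) as exc:
        # nvidia-smi is missing or failed; nothing can be reported.
        print("could not query nvidia-smi:", exc)
```

If the list is not empty, either wait for those processes to finish or point the training run at free cards through CUDA_VISIBLE_DEVICES.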

fooSynaptic commented 5 years ago

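@guotong1988 Thanks. The training program runs now, but there still seems to be a small problem with GPU memory utilization. I'll post a separate issue about it later.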