nl8590687 / ASRT_SpeechRecognition

A Deep-Learning-Based Chinese Speech Recognition System
https://asrt.ailemon.net
GNU General Public License v3.0

MemoryError: bad allocation after training for a while on a 3090 #218

Open laoyin opened 4 years ago

laoyin commented 4 years ago

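I set up the environment with the latest CUDA 11.1 and changed the code to be compatible with tensorflow 2.5.0-dev20201109. After running for a few hours, training failed with MemoryError: bad allocation.

Environment setup steps: https://zhuanlan.zhihu.com/p/277569990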

Error:

Traceback (most recent call last):
  File "train_mspeech.py", line 53, in ms.TrainModel(datapath, epoch = 50, batch_size = 16, save_step = 500)
  File "D:\ASR_project\asr\SpeechModel251.py", line 187, in TrainModel
    self.TestModel(self.datapath, str_dataset='train', data_count = 4)
  File "D:\ASR_project\asr\SpeechModel251.py", line 250, in TestModel
    pre = self.Predict(data_input, data_input.shape[0] // 8)
  File "D:\ASR_project\asr\SpeechModel251.py", line 326, in Predict
    r1 = r[0][0].eval(session=tf.compat.v1.Session())
  File "D:\ASR_project\asr\venv\lib\site-packages\tensorflow\python\framework\ops.py", line 921, in eval
    return _eval_using_default_session(self, feed_dict, self.graph, session)
  File "D:\ASR_project\asr\venv\lib\site-packages\tensorflow\python\framework\ops.py", line 5515, in _eval_using_default_session
    return session.run(tensors, feed_dict)
  File "D:\ASR_project\asr\venv\lib\site-packages\tensorflow\python\client\session.py", line 968, in run
    run_metadata_ptr)
  File "D:\ASR_project\asr\venv\lib\site-packages\tensorflow\python\client\session.py", line 1191, in _run
    feed_dict_tensor, options, run_metadata)
  File "D:\ASR_project\asr\venv\lib\site-packages\tensorflow\python\client\session.py", line 1369, in _do_run
    run_metadata)
  File "D:\ASR_project\asr\venv\lib\site-packages\tensorflow\python\client\session.py", line 1375, in _do_call
    return fn(*args)
  File "D:\ASR_project\asr\venv\lib\site-packages\tensorflow\python\client\session.py", line 1358, in _run_fn
    self._extend_graph()
  File "D:\ASR_project\asr\venv\lib\site-packages\tensorflow\python\client\session.py", line 1398, in _extend_graph
    tf_session.ExtendSession(self._session)
MemoryError: bad allocation

laoyin commented 4 years ago

Environment setup, including the code changes made for compatibility: https://zhuanlan.zhihu.com/p/277569990

The error appeared after the model checkpoint speech_model251_e_0_step_14000 was saved.

laoyin commented 4 years ago

@nl8590687 Could you give me some pointers?

nl8590687 commented 4 years ago

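It looks like you are using an unstable build of tensorflow, and you have presumably modified the code yourself, so I have no way of knowing where your code goes wrong. It runs without problems for everyone else; just follow the tutorial article I wrote.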

nl8590687 commented 4 years ago

Since it crashes only after training for a while, RAM or GPU memory may be running short: memory usage creeps up as training goes on.
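To check whether memory really does climb over time, one option is to record a number at each checkpoint and look at the slope. The following is only a minimal sketch: `GrowthLog` and `probe` are hypothetical names, not part of ASRT, and the caller supplies whatever measurement is available (the comments list two candidates as assumptions).

```python
from typing import Callable, List, Tuple


class GrowthLog:
    """Records a numeric probe at chosen training steps and reports its growth.

    `probe` is any zero-argument callable returning a number. Possible probes
    (assumptions, pick whichever is available in your setup):
      - process memory: lambda: psutil.Process().memory_info().rss
      - graph size:     lambda: len(tf.compat.v1.get_default_graph().get_operations())
    """

    def __init__(self, probe: Callable[[], float]) -> None:
        self.probe = probe
        self.samples: List[Tuple[int, float]] = []

    def sample(self, step: int) -> float:
        """Read the probe once and remember the value together with the step."""
        value = float(self.probe())
        self.samples.append((step, value))
        return value

    def growth_per_step(self) -> float:
        """Average change of the probe per step between first and last sample."""
        if len(self.samples) < 2:
            return 0.0
        (first_step, first_value) = self.samples[0]
        (last_step, last_value) = self.samples[-1]
        if last_step == first_step:
            return 0.0
        return (last_value - first_value) / (last_step - first_step)
```

Calling `sample(step)` every time a checkpoint is saved (every 500 steps in the run above) and printing `growth_per_step()` would show whether the chosen quantity keeps rising or levels off.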

laoyin commented 4 years ago

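@nl8590687 Thanks. I followed your tutorial, but it does not support the newer CUDA 11, so I had to upgrade tensorflow. On my side I only changed the config and the keras imports, switching to tensorflow.keras.

It now fails after roughly 6-8 hours of running. I'll keep looking for the cause myself.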
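One place to look, going by the traceback: the failing line in `Predict` is `r1 = r[0][0].eval(session=tf.compat.v1.Session())`, which opens a new session on every call, and the error is raised inside `_extend_graph`. A possible cause, not confirmed anywhere in this thread, is that each periodic test adds decoding ops to the graph and leaves another session behind. A sketch of one workaround is to decode the model's per-frame output in plain Python, so that no session is needed for decoding at all. `greedy_ctc_decode`, `frame_probs` and `blank` are hypothetical names; the sketch assumes greedy (best-path) decoding is acceptable and that the caller knows which class index is the CTC blank.

```python
from typing import List, Sequence


def greedy_ctc_decode(frame_probs: Sequence[Sequence[float]], blank: int) -> List[int]:
    """Best-path CTC decoding for a single utterance.

    frame_probs: one row of class probabilities (or logits) per time step.
    blank: index of the CTC blank class (an assumption to verify against the model).
    """
    # Step 1: pick the most likely class in every frame.
    best_path = [
        max(range(len(frame)), key=lambda i: frame[i]) for frame in frame_probs
    ]

    # Step 2: collapse consecutive repeats, then drop blanks.
    decoded: List[int] = []
    previous = None
    for label in best_path:
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded
```

For example, with three classes and class 2 as the blank, frames whose best classes are 1, 1, 2, 0, 1 decode to `[1, 0, 1]`. This swaps in best-path decoding, so results can differ from a beam search.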

nl8590687 commented 4 years ago

There is no need to insist on a version as new as CUDA 11. Software should be stable, not bleeding-edge.

laoyin commented 4 years ago

@nl8590687 I have no choice: the RTX 3090 only works with CUDA 11. I tried other versions and could not get the GPU to work.

nl8590687 commented 4 years ago

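Then the hardware is probably just too new, and the software dependencies built on the new hardware architecture are not mature yet.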

AnthonyLuoyu commented 3 years ago

I have the same problem. I'm using a 1660s, and after a little over an hour of running I also get MemoryError: bad allocation. Did you manage to solve it?

wslbeck commented 1 year ago

Did you manage to solve it?