nl8590687 / ASRT_SpeechRecognition

A Deep-Learning-Based Chinese Speech Recognition System
https://asrt.ailemon.net
GNU General Public License v3.0

Error with CUDA_ERROR_ILLEGAL_ADDRESS #313

Open hgneng opened 1 year ago

hgneng commented 1 year ago

I have successfully run training on Ubuntu 22.04 without a GPU.

However, it fails on platform.virtaicloud: training aborts with CUDA_ERROR_ILLEGAL_ADDRESS.

# python3 train_speech_model.py 
2023-02-13 09:18:21.593221: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-13 09:18:22.312066: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2023-02-13 09:18:22.312218: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 22331 MB memory:  -> device: 0, name: B1.gpu.large, pci bus id: 0000:ff:1e.0, compute capability: 8.6
[ASRT] Compiles Model Successfully.
[ASRT Training] train epoch 1/50 .
/gemini/code/speech_model.py:120: UserWarning: `Model.fit_generator` is deprecated and will be removed in a future version. Please use `Model.fit`, which supports generators.
  self.trained_model.fit_generator(yielddatas, num_iterate, callbacks=call_back)
2023-02-13 09:18:27.142189: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2023-02-13 09:18:27.142384: F tensorflow/core/common_runtime/device/device_event_mgr.cc:221] Unexpected Event status: 1
Aborted (core dumped)

ASRT

What should I do? Anyone who has an idea can use the public environment mirror above to debug.

hgneng commented 1 year ago

Python version: 3.8.10
TensorFlow version: 2.8.4
CUDA version: cuda_11.2.r11.2/compiler.29618528_0

A related issue: https://github.com/tensorflow/tensorflow/issues/50735. I have tried CUDA_LAUNCH_BLOCKING=1, but with no luck.
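For anyone reproducing this, one way to apply `CUDA_LAUNCH_BLOCKING=1` is to start the training script in a child process with the variable set, and optionally hide the GPU to check whether the crash is GPU-specific. This is a minimal sketch, assuming the script is started from the repository root; `build_debug_env` and `run_training` are hypothetical helper names, not part of ASRT.

```python
import os
import subprocess
import sys


def build_debug_env(base_env=None, force_cpu=False):
    """Return a copy of the environment with CUDA debugging variables set."""
    env = dict(os.environ if base_env is None else base_env)
    # Make CUDA kernel launches synchronous, which usually makes the error
    # surface closer to the operation that triggered it.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    if force_cpu:
        # Hide all GPUs from CUDA so TensorFlow falls back to the CPU.
        env["CUDA_VISIBLE_DEVICES"] = "-1"
    return env


def run_training(script="train_speech_model.py", force_cpu=False):
    """Run the training script in a child process and return its exit code."""
    result = subprocess.run(
        [sys.executable, script],
        env=build_debug_env(force_cpu=force_cpu),
    )
    return result.returncode


# Example (not executed here):
#   run_training()                # GPU run with synchronous launches
#   run_training(force_cpu=True)  # same script with the GPU hidden
```

If the run with `force_cpu=True` completes while the GPU run still aborts, that points toward the GPU stack (driver, CUDA libraries, or the platform's GPU virtualization) rather than the training code itself.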

nl8590687 commented 1 year ago

What GPU and CPU hardware does this platform use? It looks like there is either too little GPU or CPU memory, or a problem with the hardware itself.

hgneng commented 1 year ago

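The hardware configuration is below; in theory it should be enough. Also, the crash happens right at the start of the run.

Instance type: B1.large. GPU: 1 GPU(s), 24 GB of memory per GPU. CPU: 8 core(s), 24 GB of RAM.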

nl8590687 commented 1 year ago

Can you choose the TensorFlow version yourself? Try configuring a different version.

hgneng commented 1 year ago

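I switched to a TensorFlow 2.8.0 image and got the same result. I then noticed that the image I had been using before was TensorFlow 2.10.1, yet in both environments the command below reports version 2.8.4. I suspect I am not using the platform correctly...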

python3 -c 'import tensorflow as tf; print(tf.__version__)'
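When an image ships one TensorFlow build but a different version is reported, it can help to see which installation Python would actually load. The following is a rough sketch using only the standard library; `describe_package` is a hypothetical helper, and it assumes the distribution name and the import name are both `tensorflow`, which may not hold for every build.

```python
import importlib.util
from importlib import metadata


def describe_package(dist_name, module_name=None):
    """Return (installed version, file location) without importing the package."""
    module_name = module_name or dist_name
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None  # no installed distribution with this name
    spec = importlib.util.find_spec(module_name)
    origin = spec.origin if spec is not None else None
    return version, origin


# Example (not executed here):
#   print(describe_package("tensorflow"))
```

A location under a user or pip-managed `site-packages` directory, rather than wherever the image keeps its own build, would suggest that a pip-installed copy is taking precedence over the one provided by the image.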

hgneng commented 1 year ago

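I figured out the cause of the problem above: the TensorFlow 2.8.4 version requirement is written in requirements.txt, so I need to change that file. I do find it a bit odd, though, that the version requirements in requirements.txt are so strict. Every entry is pinned to one exact version, rather than listing only the few main packages and letting the rest be installed as dependencies. As it stands, changing the TensorFlow version makes other dependencies fail, and I have to comment out several lines to get the install through.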
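To see which exact pins disagree with what an environment already has, the pinned lines can be compared against the installed distributions. This is a minimal sketch, assuming the file uses plain `name==version` lines; `parse_pins`, `installed_version`, and `find_mismatches` are hypothetical helpers, and entries with extras or other specifiers are simply skipped.

```python
import re
from importlib import metadata

# Matches lines such as "tensorflow==2.8.4" (optionally followed by a comment).
PIN_RE = re.compile(r"^\s*([A-Za-z0-9_.\-]+)\s*==\s*([^\s;#]+)")


def parse_pins(lines):
    """Extract {package: version} from 'name==version' requirement lines."""
    pins = {}
    for line in lines:
        match = PIN_RE.match(line)
        if match:  # comments, blank lines and unpinned entries are skipped
            pins[match.group(1).lower()] = match.group(2)
    return pins


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {package: (pinned, installed)} where the two differ."""
    mismatches = {}
    for name, pinned in pins.items():
        found = lookup(name)
        if found != pinned:
            mismatches[name] = (pinned, found)
    return mismatches


# Example (not executed here):
#   with open("requirements.txt") as f:
#       print(find_mismatches(parse_pins(f)))
```

The comparison is a plain string match, so it reports any difference, including packages that are not installed at all (shown with `None` as the installed version).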

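After changing the TensorFlow version, running train still reports the CUDA_ERROR_ILLEGAL_ADDRESS error. Checking the version also printed some errors this time. I will ask the platform community; perhaps the way I installed TensorFlow is wrong.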

# python3 -c 'import tensorflow as tf; print(tf.__version__)'
2023-02-17 09:58:20.598049: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-17 09:58:20.701675: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-02-17 09:58:22.057285: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-02-17 09:58:23.898986: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/orion:/usr/lib64:/usr/lib:/usr/lib/orion
2023-02-17 09:58:24.479348: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/orion:/usr/lib64:/usr/lib:/usr/lib/orion
2023-02-17 09:58:24.479392: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2.10.1

hgneng commented 1 year ago

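I tried the TensorFlow 2.8.0, 2.9.3, and 2.10.1 builds provided by the images (not installed through pip), and all of them report the CUDA_ERROR_ILLEGAL_ADDRESS error. I am out of ideas for now.

One note: with TensorFlow already provided by the image, I only installed matplotlib and scipy separately through pip and did not install requirements.txt. My impression is that the full list in requirements.txt is not really necessary. In practice only a few packages need to be installed, and the other dependencies are resolved automatically.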
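A quick way to check whether such a reduced install is complete is to ask Python which top-level modules it can locate, without importing them. This is a small sketch; `missing_modules` is a hypothetical helper, and the list in the example contains only the packages named in this thread, which may not be everything the training script needs.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that Python cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Example (not executed here):
#   print(missing_modules(["tensorflow", "matplotlib", "scipy"]))
```

An empty list only means the modules can be found; it says nothing about whether their versions are compatible with each other.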