Open hgneng opened 1 year ago
Python version: 3.8.10
TensorFlow version: 2.8.4
CUDA version: cuda_11.2.r11.2/compiler.29618528_0
Related issue: https://github.com/tensorflow/tensorflow/issues/50735
I have tried CUDA_LAUNCH_BLOCKING=1, with no luck.
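For anyone retrying this, here is a minimal sketch of setting CUDA_LAUNCH_BLOCKING from inside Python. It assumes the variable is only picked up if it is set before TensorFlow initializes CUDA, so it has to run before the first `import tensorflow`. The helper name `enable_blocking_launches` is made up for illustration.

```python
import os


def enable_blocking_launches(env=os.environ):
    """Set CUDA_LAUNCH_BLOCKING=1 so kernel launches run synchronously.

    Hypothetical helper: with synchronous launches, a CUDA error is more
    likely to be reported at the call that actually caused it.
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env["CUDA_LAUNCH_BLOCKING"]


enable_blocking_launches()

# Import TensorFlow only after the variable is set, for example:
# import tensorflow as tf
```

Exporting the variable in the shell before starting the training script should have the same effect.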
What GPU and CPU hardware does this platform use? It looks like either the GPU or CPU memory is too small, or there is a hardware-related problem.
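The hardware configuration is below; in theory it should be sufficient. Also, the crash happens right at the start of the run.
Instance type: B1.large; GPU: 1 GPU(s), 24 GB memory per GPU; CPU: 8 core(s), 24 GB RAM
Can you choose the TensorFlow version yourself? Try configuring a different version.
I switched to a TensorFlow 2.8.0 image, and the result is the same. However, I found that the image I was using before was TensorFlow 2.10.1. Yet in both environments the command below returns version 2.8.4. I suspect I may not be using the platform correctly...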
python3 -c 'import tensorflow as tf; print(tf.__version__)'
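When the reported version does not match the image, it may help to check which TensorFlow distributions pip has recorded in the active environment, without importing TensorFlow at all. This is a rough sketch using the standard-library `importlib.metadata` (Python 3.8+). The list of distribution names to probe is an assumption and may need adjusting.

```python
from importlib import metadata

# Distribution names to probe. This list is an assumption, not exhaustive.
CANDIDATES = ("tensorflow", "tensorflow-gpu", "tensorflow-cpu")


def installed_tensorflow_versions(names=CANDIDATES):
    """Map each distribution name to its installed version, or None."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


print(installed_tensorflow_versions())
```

If a version shows up here that differs from the one the image advertises, something (for example a `pip install -r requirements.txt`) has probably installed over the image's copy.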
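I now know the cause of the problem above: the TensorFlow 2.8.4 requirement is written in requirements.txt, so I need to change that file. I am a bit puzzled, though, about why the version requirements in requirements.txt are so strict: every entry is pinned to one exact version, instead of listing only a few main packages and letting the rest be installed as dependencies. As it stands, changing the TensorFlow version makes other dependencies fail, and I have to comment out several lines before the install passes.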
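As a possible workaround for the strict pins, here is a small sketch that strips `==version` from every line of a requirements list except the packages you still want pinned. The function name `relax_pins` and the sample entries are made up for illustration and are not taken from this project's requirements.txt.

```python
import re

# Matches "name==version" at the start of a requirement line.
PIN = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*==\s*[^\s;#]+")


def relax_pins(lines, keep_pinned=("tensorflow",)):
    """Drop '==version' from every requirement except those in keep_pinned.

    Lines that are not simple 'name==version' pins (comments, extras,
    unpinned names) are passed through unchanged. Anything after the
    version on a relaxed line, such as an environment marker, is dropped.
    """
    keep = {name.lower() for name in keep_pinned}
    relaxed = []
    for line in lines:
        match = PIN.match(line)
        if match and match.group(1).lower() not in keep:
            # Keep only the package name so pip can pick a compatible version.
            relaxed.append(match.group(1))
        else:
            relaxed.append(line.rstrip("\n"))
    return relaxed


# Hypothetical entries, for illustration only.
example = ["tensorflow==2.8.4", "numpy==1.22.4", "# comment", "scipy"]
print(relax_pins(example))
```

Unpinned installs are less reproducible, which is presumably why the file pins everything, so this is a trade-off rather than a fix.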
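After changing the TensorFlow version, running train still reports the CUDA_ERROR_ILLEGAL_ADDRESS error. Checking the version also printed some errors. I will ask the platform community; maybe the way I installed TensorFlow is wrong.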
# python3 -c 'import tensorflow as tf; print(tf.__version__)'
2023-02-17 09:58:20.598049: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-17 09:58:20.701675: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-02-17 09:58:22.057285: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-02-17 09:58:23.898986: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/orion:/usr/lib64:/usr/lib:/usr/lib/orion
2023-02-17 09:58:24.479348: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/orion:/usr/lib64:/usr/lib:/usr/lib/orion
2023-02-17 09:58:24.479392: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2.10.1
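To compare the CUDA version that the TensorFlow binary was built against with the CUDA 11.2 toolkit reported at the top, something like the sketch below may help. It uses `tf.sysconfig.get_build_info()`. The helper name `tensorflow_build_summary` is made up, and which keys are present is assumed to depend on the build, so they are read with `.get`.

```python
def tensorflow_build_summary():
    """Return version and CUDA build info, or None if TensorFlow is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    info = tf.sysconfig.get_build_info()
    return {
        "tf_version": tf.__version__,
        # These keys may be absent on CPU-only builds, hence .get().
        "is_cuda_build": info.get("is_cuda_build"),
        "cuda_version": info.get("cuda_version"),
        "cudnn_version": info.get("cudnn_version"),
    }


print(tensorflow_build_summary())
```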
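I tried the TensorFlow 2.8.0, 2.9.3, and 2.10.1 builds provided by the images (not installed through pip), and all of them report the CUDA_ERROR_ILLEGAL_ADDRESS error. I have no further ideas for now.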
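One way to narrow down CUDA_ERROR_ILLEGAL_ADDRESS might be a tiny GPU smoke test that is independent of the training code. The sketch below runs one small matrix multiplication on the first GPU. The name `gpu_smoke_test` is made up. Some CUDA failures abort the whole process instead of raising a Python exception, so the `except` branch may never be reached.

```python
def gpu_smoke_test(size=1024):
    """Run one small matmul on the first GPU and return a status string."""
    try:
        import tensorflow as tf
    except ImportError:
        return "tensorflow not installed"

    gpus = tf.config.list_physical_devices("GPU")
    if not gpus:
        return "no GPU visible"

    try:
        # Allocate GPU memory on demand instead of reserving it all up front.
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError:
        pass  # The GPUs were already initialized; continue anyway.

    try:
        with tf.device("/GPU:0"):
            a = tf.random.uniform((size, size))
            total = tf.reduce_sum(tf.matmul(a, a))
        return "ok, sum=%.3e" % float(total.numpy())
    except Exception as exc:
        return "failed: %s" % type(exc).__name__


print(gpu_smoke_test())
```

If even this fails on the platform, the problem is more likely in the image, driver, or GPU virtualization layer than in the training script.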
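One note: with TensorFlow provided by the image, I only installed matplotlib and scipy separately through pip, and did not install requirements.txt. My impression is that the full list in requirements.txt is not really necessary. In practice only a few packages need to be installed, and the other dependencies are resolved automatically.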
I have successfully run training on Ubuntu 22.04 without a GPU.
However, it fails on platform.virtaicloud: training aborts with CUDA_ERROR_ILLEGAL_ADDRESS.
What should I do? Anyone with an idea can use the public environment image above to debug.