Open hgneng opened 1 year ago
Python version: 3.8.10
TensorFlow version: 2.8.4
CUDA version: cuda_11.2.r11.2/compiler.29618528_0
A related issue: But I have tried setting CUDA_LAUNCH_BLOCKING=1, with no luck.
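For anyone reproducing this: as far as I know, CUDA_LAUNCH_BLOCKING is read when the CUDA runtime initializes, so it has to be in the environment before the training process starts using the GPU. One way to be sure is to launch the training command from a small wrapper that sets it for the child process. A minimal sketch, assuming the training entry point is an ordinary command line; `run_with_blocking_launches` and the `train.py` name in the comment are hypothetical, not taken from this repo:

```python
import os
import subprocess
import sys


def run_with_blocking_launches(cmd, capture=False):
    """Run cmd with CUDA_LAUNCH_BLOCKING=1 set in the child's environment."""
    env = dict(os.environ)
    # Makes CUDA kernel launches synchronous, so an illegal-address error
    # should be reported at the launch that caused it, not at a later call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return subprocess.run(cmd, env=env, capture_output=capture, text=True)


if __name__ == "__main__":
    # Harmless stand-in for the real training command,
    # e.g. [sys.executable, "train.py"] (hypothetical name).
    check = [sys.executable, "-c",
             "import os; print(os.environ.get('CUDA_LAUNCH_BLOCKING'))"]
    result = run_with_blocking_launches(check, capture=True)
    print(result.stdout.strip())
```

With the variable set this way, the stack trace printed at the failure is more likely to point at the operation that actually triggered the illegal address.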
What GPU and CPU hardware does this platform use? It looks like either too little GPU or CPU memory, or a hardware-related problem.
Instance type: B1.large. GPU: 1 GPU(s), 24 GB of memory per GPU. CPU: 8 core(s), 24 GB of RAM.
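I switched to a TensorFlow 2.8.0 image and got the same result. However, I noticed that the image I had been using before was TensorFlow 2.10.1, yet running the command below returns version 2.8.4 in both environments. I suspect I may be doing something wrong...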
python3 -c 'import tensorflow as tf; print(tf.__version__)'
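To see which installation that command is actually picking up, it can help to print the installed distribution version and the import location side by side. A sketch using only the standard library; `describe_package` is a helper name made up for this example:

```python
import importlib.util
from importlib import metadata


def describe_package(name):
    """Return the installed version and import location of a top-level package."""
    try:
        version = metadata.version(name)
    except metadata.PackageNotFoundError:
        version = None  # no installed distribution with this exact name
    # find_spec locates a top-level module without importing it.
    spec = importlib.util.find_spec(name)
    origin = spec.origin if spec is not None else None
    return {"version": version, "origin": origin}


if __name__ == "__main__":
    print(describe_package("tensorflow"))
```

If `origin` points into a pip-managed site-packages directory rather than the copy shipped with the image, something installed at startup has probably replaced the image's TensorFlow. Note that `version` can be `None` even when `origin` is set, if the distribution is published under a different name than the import name.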
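I now know the reason for the above: the TensorFlow 2.8.4 version requirement is written in requirements.txt, so I need to change that file. I do find it a bit odd, though, that the version requirements in requirements.txt are so strict. Every entry is pinned to one exact version, instead of listing only the few main packages and letting the rest be installed as dependencies. As it stands, changing the TensorFlow version makes other dependencies fail, and I have to comment out several lines before the install goes through.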
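One way to experiment with a different TensorFlow version without commenting lines out by hand is to generate a relaxed copy of requirements.txt that keeps exact pins only for a chosen few packages and lets pip resolve the rest. This is a rough sketch, not something the repo provides: the `relax_pins` helper and the package names in the example are made up for illustration, and it only handles simple `name==version` lines.

```python
import re

# Matches simple pins such as "numpy==1.22.4" or "package[extra]==1.0".
PIN = re.compile(r"^\s*([A-Za-z0-9_.\-]+(?:\[[^\]]*\])?)\s*==\s*[^\s;#]+(.*)$")


def relax_pins(lines, keep_pinned=()):
    """Drop '==version' from every line except the packages in keep_pinned."""
    keep = {name.lower() for name in keep_pinned}
    relaxed = []
    for line in lines:
        match = PIN.match(line)
        if match is None:
            # Comments, blank lines, and non-exact specifiers pass through.
            relaxed.append(line.rstrip("\n"))
            continue
        name, rest = match.group(1), match.group(2)
        base = name.split("[")[0].lower()
        if base in keep:
            relaxed.append(line.rstrip("\n"))
        else:
            # Keep any trailing environment marker or comment.
            relaxed.append((name + rest).rstrip())
    return relaxed


if __name__ == "__main__":
    # Hypothetical file contents, not the repo's actual requirements.
    example = ["tensorflow==2.8.4", "numpy==1.22.4", "# tools", "tqdm>=4.0"]
    print("\n".join(relax_pins(example, keep_pinned=["tensorflow"])))
```

Writing the result to a separate file and installing from that leaves the original pinned requirements.txt untouched, which makes it easy to go back if the relaxed set resolves to an incompatible combination.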
# python3 -c 'import tensorflow as tf; print(tf.__version__)'
2023-02-17 09:58:20.598049: I tensorflow/core/platform/] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-17 09:58:20.701675: I tensorflow/core/util/] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-02-17 09:58:22.057285: E tensorflow/stream_executor/cuda/] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-02-17 09:58:23.898986: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ''; dlerror: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/orion:/usr/lib64:/usr/lib:/usr/lib/orion
2023-02-17 09:58:24.479348: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ''; dlerror: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib/orion:/usr/lib64:/usr/lib:/usr/lib/orion
2023-02-17 09:58:24.479392: W tensorflow/compiler/tf2tensorrt/utils/] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
I tried the TensorFlow 2.8.0, 2.9.3, and 2.10.1 builds provided by the images (not installed via pip), and all of them fail with CUDA_ERROR_ILLEGAL_ADDRESS. I am out of ideas for now.
I have successfully run training on an Ubuntu 22.04 machine without a GPU.
However, training fails on platform.virtaicloud: it aborts with CUDA_ERROR_ILLEGAL_ADDRESS.
What should I do? Anyone with ideas is welcome to use the public environment image above to debug.