Closed ouening closed 5 years ago
Hey bro, have you figured it out? I met the same issue.
I also met the same issue when I ran yolo-v3. Did you solve this problem?
I met the same error! My setup is two RTX 2080 8 GB cards, tensorflow-gpu 1.12, Keras 2.2.4, Ubuntu 18.04. Can somebody solve it?
Try the following statement at the beginning of the code.
import keras.backend as K
cfg = K.tf.ConfigProto(gpu_options={'allow_growth': True})
K.set_session(K.tf.Session(config=cfg))
Hi, I still got some errors:
Load weights model_data/yolo_weights.h5.
Freeze the first 249 layers of total 252 layers.
Train on 3439 samples, val on 382 samples, with batch size 32.
Epoch 1/50
2019-03-24 10:53:58.419070: E tensorflow/stream_executor/cuda/cuda_blas.cc:652] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "train.py", line 206, in <module>
    _main()
  File "train.py", line 81, in _main
    verbose=1)
  File "/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py", line 217, in fit_generator
    class_weight=class_weight)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1217, in train_on_batch
    outputs = self.train_function(ins)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Blas SGEMM launch failed : m=1384448, n=32, k=64
  [[{{node conv2d_3/convolution}} = Conv2D[T=DT_FLOAT, _class=["loc:@batch_normalization_3/cond/FusedBatchNorm/Switch"], data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](leaky_re_lu_2/LeakyRelu, conv2d_3/kernel/read)]]
  [[{{node yolo_loss/while_1/LoopCond/_2963}} = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_6607_yolo_loss/while_1/LoopCond", tensor_type=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopyolo_loss/while_1/strided_slice_1/stack_2/_2805)]]
any solution for it?
Hello, I met the same error. My environment is CUDA 9.0, cuDNN 7.4, tensorflow-gpu 1.12.0, and my GPU is an RTX 2080. This is my work computer. My own computer has the same environment except that the GPU is a 940, and it runs the same project fine. What can I do about this error? Can someone help me?
I think it is a bug with the RTX 2080, and I have not figured it out. If you make any progress on this issue, please get in touch with me. Thanks a lot.
I also met the same error. My GPU is an RTX 2080 Ti with tensorflow-gpu 1.8.0 and CUDA 9.0, but on a GTX 1080 Ti with tensorflow-gpu 1.4.0 and CUDA 8.0 the program runs normally. Can someone give some advice? Thanks.
I have solved this problem by installing the patches for CUDA 9. There are 4 patches, which can be downloaded from the website: cuda9 patches
Hello! Did you solve it?How?
I fixed this issue just by installing the CUDA Toolkit patch. https://developer.nvidia.com/cuda-90-download-archive?target_os=Windows&target_arch=x86_64&target_version=10&target_type=exenetwork (choose your CUDA version)
I have installed the CUDA Toolkit patch but I still have this problem.
I have the same issue: the same code runs on a K80 but not on an RTX 2080.
same issue on my Titan RTX.
It worked after I updated TensorFlow from 1.13.1 to 1.14.
My CUDA version is 10.0, cuDNN version is 7.6.3, and the GPU is an RTX 2080.
After making the change suggested above, I still got the following problem:
E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
@xiaohai-AI try this
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
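The snippet above creates a session but does not hand it to Keras; the earlier suggestion in this thread does that with `K.set_session`. A minimal sketch that does both, assuming standalone Keras on TensorFlow 1.x. The helper name `make_growth_session` is made up for illustration, and it returns `None` when TensorFlow is not installed:

```python
import importlib


def make_growth_session():
    """Create a TF1-style session with allow_growth and register it with Keras.

    Returns the session, or None if TensorFlow is not installed.
    (Hypothetical helper, not part of TensorFlow or Keras.)
    """
    try:
        tf = importlib.import_module("tensorflow")
    except ImportError:
        return None

    # TF 1.13+ and 2.x expose the 1.x API under tf.compat.v1;
    # older 1.x releases such as 1.12 have it at the top level.
    v1 = getattr(getattr(tf, "compat", None), "v1", tf)

    config = v1.ConfigProto()
    config.gpu_options.allow_growth = True  # allocate GPU memory on demand
    sess = v1.Session(config=config)

    # Register the session with standalone Keras if it is installed.
    try:
        keras_backend = importlib.import_module("keras.backend")
        keras_backend.set_session(sess)
    except (ImportError, AttributeError):
        pass  # no standalone Keras, or a Keras version without set_session
    return sess
```

Call `make_growth_session()` once at the top of `train.py`, before any model is built.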
I was also getting the same error with tensorflow-gpu 1.6.0 and CUDA 9.0. Upgrading to CUDA 10.0 and tensorflow-gpu 1.14.0 solved the issue for me. Thanks @xiaohai-AI. Not sure why you are getting the internal error though. Probably because you have two CUDA versions installed, or maybe because TensorFlow is picking up the wrong version of cuDNN.
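One rough way to check the "two CUDA versions" or "wrong cuDNN" guess is to list which CUDA runtime, cuBLAS, and cuDNN shared libraries sit on the library search path. The sketch below assumes Linux and the usual `lib<name>.so.<version>` file naming; the function names are made up for illustration. Symlinks from a single install show up as several version strings (for example `10.0` and `10.0.130`), so look for differing major or minor versions rather than just more than one entry.

```python
import os
import re
from collections import defaultdict

# Matches file names like libcudart.so.10.0 or libcudnn.so.7.6.3
_LIB_RE = re.compile(r"^lib(cudart|cublas|cudnn)\.so\.(\d+(?:\.\d+)*)$")


def find_cuda_libs(directories):
    """Map each library name to the set of version strings found."""
    found = defaultdict(set)
    for directory in directories:
        if not os.path.isdir(directory):
            continue  # skip stale entries on the search path
        for filename in os.listdir(directory):
            match = _LIB_RE.match(filename)
            if match:
                found[match.group(1)].add(match.group(2))
    return dict(found)


def search_path_from_env():
    """Directories listed in LD_LIBRARY_PATH (empty list if unset)."""
    raw = os.environ.get("LD_LIBRARY_PATH", "")
    return [p for p in raw.split(os.pathsep) if p]
```

For example, `find_cuda_libs(search_path_from_env() + ["/usr/local/cuda/lib64"])` covers the environment plus a common default install location (that path is an assumption about your machine).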
But with TensorFlow 1.15, CUDA 10.0, and an RTX 3080, I still have the same issue.
hey @mfshiu maybe you can try cuda 10.0 with tensorflow-gpu 1.14
hi @mfshiu, NVIDIA maintains its own version of TensorFlow 1.15 here: https://github.com/NVIDIA/tensorflow#install , which supports the latest GPU cards.
So you need to remove the official TensorFlow that was installed through pip or conda, and install NVIDIA's version, as its README.md says:
install the NVIDIA wheel index:
$ pip install --user nvidia-pyindex
install the current NVIDIA Tensorflow release:
$ pip install --user nvidia-tensorflow[horovod]
once installed, just use it as regular TensorFlow:
import tensorflow as tf
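Since the advice above depends on removing the official package first, here is a small sketch for spotting leftovers. It assumes Python 3.8+ (for `importlib.metadata`) and that the official wheels are published under the names `tensorflow` and `tensorflow-gpu`; both function names are made up for illustration.

```python
OFFICIAL_TF_NAMES = {"tensorflow", "tensorflow-gpu"}


def installed_distribution_names():
    """Names of installed distributions, lower-cased; empty set before Python 3.8."""
    try:
        from importlib import metadata  # available from Python 3.8
    except ImportError:
        return set()
    names = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:
            names.add(name.lower().replace("_", "-"))
    return names


def official_tf_leftovers(installed):
    """Official TensorFlow packages that are still installed, sorted by name."""
    return sorted(OFFICIAL_TF_NAMES & set(installed))
```

If `official_tf_leftovers(installed_distribution_names())` returns anything, `pip uninstall` those packages before installing `nvidia-tensorflow`.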
Hey @allenyllee I wonder if you might be able to clarify or help: When I follow those install instructions for the NVIDIA-tensorflow, I get a long error that tells me...to re-do what I just did?
$ pip install --user nvidia-pyindex
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting nvidia-pyindex
Downloading nvidia-pyindex-1.0.6.tar.gz (6.7 kB)
Building wheels for collected packages: nvidia-pyindex
Building wheel for nvidia-pyindex (setup.py) ... done
Created wheel for nvidia-pyindex: filename=nvidia_pyindex-1.0.6-py3-none-any.whl size=4171 sha256=692df4078194418f4812516403399f2e96373ad780b93c98ce944b5f02efb35d
Stored in directory: /tmp/pip-ephem-wheel-cache-kpx26e3z/wheels/52/31/c8/db9f8939a8bb1f3500ce81b630604cbfa6e31f82c8f1bd914d
Successfully built nvidia-pyindex
Installing collected packages: nvidia-pyindex
Successfully installed nvidia-pyindex-1.0.6
$ pip install --user nvidia-tensorflow[horovod]
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting nvidia-tensorflow[horovod]
Downloading nvidia-tensorflow-0.0.1.dev4.tar.gz (3.8 kB)
ERROR: Command errored out with exit status 1:
command: /home/shawley/anaconda3/envs/spnet/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-yv_vnm57/nvidia-tensorflow/setup.py'"'"'; __file__='"'"'/tmp/pip-install-yv_vnm57/nvidia-tensorflow/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-1hvhhg4h
cwd: /tmp/pip-install-yv_vnm57/nvidia-tensorflow/
Complete output (17 lines):
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-yv_vnm57/nvidia-tensorflow/setup.py", line 150, in <module>
raise RuntimeError(open("ERROR.txt", "r").read())
RuntimeError:
###########################################################################################
The package you are trying to install is only a placeholder project on PyPI.org repository.
This package is hosted on NVIDIA Python Package Index.
This package can be installed as:
$ pip install nvidia-pyindex
$ pip install nvidia-tensorflow
Please refer to NVIDIA instructions: https://github.com/NVIDIA/tensorflow#install.
###########################################################################################
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
Re-running those "This package can be installed as:" commands just results in the same error message again.
Resolved this issue for myself: Be sure you're running Python 3.8 and Pip 20 or later.
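A quick sketch for checking that requirement before running the install commands. It treats "Python 3.8" as exactly 3.8.x, which is an assumption based on the reports in this thread (other NVIDIA releases may accept other versions); the function names are made up for illustration.

```python
def parse_version(text):
    """Turn '20.2.4' into (20, 2, 4), stopping at non-numeric parts like 'b1'."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def environment_ok(python_version, pip_version):
    """True if Python is 3.8.x and pip is 20 or newer."""
    is_py38 = tuple(python_version[:2]) == (3, 8)
    return is_py38 and parse_version(pip_version) >= (20,)
```

In a real environment you would call it as `environment_ok(sys.version_info, pip.__version__)` after `import sys, pip`.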
I had the same problem with an RTX 3090 + TF 1.15. I resolved my problem by using the official nvidia+tf1 ngc docker container, available here: https://ngc.nvidia.com/catalog/containers/nvidia:tensorflow
It works very well for me. In my case, with an RTX 3090 + TF 1.15, the nvidia+tf1 NGC docker container version '21.05-tf1-py3' works very well! Thanks a lot.
It works after I update the tensorflow version from 1.13.1 to 1.14. My cuda version is 10.0, cudnn version is 7.6.3, the gpu is RTX2080
But my tensorflow is 1.15, cuda is 10.0, gpu is RTX 3080, still have the same issue.
Me too! Have you solved this problem?
Please find a version that matches your GPU in the nvidia-docker hub.
I hit the same problem on an A10 GPU. GPUs such as the RTX 30 series, A10, and A100, whose compute capability is 8.0 or higher, must use CUDA 11.x, so you can't use the official TensorFlow 1.x builds, which target CUDA 10 or lower. One solution is to use nvidia-tensorflow 1.x, which can use CUDA 11.x for acceleration. Download here: https://github.com/NVIDIA/tensorflow#install Thanks to @allenyllee.
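The rule above can be written down as a small lookup. This is only a sketch: the table lists compute capabilities for some of the GPUs mentioned in this thread and the first CUDA toolkit release with native support for each architecture, and the function names are made up for illustration.

```python
# Compute capability of some GPUs mentioned in this thread.
COMPUTE_CAPABILITY = {
    "GTX 1080 Ti": (6, 1),
    "RTX 2080": (7, 5),
    "RTX 2080 Ti": (7, 5),
    "Titan RTX": (7, 5),
    "A100": (8, 0),
    "A10": (8, 6),
    "RTX 3080": (8, 6),
    "RTX 3090": (8, 6),
    "RTX A6000": (8, 6),
    "RTX 4090": (8, 9),
}

# (minimum compute capability, first CUDA toolkit with native support),
# ordered from newest architecture to oldest.
MIN_CUDA = [
    ((8, 9), (11, 8)),
    ((8, 6), (11, 1)),
    ((8, 0), (11, 0)),
    ((7, 5), (10, 0)),
    ((7, 0), (9, 0)),
    ((6, 0), (8, 0)),
]


def min_cuda_for(gpu_name):
    """First CUDA version with native support for the GPU, or None if not covered."""
    capability = COMPUTE_CAPABILITY[gpu_name]
    for min_capability, cuda_version in MIN_CUDA:
        if capability >= min_capability:
            return cuda_version
    return None


def cuda_supports(gpu_name, cuda_version):
    """True if the given (major, minor) CUDA toolkit natively supports the GPU."""
    needed = min_cuda_for(gpu_name)
    return needed is not None and tuple(cuda_version) >= needed
```

For instance, `cuda_supports("RTX 3080", (10, 0))` is `False`, which matches the reports above that official TensorFlow 1.15 with CUDA 10.0 fails on 30-series cards.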
Problem fixed after installing:
!pip install nvidia-pyindex
!pip install nvidia-tensorflow
It works for me!!! Thanks a lot~ NVIDIA's TF version is 1.15, but luckily my code runs successfully on tf==1.15~ Btw, the environment that gave the error was "tf==1.12.0, 3090, cuda==9.0, ubuntu20.04".
Thanks! It works for me~
Cool!! It fixes perfectly my issue! Thanks!
Yes! Yes!!! Remove official tensorflow. Python3.8
pip install nvidia-pyindex
pip install nvidia-tensorflow
I used an A6000 with TF 1.15, CUDA 10.0.130, and cuDNN 7.3.1. The TF website told me to use Python 3.6 or 3.7, which is what I did before. But to use nvidia-pyindex and nvidia-tensorflow, I needed to change Python to 3.8. And it succeeded!!!
Thanks! Thank you very much! It solved my problem:
InternalError (see above for traceback): Blas xGEMMBatched launch failed : a.shape=[128,3,3], b.shape=[128,3,3], m=3, n=3, k=3, batch_size=128
  [[node rotation/MatMul_1 ...... = BatchMatMul[T=DT_DOUBLE, adj_x=false, adj_y=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](rotation/concat_7, rotation/concat_7)]]
  [[{{node gradients/decoder/dgcnn_trans_fc1/MatMul_grad/tuple/control_dependency_1/_171}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge2202...pendency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
By the way, my A6000 and 4090 both had this problem, and it is now solved. My TensorFlow is 1.12.0 and CUDA is 9.0.
The error happened when I trained on VOC data. My setup is two RTX 2080 8 GB cards, tensorflow-gpu 1.12, Keras 2.2.4.
Epoch 1/50
2019-01-28 00:16:00.441512: E tensorflow/stream_executor/cuda/cuda_blas.cc:652] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "train.py", line 192, in <module>
    _main(annotation_path=anno)
  File "train.py", line 65, in _main
    callbacks=[logging, checkpoint])
  File "/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py", line 217, in fit_generator
    class_weight=class_weight)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1217, in train_on_batch
    outputs = self.train_function(ins)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Blas SGEMM launch failed : m=346112, n=32, k=64
  [[{{node conv2d_3/convolution}} = Conv2D[T=DT_FLOAT, _class=["loc:@batch_normalization_3/cond/FusedBatchNorm/Switch"], data_format="NHWC", dilations=[1, 1, 1, 1], padding="SAME", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](leaky_re_lu_2/LeakyRelu, conv2d_3/kernel/read)]]
  [[{{node yolo_loss/while_1/LoopCond/_2963}} = _HostRecv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_6607_yolo_loss/while_1/LoopCond", tensor_type=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopyolo_loss/while_1/strided_slice_1/stack_2/_2805)]]