matthewearl / deep-anpr

Using neural networks to build an automatic number plate recognition system

train.py: could not create cudnn handle #30

Closed. erdalpekel closed this issue 7 years ago.

erdalpekel commented 7 years ago

My system:

- Ubuntu 16.04
- NVIDIA GeForce GTX 1050 Ti, driver version 375
- CUDA 8.0
- cuDNN 6.0
- TensorFlow compiled from source with compute capability 6.1

I'm running into the following issue when running the train.py script:

```
2017-04-09 10:19:58.319301: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-04-09 10:19:58.319576: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
name: GeForce GTX 1050 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.4175
pciBusID 0000:01:00.0
Total memory: 3.94GiB
Free memory: 3.90GiB
2017-04-09 10:19:58.319593: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
2017-04-09 10:19:58.319599: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0:   Y
2017-04-09 10:19:58.319606: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0)
2017-04-09 10:19:58.357462: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 1 visible devices
2017-04-09 10:19:58.357490: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 4 visible devices
2017-04-09 10:19:58.357965: I tensorflow/compiler/xla/service/service.cc:183] XLA service 0x38470e0 executing computations on platform Host. Devices:
2017-04-09 10:19:58.357980: I tensorflow/compiler/xla/service/service.cc:191]   StreamExecutor device (0): <undefined>, <undefined>
2017-04-09 10:19:58.358132: I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 1 visible devices
2017-04-09 10:19:58.358143: I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 4 visible devices
2017-04-09 10:19:58.358574: I tensorflow/compiler/xla/service/service.cc:183] XLA service 0x3840180 executing computations on platform CUDA. Devices:
2017-04-09 10:19:58.358587: I tensorflow/compiler/xla/service/service.cc:191]   StreamExecutor device (0): GeForce GTX 1050 Ti, Compute Capability 6.1
2017-04-09 10:19:59.687378: E tensorflow/stream_executor/cuda/cuda_dnn.cc:359] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2017-04-09 10:19:59.687419: E tensorflow/stream_executor/cuda/cuda_dnn.cc:326] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2017-04-09 10:19:59.687428: F tensorflow/core/kernels/conv_ops.cc:659] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms)
```

I found several related issues in the TensorFlow GitHub repository. I suspect my cuDNN v6.0 is the problem here. Should I just downgrade to cuDNN v5.1, or is there another workaround?

erdalpekel commented 7 years ago

OK, I uninstalled cuDNN v6, installed cuDNN v5, and compiled TensorFlow r1.1 again. I'm still getting the same error.

erdalpekel commented 7 years ago

I solved the problem: the GPU memory fraction of 0.95 in train.py was causing it. After lowering it to 0.90, training started.
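
For anyone hitting the same thing, here is a minimal sketch of what that change amounts to, assuming the limit is set through TensorFlow 1.x's `tf.GPUOptions` / `tf.ConfigProto` API; the surrounding session code below is illustrative, not the exact contents of train.py:

```python
import tensorflow as tf

# Cap how much GPU memory TensorFlow pre-allocates for this process.
# With the original 0.95 cap, cuDNN apparently had too little free memory
# left on the 4 GiB GTX 1050 Ti to create its handle; 0.90 leaves headroom.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.90)
config = tf.ConfigProto(gpu_options=gpu_options)

with tf.Session(config=config) as sess:
    # ... build the graph and run the training loop as usual ...
    pass
```

An alternative that avoids hand-tuning the fraction is setting `allow_growth=True` in `tf.GPUOptions`, which makes TensorFlow allocate GPU memory on demand instead of claiming a fixed share up front.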

WaGjUb commented 5 years ago

I had the same problem, and changing the value from 0.95 to 0.90 fixed it for me too! Thanks @erdalpekel