nick-torenvliet closed 3 years ago
Is this due to TensorFlow being bumped up, as per the closed issues?
Hi,
It sounds like the problem is due to an incompatible combination of tf version / CUDA / cuDNN / GPU model. You could try this solution: https://github.com/tensorflow/tensorflow/issues/24496#issuecomment-455265295 ; it's a pretty common problem. As far as I remember, I successfully tested the current configuration about a year ago.
Thanks for that,
Got it running on TensorFlow 1.15.0 with the items listed in requirements.txt in a Docker container.
As per the link you included, I added the following to train.py:
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.compat.v1.Session(config=config)
just after:
tf.compat.v1.enable_eager_execution()
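For readers applying the same workaround, here is a minimal sketch of a closely related variant: passing the `allow_growth` config directly to `tf.compat.v1.enable_eager_execution()`, which accepts a `config` argument in TF 1.x. The helper name `configure_eager_gpu_growth` is hypothetical and not part of the repository; the thread itself only confirms the separate `Session(config=config)` approach, so treat this as an untested alternative.

```python
def configure_eager_gpu_growth():
    """Enable eager execution with on-demand GPU memory allocation.

    Hypothetical helper (not in the GP-VAE repo). It must run once at
    program startup, before any other TensorFlow op is executed.
    """
    # Imported inside the function so this sketch can be loaded on
    # machines that do not have TensorFlow installed.
    import tensorflow as tf

    config = tf.compat.v1.ConfigProto()
    # Allocate GPU memory as needed instead of reserving it all up front.
    config.gpu_options.allow_growth = True
    tf.compat.v1.enable_eager_execution(config=config)
    return config


# Assumed usage in train.py, in place of the bare
# tf.compat.v1.enable_eager_execution() call:
#     configure_eager_gpu_growth()
```

Whether this behaves identically to creating a separate session on a given TF / CUDA / cuDNN combination is an assumption; the session-based version above is the one reported to work here.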
It appears to be running on a single GPU.
Does this model run on multiple GPUs?
I don't think so. We didn't run multi-GPU training.
Hi,
I'm in a Docker container with TensorFlow 1.15.0 and Python 3. The GPUs seem to be working fine for simple test tasks, e.g. tf can see four GPUs and I can load them up.
The requirements are all installed.
When I run
CUDA_VISIBLE_DEVICES=* python train.py --model_type gp-vae --data_type physionet --exp_name asdf
everything works fine - it cycles through the calculation - but it runs on CPU only.
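A likely (but not confirmed in this thread) reason for the CPU-only behaviour is that `*` is not a valid entry for `CUDA_VISIBLE_DEVICES`: the CUDA runtime stops reading the list at the first invalid index, so no GPU is exposed to TensorFlow. The sketch below approximates that rule for integer indices only; the function name `visible_gpu_indices` is made up for illustration, and GPU UUID entries are deliberately not handled.

```python
def visible_gpu_indices(value, physical_count):
    """Approximate which physical GPUs CUDA exposes for a given
    CUDA_VISIBLE_DEVICES value (integer indices only, UUIDs ignored)."""
    if value is None:
        # Variable unset: every physical GPU is visible.
        return list(range(physical_count))

    visible = []
    for token in value.split(","):
        try:
            index = int(token.strip())
        except ValueError:
            # First invalid entry: it and everything after it are hidden.
            break
        if not 0 <= index < physical_count:
            break
        visible.append(index)
    return visible


# On a four-GPU machine like the one described here:
print(visible_gpu_indices("*", 4))   # no valid index, so no GPU is visible
print(visible_gpu_indices("1", 4))   # only physical GPU 1 is visible
print(visible_gpu_indices(None, 4))  # all four GPUs are visible
```

Under this reading, the `CUDA_VISIBLE_DEVICES=*` run never touches cuDNN at all, which would explain why it avoids the crash rather than fixing it.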
When I run anything else, e.g.
python train.py --model_type gp-vae --data_type physionet --exp_name asdf
or
CUDA_VISIBLE_DEVICES=1 python train.py --model_type gp-vae --data_type physionet --exp_name asdf
with any of the reference parameter sets, it bails as per below. It's not batch size... I've already played with that.
Any ideas?
2021-05-12 21:11:18.088886: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-05-12 21:11:18.088901: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-05-12 21:11:18.088915: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-05-12 21:11:18.088928: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-05-12 21:11:18.088941: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-05-12 21:11:18.088954: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-05-12 21:11:18.088970: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-05-12 21:11:18.089512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-05-12 21:11:18.089553: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-12 21:11:18.089562: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2021-05-12 21:11:18.089570: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2021-05-12 21:11:18.090103: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/device:GPU:0 with 10320 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:5e:00.0, compute capability: 7.5)
GPU support: True
Training...
2021-05-12 21:11:18.097352: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x67c3e70
2021-05-12 21:11:18.097415: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-05-12 21:11:18.447486: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-05-12 21:11:18.735398: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-05-12 21:11:20.163447: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2021-05-12 21:11:20.170211: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
File "train.py", line 473, in <module>
app.run(main)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "train.py", line 239, in main
trainable_vars = model.get_trainable_vars()
File "/home/torenvln/gp-vae/GP-VAE/lib/models.py", line 325, in get_trainable_vars
tf.zeros(shape=(1, self.time_length, self.data_dim), dtype=tf.float32))
File "/home/torenvln/gp-vae/GP-VAE/lib/models.py", line 332, in compute_loss
return self._compute_loss(x, m_mask=m_mask, return_parts=return_parts)
File "/home/torenvln/gp-vae/GP-VAE/lib/models.py", line 280, in _compute_loss
qz_x = self.encode(x)
File "/home/torenvln/gp-vae/GP-VAE/lib/models.py", line 220, in encode
return self.encoder(x)
File "/home/torenvln/gp-vae/GP-VAE/lib/models.py", line 51, in __call__
mapped = self.net(x)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 898, in __call__
outputs = self.call(cast_inputs, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/sequential.py", line 269, in call
outputs = layer(inputs, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 898, in __call__
outputs = self.call(cast_inputs, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/layers/convolutional.py", line 387, in call
return super(Conv1D, self).call(inputs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/layers/convolutional.py", line 197, in call
outputs = self._convolution_op(inputs, self.kernel)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 1134, in __call__
return self.conv_op(inp, filter)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 639, in __call__
return self.call(inp, filter)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 238, in __call__
name=self.name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 227, in _conv1d
name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 574, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 574, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 1681, in conv1d
name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_nn_ops.py", line 1031, in conv2d
data_format=data_format, dilations=dilations, name=name, ctx=_ctx)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_nn_ops.py", line 1130, in conv2d_eager_fallback
ctx=_ctx, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [Op:Conv2D] name: sequential/conv1d/conv1d/