ratschlab / GP-VAE

TensorFlow implementation for the GP-VAE model described in https://arxiv.org/abs/1907.04155
MIT License

Problems running code with GPU support. #10

Closed nick-torenvliet closed 3 years ago

nick-torenvliet commented 3 years ago

Hi,

I'm in a TensorFlow 1.15.0, Python 3 Docker container. GPUs seem to be working fine for simple test tasks, e.g. tf can see four GPUs and I can load them up.
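For reference, a quick way to sanity-check this with stock TF 1.x utilities (nothing repo-specific):

```python
import tensorflow as tf
from tensorflow.python.client import device_lib

# Quick check of which devices TensorFlow 1.15 can actually see.
print(tf.test.is_gpu_available())                          # True if a usable CUDA GPU is found
print([d.name for d in device_lib.list_local_devices()])   # e.g. ['/device:CPU:0', '/device:GPU:0', ...]
```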

The requirements are all installed.

When I run CUDA_VISIBLE_DEVICES=* python train.py --model_type gp-vae --data_type physionet --exp_name asdf

Everything works fine - it cycles through the calculation - but it runs on CPU only.

When I run anything else, e.g. python train.py --model_type gp-vae --data_type physionet --exp_name asdf or CUDA_VISIBLE_DEVICES=1 python train.py --model_type gp-vae --data_type physionet --exp_name asdf

with any of the reference parameter sets, it bails as per the log below. It's not batch size... I've already played with that.

Any ideas?

2021-05-12 21:11:18.088886: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-05-12 21:11:18.088901: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-05-12 21:11:18.088915: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-05-12 21:11:18.088928: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-05-12 21:11:18.088941: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-05-12 21:11:18.088954: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-05-12 21:11:18.088970: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-05-12 21:11:18.089512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-05-12 21:11:18.089553: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-12 21:11:18.089562: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2021-05-12 21:11:18.089570: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2021-05-12 21:11:18.090103: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/device:GPU:0 with 10320 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:5e:00.0, compute capability: 7.5)
GPU support: True
Training...
2021-05-12 21:11:18.097352: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x67c3e70
2021-05-12 21:11:18.097415: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-05-12 21:11:18.447486: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-05-12 21:11:18.735398: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-05-12 21:11:20.163447: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2021-05-12 21:11:20.170211: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
Traceback (most recent call last):
  File "train.py", line 473, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "train.py", line 239, in main
    trainable_vars = model.get_trainable_vars()
  File "/home/torenvln/gp-vae/GP-VAE/lib/models.py", line 325, in get_trainable_vars
    tf.zeros(shape=(1, self.time_length, self.data_dim), dtype=tf.float32))
  File "/home/torenvln/gp-vae/GP-VAE/lib/models.py", line 332, in compute_loss
    return self._compute_loss(x, m_mask=m_mask, return_parts=return_parts)
  File "/home/torenvln/gp-vae/GP-VAE/lib/models.py", line 280, in _compute_loss
    qz_x = self.encode(x)
  File "/home/torenvln/gp-vae/GP-VAE/lib/models.py", line 220, in encode
    return self.encoder(x)
  File "/home/torenvln/gp-vae/GP-VAE/lib/models.py", line 51, in call
    mapped = self.net(x)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 898, in __call__
    outputs = self.call(cast_inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/sequential.py", line 269, in call
    outputs = layer(inputs, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 898, in __call__
    outputs = self.call(cast_inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/layers/convolutional.py", line 387, in call
    return super(Conv1D, self).call(inputs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/layers/convolutional.py", line 197, in call
    outputs = self._convolution_op(inputs, self.kernel)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 1134, in __call__
    return self.conv_op(inp, filter)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 639, in __call__
    return self.call(inp, filter)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 238, in __call__
    name=self.name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 227, in _conv1d
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 574, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 574, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 1681, in conv1d
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_nn_ops.py", line 1031, in conv2d
    data_format=data_format, dilations=dilations, name=name, ctx=_ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_nn_ops.py", line 1130, in conv2d_eager_fallback
    ctx=_ctx, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [Op:Conv2D] name: sequential/conv1d/conv1d/

nick-torenvliet commented 3 years ago

Is this due to tensorflow being bumped up, as per the closed issues?

dbaranchuk commented 3 years ago

Hi,

It sounds like the problem is due to an incompatible combination of TF version / CUDA / cuDNN / GPU model. One can probably try this solution: https://github.com/tensorflow/tensorflow/issues/24496#issuecomment-455265295; it's a pretty common problem. As far as I remember, I successfully tested the current configuration about a year ago.
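If it helps, the workaround discussed there essentially boils down to not letting TF pre-allocate the whole GPU. A minimal TF 1.x sketch (generic, not code from this repo):

```python
import tensorflow as tf

# Let TensorFlow grow GPU memory on demand instead of grabbing it all at once;
# this commonly resolves "Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR".
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
# Alternatively, cap the fraction of GPU memory TF may claim:
# config.gpu_options.per_process_gpu_memory_fraction = 0.9
sess = tf.compat.v1.Session(config=config)
```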

nick-torenvliet commented 3 years ago

Thanks for that,

Got it running on TF 1.15.0 with the packages listed in requirements.txt, in a Docker container.

As per the link you included, I added:

config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.compat.v1.Session(config=config)

Just after: tf.compat.v1.enable_eager_execution()

in train.py
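Since train.py runs eagerly, an equivalent way to apply the same setting (assuming TF 1.15's compat API) would be to hand the ConfigProto to enable_eager_execution directly; sketch only:

```python
import tensorflow as tf

# Same allow_growth setting, applied through the eager-execution config
# rather than via a throwaway tf.compat.v1.Session.
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
tf.compat.v1.enable_eager_execution(config=config)
```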

It appears to be running on a single GPU.

Does this model run on multiple GPU?

dbaranchuk commented 3 years ago

I don't think so. We didn't run multi-GPU training.
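If someone wanted to experiment with it anyway, the usual TensorFlow starting point would be tf.distribute.MirroredStrategy. A rough, untested sketch with a stand-in Keras model (the GP-VAE training loop itself would need restructuring to use it):

```python
import tensorflow as tf

# Hypothetical data-parallel setup; the model below is a placeholder, not GP-VAE.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(35,)),
        tf.keras.layers.Dense(35),
    ])
    model.compile(optimizer="adam", loss="mse")

# model.fit(dataset, epochs=...)  # batches would be split across the visible GPUs
```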