neuronets / kwyk

Knowing what you know - Bayesian brain parcellation
https://doi.org/10.3389/fninf.2019.00067
Apache License 2.0

[WIP] fix Dockerfile.gpu #34

Open Hoda1394 opened 2 years ago

Hoda1394 commented 2 years ago

I have tried many different things to address issue #33. Among them, this Dockerfile builds without error, but when I run the container, TensorFlow does not see the GPU and runs on the CPU! The container image is available on Docker Hub as hodadock/kwyk:gpu_test

@satra, @kaczmarj - Any idea how we can fix it?

kaczmarj commented 2 years ago

@Hoda1394 - what command are you using to run the container? i don't have any experience running a docker image with a gpu; i've only used apptainer/singularity with gpu.

a few potential problems come to mind (not saying that any of these are present here):

  1. the cuda/cudnn versions are not appropriate for the installed tensorflow version (though using the official tensorflow image should prevent this issue).
  2. the docker run command is not correct. i'm assuming there are some extra flags that need to be added to use gpu.
  3. the nvidia drivers are too old on the host system (possible, but unlikely because this container uses tensorflow 1.x, which has been around for several years now).

another point -- you can test whether a gpu is available with tf.test.is_gpu_available(). tensorflow 2.x also has tf.config.list_physical_devices("GPU") but not sure if 1.x has it.

another thought -- try validating that the official tensorflow image can use the gpu. so run the tensorflow/tensorflow:1.12.3-gpu-py3 image in a way that should use the gpu and test that it actually sees the gpu. if the container sees the gpu, the problem is somewhere in the dockerfile.
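The checks mentioned above can be collected into one small script to run inside a container. This is only a sketch: `gpu_report` is a hypothetical helper name, and the fallback assumes that `tf.config.list_physical_devices` may be missing on older 1.x releases, in which case `tf.test.is_gpu_available()` is used instead.

```python
import importlib


def gpu_report():
    """Return a dict describing what TensorFlow can see, or why it cannot be checked."""
    try:
        tf = importlib.import_module("tensorflow")
    except ImportError as exc:
        # TensorFlow is not importable in this environment.
        return {"installed": False, "error": str(exc)}

    report = {
        "installed": True,
        "version": tf.__version__,
        # False here means the installed build has no CUDA support at all,
        # regardless of drivers or container flags.
        "built_with_cuda": tf.test.is_built_with_cuda(),
    }
    # Prefer tf.config.list_physical_devices when it exists; otherwise fall
    # back to tf.test.is_gpu_available().
    config = getattr(tf, "config", None)
    list_devices = getattr(config, "list_physical_devices", None)
    if list_devices is not None:
        report["gpu_visible"] = len(list_devices("GPU")) > 0
    else:
        report["gpu_visible"] = bool(tf.test.is_gpu_available())
    return report


print(gpu_report())
```

If `built_with_cuda` is False, the installed TensorFlow build itself has no CUDA support, so the run command and host drivers are probably not the first thing to look at.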

Hoda1394 commented 2 years ago

Actually, I was running the Singularity conversion of this image with GPU support. The GPU is visible inside the container, but TensorFlow can't see it. I also tested the official image, and tf.test.is_gpu_available() returns False there too. So it seems the issue is in the base image!

Hoda1394 commented 2 years ago

As additional info, when I run pip list | grep tensorflow inside the container, I get

tensorflow          1.12.3                
tensorflow-gpu      1.12.0 

So two TensorFlow packages are installed, at different versions. Not sure whether this can cause the issue...
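A minimal sketch of how the listing above could be checked automatically, for example in a build test. `parse_pip_list` and `tensorflow_conflict` are hypothetical helper names, and the sample text is copied from the output above. The script only flags that both packages are present; it does not show that this causes the GPU problem.

```python
def parse_pip_list(text):
    """Parse 'name  version' lines (as printed by pip list) into a dict."""
    packages = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            packages[parts[0].lower()] = parts[1]
    return packages


def tensorflow_conflict(packages):
    """True when both the CPU and the GPU distribution are installed."""
    return "tensorflow" in packages and "tensorflow-gpu" in packages


# Output of `pip list | grep tensorflow` as reported in this thread.
sample = """tensorflow          1.12.3
tensorflow-gpu      1.12.0
"""

pkgs = parse_pip_list(sample)
print(pkgs)
print("both installed:", tensorflow_conflict(pkgs))
```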

kaczmarj commented 2 years ago

this is probably the problem (or one of them!). can you try pip list with the base image? see which one is present. and see if the base image can see the gpu.

Hoda1394 commented 2 years ago

I tried this with the base image and saw both packages. When I start python and import tensorflow as tf, tf.__version__ returns 1.12.3. So it seems that tensorflow is being imported rather than tensorflow-gpu. I tried to uninstall it inside the container, but I was not successful.

kaczmarj commented 2 years ago

i can reproduce this... it could be a problem with the 1.12.3-gpu-py3 docker image. why are we using such an old image anyway?

docker run --rm tensorflow/tensorflow:1.12.3-gpu-py3 python -c 'import tensorflow as tf; print(tf.test.is_built_with_cuda())'
False

the 1.14.0-gpu-py3 image works.

docker run --rm tensorflow/tensorflow:1.14.0-gpu-py3 python -c 'import tensorflow as tf; print(tf.test.is_built_with_cuda())'
True

we should probably use a newer image. i realize we used 1.12 in the project, but we can test if everything works correctly with 1.15 (the last release of the 1.x series).
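The two image checks above could be scripted so that further tags can be compared the same way. This is a sketch that assumes docker is on the PATH and the images can be pulled; `build_command` and `is_built_with_cuda` are hypothetical helper names, and the command is the same one used above.

```python
import subprocess

# One-liner executed inside the container (same check as in the commands above).
CHECK = "import tensorflow as tf; print(tf.test.is_built_with_cuda())"


def build_command(tag):
    """Build the docker invocation for one tensorflow/tensorflow image tag."""
    return ["docker", "run", "--rm", "tensorflow/tensorflow:" + tag,
            "python", "-c", CHECK]


def is_built_with_cuda(tag):
    """Run the check in the given image and report whether it printed True."""
    result = subprocess.run(
        build_command(tag),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,  # raise if docker or the python command fails
    )
    lines = result.stdout.strip().splitlines()
    # TensorFlow may print warnings first; the answer is the last line.
    return bool(lines) and lines[-1].strip() == "True"
```

Calling `is_built_with_cuda("1.12.3-gpu-py3")` and `is_built_with_cuda("1.14.0-gpu-py3")` would then repeat the comparison above. Note that this only tests how TensorFlow was built, not whether a GPU is visible at run time.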

Hoda1394 commented 2 years ago

I tried removing tensorflow during the build, but tensorflow-gpu doesn't work properly without it. I have already tested TensorFlow 1.15 and got some other errors due to the version mismatch, so if we want to use TensorFlow 1.15 we may need to update the code.

Hoda1394 commented 2 years ago

I will also try version 1.14.0-gpu-py3.

kaczmarj commented 2 years ago

feel free to post any errors you get when trying newer versions. paste the entire traceback and i can take a look