Closed: znruss closed this issue 10 months ago.
This is the correct behaviour; however, the help link needs updating. The correct link is now: https://github.com/nanoporetech/medaka#improving-parallelism
Why is it saying I'm not using a GPU, though? Running nvidia-smi in this Docker container shows that the RTX 3090 is visible inside the container.
Also, the divide-by-zeros feel like incorrect behaviour.
Over the years we've learnt it's difficult to diagnose GPU errors remotely, so you might be on your own here. Something I would say, however, is that the speed of the processing would indicate that a GPU is being used. Can you watch the output of nvidia-smi while medaka is running and see if anything pops into existence?
The divide-by-zeros result from some niggly floating-point precision issues. I'll see about fixing those.
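For readers hitting the same warning: one common way these floating-point divide-by-zeros arise is when an error rate that underflows to exactly 0.0 is converted into a log-scaled quality score. The sketch below is illustrative only, assuming that kind of conversion; the function name `phred_quality` and the floor value are mine, not medaka's actual code or fix.

```python
import math

# Illustrative sketch, not medaka's actual code: a Phred-style quality
# score takes log10 of an error rate, so a rate that underflows to
# exactly 0.0 yields -inf and triggers divide-by-zero warnings.
# Clamping the rate to a small floor keeps the result finite.

def phred_quality(error_rate: float, floor: float = 1e-9) -> float:
    """Return -10*log10(error_rate), clamped so a zero rate stays finite."""
    return -10.0 * math.log10(max(error_rate, floor))

print(phred_quality(0.001))  # ~30.0
print(phred_quality(0.0))    # capped at ~90.0 instead of +inf
```

The choice of floor sets the maximum reportable quality (1e-9 caps it at Q90); any sufficiently small positive value silences the warning without changing realistic scores.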
No worries; CUDA errors are absolutely the most annoying things I have worked with. I ran nvidia-smi -lms 500 and grepped the history as the program ran. Nothing showed up besides the desktop window manager. Thanks again for the insight on the divide-by-zero.
By way of troubleshooting, I ran:

```
docker run --rm --gpus all ontresearch/medaka:latest python3 -c "import tensorflow as tf;print(tf.config.list_physical_devices('GPU'))"
```

and got back `[]`.
Also tried:

```
docker run --rm --gpus all ontresearch/medaka:latest python3 -c "import tensorflow as tf;print(tf.sysconfig.get_build_info(),tf.__version__)"
```

which returned:

```
OrderedDict([('is_cuda_build', False), ('is_rocm_build', False), ('is_tensorrt_build', False)]) 2.7.1
```
Whereas:

```
docker run --rm --gpus all nvcr.io/nvidia/tensorflow:23.03-tf2-py3 python3 -c "import tensorflow as tf;print(tf.config.list_physical_devices('GPU'))"
```

returned:

```
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
```
Also tried:

```
docker run --rm --gpus all nvcr.io/nvidia/tensorflow:23.03-tf2-py3 python3 -c "import tensorflow as tf;print(tf.sysconfig.get_build_info(),tf.__version__)"
```

which returned:

```
OrderedDict([('cpu_compiler', '/opt/rh/devtoolset-9/root/usr/bin/gcc'), ('cuda_compute_capabilities', ['sm_52', 'sm_60', 'sm_61', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'compute_90']), ('cuda_version', '12.1'), ('cudnn_version', '8'), ('is_cuda_build', True), ('is_rocm_build', False), ('is_tensorrt_build', True)]) 2.11.0
```
I've tried finding TensorFlow builds that support the CUDA version I'm running, and so far that's the only convenient one (even `pip3 install tensorflow` on bare metal gave 2.12, but not compiled for CUDA). For what it's worth, building the container off nvcr.io/nvidia/tensorflow:23.03-tf2-py3 may be a useful way to solve the problem. But CPU-only performance does seem pretty good. Thanks again for looking into it!
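The two `get_build_info()` outputs above suggest a way to fail fast before a long polishing run: check the `is_cuda_build` flag programmatically. A small sketch of that idea; the helper name `check_cuda_build` is mine, not part of medaka or TensorFlow.

```python
def check_cuda_build(build_info: dict) -> bool:
    """Return True if this TensorFlow wheel was compiled with CUDA support.

    `build_info` is the mapping returned by tf.sysconfig.get_build_info();
    CPU-only wheels report is_cuda_build=False (as seen above for 2.7.1).
    """
    return bool(build_info.get("is_cuda_build", False))


# Usage (requires tensorflow; shown here against the outputs pasted above):
#   import tensorflow as tf
#   if not check_cuda_build(tf.sysconfig.get_build_info()):
#       raise RuntimeError(f"TensorFlow {tf.__version__} lacks CUDA support; "
#                          "medaka will fall back to CPU.")

print(check_cuda_build({"is_cuda_build": False}))                         # False
print(check_cuda_build({"is_cuda_build": True, "cuda_version": "12.1"}))  # True
```

Guarding on the build flag rather than `tf.config.list_physical_devices('GPU')` distinguishes "this wheel can never use a GPU" from "no GPU is visible to this container", which is exactly the distinction the two images above demonstrate.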
Running your examples above I get results that I would expect:

```
$ docker run --env TF_CPP_MIN_LOG_LEVEL=3 --rm --gpus all ontresearch/medaka:latest python3 -c "import tensorflow as tf;print(tf.config.list_physical_devices('GPU'))"
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]

$ docker run --env TF_CPP_MIN_LOG_LEVEL=3 --rm --gpus all ontresearch/medaka:latest python3 -c "import tensorflow as tf;print(tf.sysconfig.get_build_info(),tf.__version__)"
OrderedDict([('cpu_compiler', '/dt9/usr/bin/gcc'), ('cuda_compute_capabilities', ['sm_35', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'compute_80']), ('cuda_version', '11.2'), ('cudnn_version', '8'), ('is_cuda_build', True), ('is_rocm_build', False), ('is_tensorrt_build', True)]) 2.10.1
```