If you are running into problems with TensorFlow

chogovadze commented 4 years ago

Hello everyone. Several users are reporting the same kind of obstacles with training and predicting. After some research, the problem appears to be a compatibility issue between old TensorFlow 1.x versions and newer GPUs when TensorFlow is installed through pip. Compiling TensorFlow from source resolves the issue, but it is very time-consuming. I hope this write-up helps other users who are having trouble with their environment.

This method requires the use of conda.

Create a new conda environment and simply run: conda install tensorflow-gpu=1.12 (conda will automatically pull the correct cuda/cudnn versions).
Once installation is complete, remove the tensorflow-gpu==1.12 line from requirements.txt and run the makefile.
Change all batch_size and eval_batch_size in the config files to 1.
Finally run export TF_FORCE_GPU_ALLOW_GROWTH=true followed by export TMPDIR=/tmp/ in your current terminal session.
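The file edits and environment variables in the steps above can be sketched in Python. This is only an illustration under a few assumptions: the requirements file pins TensorFlow on a line of its own, and the config files are YAML-style text with `batch_size: N` and `eval_batch_size: N` entries. The function names are hypothetical and not part of the repository.

```python
import re


def strip_tensorflow_pin(requirements_text):
    """Drop tensorflow / tensorflow-gpu lines so pip does not reinstall TensorFlow."""
    kept = [
        line
        for line in requirements_text.splitlines()
        if not re.match(r"\s*tensorflow(-gpu)?\s*([=<>~!].*)?$", line, flags=re.IGNORECASE)
    ]
    return "\n".join(kept) + "\n"


def force_batch_size_one(config_text):
    """Set every batch_size and eval_batch_size value to 1, keeping indentation."""
    pattern = re.compile(
        r"^([ \t]*(?:eval_)?batch_size[ \t]*:[ \t]*)\S+", flags=re.MULTILINE
    )
    return pattern.sub(r"\g<1>1", config_text)


def training_env(base_env):
    """Return a copy of base_env with the two variables from the last step set."""
    env = dict(base_env)
    env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    env["TMPDIR"] = "/tmp/"
    return env
```

The result of `training_env(os.environ)` could be passed as `env=` to `subprocess.run` when launching a training script, which has the same effect as the two `export` commands for that one process.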

If you are still having issues be sure that you have NOT:

Used an old conda environment with cuda/cudnn already configured.
Installed cuda/cudnn separately with the command conda install cudnn=x.x.x=cudax.x_x.
Run the makefile within the new conda environment before the aforementioned steps, thus installing tensorflow through pip.
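One way to check the last point is to look at where TensorFlow came from in the environment. The sketch below assumes the usual `conda list` column layout (name, version, build, channel), in which pip-installed packages show `pypi` in the channel column. The helper name is hypothetical.

```python
def pip_installed_packages(conda_list_output, names=("tensorflow", "tensorflow-gpu")):
    """Return the packages from `names` that `conda list` reports as installed via pip."""
    found = []
    for line in conda_list_output.splitlines():
        # Skip blank lines and the commented header that conda prints.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        name = fields[0]
        # Assumption: the fourth column, when present, is the channel.
        channel = fields[3] if len(fields) > 3 else ""
        if name in names and channel == "pypi":
            found.append(name)
    return found
```

The input can be collected with `subprocess.run(["conda", "list"], capture_output=True, text=True).stdout`. A non-empty result suggests that pip installed TensorFlow over the conda build, which is the situation the list above warns about.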

References from: 21, 35, 48, 96, 148, 149, Compiling, Inverse

I have successfully worked with this repository with the following setup:

Ubuntu 18.04
Ryzen 3700
RTX 2070 Super (8GB)

If you are still having some issues, please do not hesitate to reach out.

paragghosh commented 3 years ago

@chogovadze, thanks for outlining the steps here. I was having the same issues and followed the steps to fix the TF and CUDA version incompatibility. After finishing these steps, I got an error when I tried to run SuperPoint (script export_detections.py):
ImportError: No module named 'superpoint'
Following the thread https://github.com/rpautrat/SuperPoint/issues/206 I did another round of make install. It finished fine but I am still getting the same error. Any ideas?

paragghosh commented 3 years ago

I realized my error - I was pointing to my earlier venv in the makefile. After I removed that, I reran make install (which reinstalled superpoint). However, now I am getting the following error when I try to run the export_detections.py script:
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
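For context, libcuda.so.1 is normally shipped with the NVIDIA driver rather than with the CUDA toolkit packages that conda installs, so this error usually points at the driver installation or the library search path. A minimal sketch for checking whether the dynamic linker can see the library, using only the standard `ctypes` module (the function names are hypothetical):

```python
import ctypes
import ctypes.util


def find_cuda_driver_library():
    """Return the CUDA driver library name if the linker can locate it, else None."""
    return ctypes.util.find_library("cuda")


def can_load_cuda_driver(name="libcuda.so.1"):
    """Try to load the driver library the same way an importing module would."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        return False
```

If `can_load_cuda_driver()` returns False, the library is missing or not on the search path, and reinstalling TensorFlow alone is unlikely to help.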

David-Willo commented 1 year ago

For those who have difficulty running on GPUs that do not work with older CUDA versions (an RTX 3080 in my case), try switching to NVIDIA's TensorFlow repo: https://github.com/NVIDIA/tensorflow#install. This solved my issue.
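One plausible reading of "can't match lower version CUDA" is that each GPU generation needs a minimum CUDA toolkit version for native support. The sketch below encodes a few well-known minimums (Turing, compute capability 7.5, needs CUDA 10.0; Ampere 8.6, which covers the RTX 3080 and 3090, needs CUDA 11.1). The table is illustrative and incomplete, and the thread does not confirm that this is the cause of the failures reported here.

```python
# Minimum CUDA toolkit version with native support for each compute capability.
# Illustrative subset only; consult NVIDIA's documentation for the full list.
MIN_CUDA_FOR_CAPABILITY = {
    (6, 1): (8, 0),   # Pascal, e.g. GTX 1080
    (7, 0): (9, 0),   # Volta
    (7, 5): (10, 0),  # Turing, e.g. RTX 2070 Super
    (8, 0): (11, 0),  # Ampere, A100
    (8, 6): (11, 1),  # Ampere, e.g. RTX 3080 / 3090
}


def natively_supported(capability, cuda_version):
    """Return True if the CUDA version natively targets the given compute capability."""
    minimum = MIN_CUDA_FOR_CAPABILITY.get(capability)
    if minimum is None:
        raise KeyError(f"compute capability {capability} is not in the table")
    # Tuples compare element by element, so (11, 1) >= (11, 0) is True.
    return cuda_version >= minimum
```

For example, `natively_supported((8, 6), (10, 0))` is False, which is consistent with an RTX 3080 needing a TensorFlow 1.x build made against CUDA 11, such as the NVIDIA-maintained one linked above.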

iMeleon commented 1 year ago

For those who have difficulty running on GPUs that do not work with older CUDA versions (an RTX 3080 in my case), try switching to NVIDIA's TensorFlow repo: https://github.com/NVIDIA/tensorflow#install. This solved my issue.

Thanks. This solved my issue with loss nan, precision nan, recall 0.0000 on an RTX 3090.

20181313zhang commented 9 months ago

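For those who have difficulty running on GPUs that do not work with older CUDA versions (an RTX 3080 in my case), try switching to NVIDIA's TensorFlow repo: https://github.com/NVIDIA/tensorflow#install. This solved my issue.

Thanks. This solved my issue with loss nan, precision nan, recall 0.0000 on an RTX 3090.

Hello, I have an RTX 3080 Ti. Did your training succeed? I would like to get in touch so we can learn from each other. Thanks.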

vegetable233 commented 2 months ago

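For those who have difficulty running on GPUs that do not work with older CUDA versions (an RTX 3080 in my case), try switching to NVIDIA's TensorFlow repo: https://github.com/NVIDIA/tensorflow#install. This solved my issue.

Thanks. This solved my issue with loss nan, precision nan, recall 0.0000 on an RTX 3090.

Hello, I have an RTX 3080 Ti. Did your training succeed? I would like to get in touch so we can learn from each other. Thanks.

I also ran into the loss nan problem when training MagicPoint. Have you solved it? You can reach me on QQ at 972048746.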

GoroYeh-HRI commented 5 days ago

Hi,

Thanks for sharing the solution. I successfully ran the MagicPoint training. However, at the end I got loss=nan, precision=nan, recall=0.0000.

Do you know what might be the reason?

rpautrat / SuperPoint

If you are running into problems with TensorFlow #173
