rpautrat / SuperPoint

Efficient neural feature detector and descriptor
MIT License

If you are running into problems with TensorFlow #173

Open chogovadze opened 3 years ago

chogovadze commented 3 years ago

Hello everyone. Several users seem to be reporting the same kind of problems with training and prediction. After some research, the cause appears to be a compatibility issue between old TensorFlow 1.x versions and newer GPUs when TensorFlow is installed through pip. Compiling TensorFlow from source resolves the issue, but it is very time-consuming. I hope this write-up helps other users who are having trouble with their environment.

This method requires the use of conda.

  1. Create a new conda environment and run: conda install tensorflow-gpu=1.12 (conda will automatically pull the matching CUDA/cuDNN versions).
  2. Once the installation is complete, remove tensorflow-gpu==1.12 from requirements.txt and run the makefile.
  3. Change every batch_size and eval_batch_size in the config files to 1.
  4. Finally, run export TF_FORCE_GPU_ALLOW_GROWTH=true followed by export TMPDIR=/tmp/ in your current terminal session.
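
For anyone who prefers to script steps 3 and 4, here is a minimal Python sketch. It is not part of the repository: the function names force_batch_size_one and configure_tf_env are made up for illustration, and it assumes the config files are YAML with plain "key: value" lines. The environment variables only take effect if they are set before TensorFlow is imported in the same process.

```python
import os
import re

# Matches "batch_size: <int>" or "eval_batch_size: <int>" at the start of a
# line (after indentation). Assumption: YAML-style "key: value" config lines.
_BATCH_RE = re.compile(
    r"^([ \t]*)(eval_batch_size|batch_size)([ \t]*:[ \t]*)\d+",
    re.MULTILINE,
)


def force_batch_size_one(config_text):
    """Return config_text with every batch_size / eval_batch_size set to 1."""
    # \g<3>1 keeps the original "key: " prefix and replaces only the number.
    return _BATCH_RE.sub(r"\g<1>\g<2>\g<3>1", config_text)


def configure_tf_env(env=None):
    """Set the two variables from step 4 in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    env["TMPDIR"] = "/tmp/"
    return env


if __name__ == "__main__":
    example = "model:\n    batch_size: 32\n    eval_batch_size: 50\n"
    print(force_batch_size_one(example))
```

To apply it to a real config, read the file into a string, pass it through force_batch_size_one, and write the result back. Keep a copy of the original, since the substitution is not reversible.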

If you are still having issues be sure that you have NOT:

References from:

I have successfully worked with this repository with the following setup:

If you are still having some issues, please do not hesitate to reach out.

paragghosh commented 3 years ago

@chogovadze, thanks for outlining the steps here. I was having the same issues and followed the steps to fix the TF and CUDA version incompatibility. After finishing them, I got an error when I tried to run SuperPoint (script export_detections.py):
ImportError: No module named 'superpoint'
Following the thread https://github.com/rpautrat/SuperPoint/issues/206 I did another round of make install. It finished fine, but I am still getting the same error. Any ideas?

paragghosh commented 3 years ago

I realized my error: I was pointing to my earlier venv in the makefile. After I removed that, I reran make install (which reinstalled superpoint). However, I am now getting the following error when I try to run the export_detections.py script:
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
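
Both errors in this comment and the previous one can be narrowed down from inside the environment before launching a long run. The following is only a diagnostic sketch, not a fix, and its helper names (module_available, shared_library_loads) are hypothetical. It prints which interpreter is active, whether the superpoint package is visible to it, and whether the dynamic loader can open libcuda.so.1. Note that libcuda.so.1 is normally provided by the NVIDIA driver rather than by the CUDA toolkit packages, so a missing file usually points at the driver installation or the library search path.

```python
import ctypes
import importlib.util
import sys


def module_available(name):
    """Return True if the top-level module `name` is visible to this interpreter."""
    return importlib.util.find_spec(name) is not None


def shared_library_loads(name):
    """Return True if the dynamic loader can open the shared library `name`."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the library is missing or cannot be loaded.
        return False
    return True


if __name__ == "__main__":
    # A wrong interpreter path here would explain "No module named 'superpoint'".
    print("interpreter:", sys.executable)
    print("superpoint importable:", module_available("superpoint"))
    print("libcuda.so.1 loadable:", shared_library_loads("libcuda.so.1"))
```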

David-Willo commented 11 months ago

For those who have difficulty running on GPUs that cannot use the older CUDA versions (a 3080 in my case), try switching to NVIDIA's TensorFlow repo, which solved my issue: https://github.com/NVIDIA/tensorflow#install

iMeleon commented 11 months ago

Thanks. This solved my issue with loss nan, precision nan, recall 0.0000 on an RTX 3090.

20181313zhang commented 7 months ago

Hello, I have an RTX 3080 Ti. Did your training succeed? I would like to get in touch so we can learn from each other. Thanks.

vegetable233 commented 1 month ago

I also ran into the loss nan problem when training MagicPoint. Have you solved it? You can reach me on QQ at 972048746.