Open chogovadze opened 3 years ago
@chogovadze, thanks for outlining the steps here. I was having the same issues and followed the steps to fix the TF and CUDA version incompatibility. After finishing these steps, I got an error when I tried to run superpoint (script export_detections.py):
ImportError: No module named 'superpoint'
Following the thread https://github.com/rpautrat/SuperPoint/issues/206, I did another round of make install. It finished fine, but I am still getting the same error. Any ideas?
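One way to narrow down this kind of ImportError is to check which Python interpreter is actually running and whether it can see the package at all. The sketch below is only a diagnostic aid, not part of the repository; `describe_module` is a hypothetical helper name, and it relies solely on the standard library.

```python
import importlib.util
import sys


def describe_module(name):
    """Report which interpreter is running and whether `name` is importable from it."""
    # find_spec returns None when a top-level module cannot be located on sys.path.
    spec = importlib.util.find_spec(name)
    return {
        "interpreter": sys.executable,  # path of the Python binary in use
        "module": name,
        "found": spec is not None,
        "location": getattr(spec, "origin", None),  # file the module would load from
    }


if __name__ == "__main__":
    # Run this with the same interpreter that is used for export_detections.py.
    print(describe_module("superpoint"))
```

If "interpreter" points at a different virtual environment than the one make install wrote to, or "found" is False, the package was most likely installed somewhere else.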
I realized my error - I was pointing to my earlier venv in the makefile. After I removed that, I reran make install (which reinstalled superpoint). However, now I am getting the following error when I try to run the export_detections.py script:
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
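This error means the dynamic loader cannot find the NVIDIA driver library. As a rough check, assuming a Linux machine, the following sketch asks the loader directly whether `libcuda.so.1` can be opened, independently of TensorFlow. `can_load_libcuda` is a hypothetical helper name used only for illustration.

```python
import ctypes
import ctypes.util


def can_load_libcuda(name="libcuda.so.1"):
    """Return True if the dynamic loader can open the given shared library."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        # Raised when the library is missing or not on the loader search path.
        return False


if __name__ == "__main__":
    # find_library returns None when no matching library is registered.
    print("find_library('cuda') ->", ctypes.util.find_library("cuda"))
    print("libcuda.so.1 loadable:", can_load_libcuda())
```

If this prints False, the problem is probably with the driver installation or the library search path rather than with the superpoint package itself.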
For those who have difficulty running on newer GPUs that older CUDA versions do not support (a 3080 in my case), try switching to NVIDIA's TensorFlow repo: https://github.com/NVIDIA/tensorflow#install. This solved my issue.
Thanks. This solved my issue with loss nan, precision nan, recall 0.0000 on an RTX 3090.
Hello, I have an RTX 3080 Ti. Did your training succeed? I hope we can get in touch and learn from each other. Thanks.
I also ran into the loss nan problem when training magicpoint. Have you solved it? You can reach me on QQ at 972048746.
Hello everyone, it seems that several users are reporting the same kind of obstacles when training or predicting. After some research, this appears to be a compatibility issue between old versions of tensorflow 1.x and newer GPUs when tensorflow is installed through pip. Compiling tensorflow from source resolves the issue, but it is very time-consuming. I hope this write-up helps other users who are having trouble with their environment.
This method requires the use of conda.

1. Run conda install tensorflow-gpu=1.12 (conda will automatically pull the correct cuda/cudnn versions).
2. Remove tensorflow-gpu==1.12 from requirements.txt and run the makefile.
3. Set batch_size and eval_batch_size in the config files to 1.
4. Run export TF_FORCE_GPU_ALLOW_GROWTH=true followed by export TMPDIR=/tmp/ in your current terminal session.

If you are still having issues, be sure that you have NOT installed cudnn separately with conda install cudnn=x.x.x=cudax.x_x.
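The runtime settings from the steps above can also be applied from Python, which may be convenient when launching scripts from a notebook or wrapper. This is a minimal sketch under stated assumptions: the helper names `apply_env_workarounds` and `force_batch_size_one` are hypothetical, and the nested config layout in the example is illustrative rather than copied from the repository's config files. The environment variables have to be set before tensorflow is imported for them to take effect.

```python
import copy
import os


def apply_env_workarounds(env=os.environ):
    """Mirror the two export commands from the write-up for the current process."""
    env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    env["TMPDIR"] = "/tmp/"
    return env


def force_batch_size_one(config):
    """Return a copy of a config dict with batch_size and eval_batch_size set to 1."""
    patched = copy.deepcopy(config)  # leave the caller's config untouched

    def _walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key in ("batch_size", "eval_batch_size"):
                    node[key] = 1
                else:
                    _walk(value)
        elif isinstance(node, list):
            for item in node:
                _walk(item)

    _walk(patched)
    return patched


if __name__ == "__main__":
    apply_env_workarounds()
    # Hypothetical config layout, used only for illustration.
    example = {"model": {"name": "magic_point", "batch_size": 64, "eval_batch_size": 50}}
    print(force_batch_size_one(example))
```

Editing the config files by hand, as described in the steps, has the same effect. The function form only makes the override explicit and repeatable.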
References from: #21, #35, #48, #96, #148, #149
I have successfully worked with this repository with the following setup:
If you are still having some issues, please do not hesitate to reach out.