Open chogovadze opened 4 years ago
@chogovadze, thanks for outlining the steps here. I was having the same issues described here and followed the steps to fix the TF version and CUDA version incompatibility. After finishing these steps, I got an error when I tried to run SuperPoint (script export_detections.py):
ImportError: No module named 'superpoint'
Following the thread https://github.com/rpautrat/SuperPoint/issues/206 I did another round of make install. It finished fine but I am still getting the same error. Any ideas?
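One way to narrow down this kind of ImportError is to ask the interpreter that actually runs export_detections.py where it lives and whether it can see the package. The snippet below is only a diagnostic sketch: the helper name describe_module is made up for illustration and is not part of the SuperPoint repository.

```python
import importlib.util
import sys


def describe_module(name):
    """Report the running interpreter and whether top-level module `name` is visible to it."""
    spec = importlib.util.find_spec(name)  # None when the module cannot be found
    return {
        "interpreter": sys.executable,  # path of the Python binary in use
        "prefix": sys.prefix,           # root of the active venv / conda env
        "found": spec is not None,
        "location": getattr(spec, "origin", None),
    }


if __name__ == "__main__":
    # "superpoint" is the module name taken from the error message above.
    for key, value in describe_module("superpoint").items():
        print(f"{key}: {value}")
```

If "found" is False and "prefix" points at an environment other than the one make install wrote to, the package was probably installed into a different environment.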
I realized my error - I was pointing to my earlier venv in the makefile. After I removed that, I reran make install (which reinstalled superpoint). However, now I am getting the following error when I try to run the export_detections.py script:
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
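libcuda.so.1 is normally provided by the NVIDIA driver rather than by the CUDA toolkit, so this error usually means the dynamic loader cannot find the driver library. A quick check from Python, independent of TensorFlow, might look like the sketch below. The helper name can_load_shared_library is hypothetical.

```python
import ctypes


def can_load_shared_library(name):
    """Return True if the dynamic loader can open the shared library `name`."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the library is missing or not on the loader's search path.
        return False
    return True


if __name__ == "__main__":
    # The library name is taken from the error message above.
    print("libcuda.so.1 loadable:", can_load_shared_library("libcuda.so.1"))
```

If this prints False in the same shell where the script fails, the problem sits below TensorFlow, in the driver installation or the library search path.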
For those having difficulty running on newer GPUs that cannot be matched with a lower CUDA version (3080 in my case), try switching to NVIDIA's TensorFlow repo (https://github.com/NVIDIA/tensorflow#install). This solved my issue.
Thanks. This solved my issue with loss nan, precision nan, recall 0.0000 on an RTX 3090.
Hello, I have an RTX 3080 Ti. Did your training succeed? I would like to get in touch so we can learn from each other. Thanks.
I also ran into the loss nan problem when training MagicPoint. Have you solved it? You can reach me on QQ at 972048746.
Hi,
Thanks for sharing the solution.
I successfully ran the MagicPoint training.
However, at the end I got loss=nan, precision=nan, recall=0.0000.
Do you know what might be the reason?
Hello everyone. Several users seem to be reporting the same kind of obstacles with training and prediction. After some research, the problem appears to be a compatibility issue between old TensorFlow 1.x versions and newer GPUs when TensorFlow is installed through pip. Compiling TensorFlow from source resolves it, but that is very time-consuming. I hope this write-up helps other users who are having trouble with their environment.
This method requires the use of conda.
1. Run `conda install tensorflow-gpu=1.12` (conda will automatically pull the correct cuda/cudnn versions).
2. Remove `tensorflow-gpu==1.12` from `requirement.txt` and run the makefile.
3. Set `batch_size` and `eval_batch_size` in the config files to 1.
4. Run `export TF_FORCE_GPU_ALLOW_GROWTH=true` followed by `export TMPDIR=/tmp/` in your current terminal session.
If you are still having issues, be sure that you have NOT:
- run `conda install cudnn=x.x.x=cudax.x_x`.
References from: #21, #35, #48, #96, #148, #149
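The manual edits in the steps above (dropping the pinned package, setting the two batch sizes to 1, and exporting the two environment variables) could be scripted roughly as follows. This is a sketch under assumptions, not part of the repository: the function names are invented, the config files are assumed to use plain "key: value" lines, and setting variables through os.environ only affects the current Python process and its children, so it would have to happen before TensorFlow is imported.

```python
import os
import re


def drop_pinned_package(requirements_text, package="tensorflow-gpu"):
    """Return the requirements text without lines that name `package`."""
    kept = []
    for line in requirements_text.splitlines():
        spec = line.split("#", 1)[0].strip()  # ignore trailing comments
        # Keep only the bare package name in front of any version specifier.
        bare = re.split(r"[=<>!~\s\[;]", spec, maxsplit=1)[0]
        if bare.lower() != package.lower():
            kept.append(line)
    return "\n".join(kept) + "\n"


def set_batch_sizes(config_text, value=1):
    """Rewrite `batch_size:` and `eval_batch_size:` entries to `value`.

    Assumes simple `key: value` lines; other formats are left untouched.
    """
    pattern = r"^([ \t]*(?:eval_)?batch_size[ \t]*:[ \t]*)\S+"
    return re.sub(pattern, r"\g<1>" + str(value), config_text, flags=re.MULTILINE)


def apply_session_env(env=os.environ):
    """Set the two variables from the steps above on `env` and return it."""
    env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    env["TMPDIR"] = "/tmp/"
    return env
```

The text-based helpers take and return strings, so they can be tried on a copy of a file's contents before anything is written back to disk.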
I have successfully worked with this repository with the following setup:
If you are still having some issues, please do not hesitate to reach out.