open-mmlab / OpenUnReID

PyTorch open-source toolbox for unsupervised or domain adaptive object re-ID.
Apache License 2.0
397 stars, 67 forks

Faiss assertion 'err__ == cudaSuccess' failed in void faiss::gpu::runL2Norm( #7

Closed (944284742 closed this issue 4 years ago)

944284742 commented 4 years ago

My environment is Ubuntu 18.04, PyTorch 1.5.0, CUDA 10.1. Training fails at runtime with the error below. The training command I ran was:

GPUS=1 bash dist_train.sh SpCL SpCL/Market1501

bruteForceKnn is deprecated; call bfKnn instead
Faiss assertion 'err__ == cudaSuccess' failed in void faiss::gpu::runL2Norm(faiss::gpu::Tensor<T, 2, true, IndexType>&, bool, faiss::gpu::Tensor<float, 1, true, IndexType>&, bool, cudaStream_t) [with T = float; TVec = float4; IndexType = int; cudaStream_t = CUstream_st*] at gpu/impl/L2Norm.cu:292; details: CUDA error 11 invalid argument
Traceback (most recent call last):
  File "/my_app/anaconda3/envs/OpenUnReID-pytorch1.5-py3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/my_app/anaconda3/envs/OpenUnReID-pytorch1.5-py3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/my_app/anaconda3/envs/OpenUnReID-pytorch1.5-py3.6/lib/python3.6/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/my_app/anaconda3/envs/OpenUnReID-pytorch1.5-py3.6/lib/python3.6/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/my_app/anaconda3/envs/OpenUnReID-pytorch1.5-py3.6/bin/python', '-u', 'SpCL/main.py', 'SpCL/config.yaml', '--work-dir=SpCL/Market1501', '--launcher=pytorch', '--tcp-port=28211', '--set']' died with <Signals.SIGABRT: 6>.

yxgeee commented 4 years ago

This looks like an issue with faiss. Try re-installing it with:

conda install faiss-gpu cudatoolkit=10.0 -c pytorch

ZiaIsHere commented 2 years ago

@944284742 Yes, I resolved this issue by moving the faiss search from GPU to CPU during training:

search_type: 3 # 0,1,2 for GPU, 3 for CPU (work for faiss)
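Because the failed Faiss assertion aborts the whole process (the launcher above reports SIGABRT), a `try`/`except` in the training script cannot catch it. One way to decide between GPU and CPU search ahead of time is to run a tiny GPU search in a child interpreter and look at its exit code. The following is only a sketch: `choose_search_type` and `GPU_PROBE` are hypothetical names, and treating `0` as the GPU value is an assumption based on the "0,1,2 for GPU, 3 for CPU" comment.

```python
import subprocess
import sys

# Hypothetical probe script: a tiny GPU search, run in a child interpreter
# so that an abort inside faiss cannot take down the caller.
GPU_PROBE = """
import numpy as np
import faiss

res = faiss.StandardGpuResources()
index = faiss.index_cpu_to_gpu(res, 0, faiss.IndexFlatL2(8))
x = np.random.rand(16, 8).astype("float32")
index.add(x)
index.search(x, 2)
"""


def choose_search_type(probe=GPU_PROBE, gpu_type=0, cpu_type=3, timeout=120):
    """Return gpu_type if the probe exits cleanly, otherwise cpu_type.

    gpu_type=0 is an assumption; the thread only says 0, 1, 2 mean GPU
    and 3 means CPU.
    """
    try:
        done = subprocess.run(
            [sys.executable, "-c", probe],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return cpu_type
    # Any non-zero exit (abort, missing faiss, no GPU) selects the CPU path.
    return gpu_type if done.returncode == 0 else cpu_type
```

Calling `choose_search_type()` before training and writing the result into the `search_type` setting would pick the CPU path whenever the probe dies, including when faiss is not installed at all.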

QinHsiu commented 1 year ago

I have the same problem:

'err__ == cudaSuccess' failed in void faiss::gpu::runL2Norm(faiss::gpu::Tensor<T, 2, true, IndexType>&, bool, faiss::gpu::Tensor<float, 1, true, IndexType>&, bool, cudaStream_t) [with T = float; TVec = float4; IndexType = int; cudaStream_t = CUstream_st*] at /root/miniconda3/conda-bld/faiss-pkg_1669821803039/work/faiss/gpu/impl/L2Norm.cu:323; details: CUDA error 209 no kernel image is available for execution on the device
Aborted (core dumped)

madongdong1005 commented 10 months ago

Faiss assertion 'err__ == cudaSuccess' failed in void faiss::gpu::runL2Norm<T, TVec>(faiss::gpu::Tensor<T, 2, true, long long, faiss::gpu::traits::DefaultPtrTraits> &, bool, faiss::gpu::Tensor<float, 1, true, long long, faiss::gpu::traits::DefaultPtrTraits> &, bool, CUstream_st *) at D:/bld/faiss-split_1685360948441/work/faiss/gpu/impl/L2Norm.cu:300; details: CUDA error 209 no kernel image is available for execution on the device
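The reports in this thread share the same assertion but not the same CUDA error: the original post ends with error 11 (invalid argument), while the later two end with error 209 (no kernel image is available for execution on the device), which generally means the installed binary contains no kernel built for that GPU's architecture. A small helper for pulling the code out of a log is sketched below. The function name and the hint table are hypothetical; the hints only paraphrase suggestions made by commenters here and are not an official mapping.

```python
import re

# Hints paraphrase suggestions from this thread; not an official mapping.
HINTS = {
    11: "re-install faiss-gpu, or switch to CPU search (search_type: 3)",
    209: "try a different faiss build (one commenter reported 1.6.5 working)",
}

# Matches the tail of the Faiss assertion message up to the end of the line.
PATTERN = re.compile(r"details: CUDA error (\d+) ([^\n]+)")


def parse_faiss_cuda_error(log):
    """Return (code, message, hint) from a Faiss assertion log, or None."""
    match = PATTERN.search(log)
    if match is None:
        return None
    code = int(match.group(1))
    hint = HINTS.get(code, "no suggestion recorded in this thread")
    return code, match.group(2).strip(), hint
```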

blaz-r commented 9 months ago

Another solution that worked for me (although not for mmlab code) was to move to an older version of faiss, specifically 1.6.5.
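Before and after switching versions it can help to confirm which faiss the environment actually imports. A minimal sketch, assuming the package exposes `__version__` (the helper names are hypothetical, and `getattr` is used in case a given build lacks the attribute):

```python
import importlib


def installed_faiss_version():
    """Return the imported faiss version string, or None if unavailable."""
    try:
        faiss = importlib.import_module("faiss")
    except ImportError:
        return None
    return getattr(faiss, "__version__", None)


def version_tuple(text):
    """Turn a dotted version such as '1.6.5' into a comparable tuple."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


version = installed_faiss_version()
if version is None:
    print("faiss is not importable in this environment")
else:
    print("faiss", version, "matches 1.6.5:", version_tuple(version) == (1, 6, 5))
```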

madongdong1005 commented 9 months ago

Thank you for your help. I recall seeing in the faiss README on GitHub that I would have to be on Linux to solve this problem.
