microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

RuntimeError: Error building extension 'cpu_adam' because /usr/bin/ld: cannot find -lcurand. Help! #5659

Open hekaijie123 opened 2 weeks ago

hekaijie123 commented 2 weeks ago

python -c 'import deepspeed; deepspeed.ops.adam.cpu_adam.CPUAdamBuilder().load()'
[2024-06-14 14:24:07,747] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
Using /home/jxlab03/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Creating extension directory /home/jxlab03/.cache/torch_extensions/py310_cu118/cpu_adam...
Emitting ninja build file /home/jxlab03/.cache/torch_extensions/py310_cu118/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/include -isystem /home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/include/TH -isystem /home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/include/THC -isystem /home/jxlab03/anaconda3/envs/minicpm/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -DAVX512 -DENABLE_CUDA -DBF16_AVAILABLE -c /home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[2/3] c++ -MMD -MF cpu_adam_impl.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/include -isystem /home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/include/TH -isystem /home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/include/THC -isystem /home/jxlab03/anaconda3/envs/minicpm/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -DAVX512 -DENABLE_CUDA -DBF16_AVAILABLE -c /home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam_impl.cpp -o cpu_adam_impl.o
[3/3] c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand -L/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
FAILED: cpu_adam.so
c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand -L/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
/usr/bin/ld: cannot find -lcurand
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
    subprocess.run(
  File "/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 508, in load
    return self.jit_load(verbose)
  File "/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 555, in jit_load
    op_module = load(name=self.name,
  File "/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'cpu_adam'

But when I run "ldconfig -p | grep libcurand" in a terminal, I can see libcurand.so:

libcurand.so.10 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcurand.so.10
libcurand.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcurand.so

The torch CUDA version and the nvcc version match; both are 11.8.

So I don't know why ninja cannot find -lcurand.
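
One thing that may explain the mismatch: "ldconfig -p" lists the cache used by the runtime loader, while ld resolves -lcurand at link time from the -L directories on the command line, the directories in LIBRARY_PATH (when invoked through gcc/c++), and its built-in defaults. In the log above, -L/usr/local/cuda/lib64 appears only in the compile steps [1/3] and [2/3]; the [3/3] link command has a single -L, pointing at the torch lib folder. The following is a minimal diagnostic sketch, not part of DeepSpeed: the helper names link_search_dirs and find_library are hypothetical, and it deliberately ignores the linker's built-in default directories.

```python
import os
import shlex


def link_search_dirs(link_cmd, env=None):
    """Collect directories named by -L flags in link_cmd, then LIBRARY_PATH.

    Note: the linker's built-in default directories are NOT included.
    """
    env = os.environ if env is None else env
    tokens = shlex.split(link_cmd)
    dirs = []
    for i, tok in enumerate(tokens):
        if tok == "-L" and i + 1 < len(tokens):
            dirs.append(tokens[i + 1])      # form: -L <dir>
        elif tok.startswith("-L") and len(tok) > 2:
            dirs.append(tok[2:])            # form: -L<dir>
    dirs += [d for d in env.get("LIBRARY_PATH", "").split(os.pathsep) if d]
    return dirs


def find_library(name, dirs):
    """Return every lib<name>.so or lib<name>.a that exists in dirs."""
    hits = []
    for d in dirs:
        for suffix in (".so", ".a"):
            candidate = os.path.join(d, f"lib{name}{suffix}")
            if os.path.exists(candidate):
                hits.append(candidate)
    return hits


# The failing [3/3] link command, copied from the log above.
LINK_CMD = (
    "c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand "
    "-L/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/lib "
    "-lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so"
)

if __name__ == "__main__":
    dirs = link_search_dirs(LINK_CMD)
    print("explicit link-time search dirs:", dirs)
    print("libcurand candidates found:", find_library("curand", dirs))
```

If the second list comes back empty on the affected machine, that would be consistent with the error: the directory reported by ldconfig is simply not among the places the link step looks.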

Mr-lonely0 commented 1 week ago

Same problem. Have you solved it?
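
A possible workaround to try, offered as an untested sketch rather than a confirmed fix: gcc-style drivers such as c++ add the directories in the LIBRARY_PATH environment variable to the link-time search path, so putting the directory that holds libcurand.so there before triggering the JIT build may let the link step find it. The directory below is the one reported by "ldconfig -p" in the original post; the helper names prepend_library_path and load_cpu_adam are hypothetical, and the builder call is the same one used in the report.

```python
import os

# Directory taken from the `ldconfig -p` output in the original post.
CUDA_LIB_DIR = "/usr/local/cuda-11.8/targets/x86_64-linux/lib"


def prepend_library_path(lib_dir, env):
    """Put lib_dir at the front of LIBRARY_PATH in env, skipping duplicates."""
    current = [d for d in env.get("LIBRARY_PATH", "").split(os.pathsep) if d]
    if lib_dir not in current:
        current.insert(0, lib_dir)
    env["LIBRARY_PATH"] = os.pathsep.join(current)
    return env["LIBRARY_PATH"]


def load_cpu_adam(lib_dir=CUDA_LIB_DIR):
    """Set LIBRARY_PATH, then run the same builder call as in the report.

    Assumption: the ninja and c++ child processes inherit this process's
    environment, so the link step sees the updated LIBRARY_PATH.
    """
    prepend_library_path(lib_dir, os.environ)
    import deepspeed  # imported lazily so the helper above works without it
    return deepspeed.ops.adam.cpu_adam.CPUAdamBuilder().load()
```

Calling load_cpu_adam() in the affected environment would retry the build with the extra search directory. Exporting LIBRARY_PATH in the shell before running the original python -c command should have the same effect.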