python -c 'import deepspeed; deepspeed.ops.adam.cpu_adam.CPUAdamBuilder().load()'
[2024-06-14 14:24:07,747] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
Using /home/jxlab03/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Creating extension directory /home/jxlab03/.cache/torch_extensions/py310_cu118/cpu_adam...
Emitting ninja build file /home/jxlab03/.cache/torch_extensions/py310_cu118/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/include -isystem /home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/include/TH -isystem /home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/include/THC -isystem /home/jxlab03/anaconda3/envs/minicpm/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -DAVX512 -DENABLE_CUDA -DBF16_AVAILABLE -c /home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[2/3] c++ -MMD -MF cpu_adam_impl.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -isystem /home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/include -isystem /home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/include/TH -isystem /home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/include/THC -isystem /home/jxlab03/anaconda3/envs/minicpm/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -DAVX512 -DENABLE_CUDA -DBF16_AVAILABLE -c /home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/cpu_adam_impl.cpp -o cpu_adam_impl.o
[3/3] c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand -L/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
FAILED: cpu_adam.so
c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand -L/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
/usr/bin/ld: cannot find -lcurand
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
subprocess.run(
File "/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 508, in load
return self.jit_load(verbose)
File "/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 555, in jit_load
op_module = load(name=self.name,
File "/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
_write_ninja_file_and_build_library(
File "/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'cpu_adam'
But when I run "ldconfig -p | grep libcurand" in a terminal, I can see libcurand.so:
libcurand.so.10 (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcurand.so.10
libcurand.so (libc6,x86-64) => /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcurand.so
The torch CUDA version and the nvcc version match; both are 11.8.
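For reference, here is a minimal sketch of how the failing link line could be checked. As far as I understand, "ldconfig -p" lists the runtime loader cache, which is not necessarily the set of directories that ld searches when it resolves -l at link time, and the [3/3] command above only passes the torch/lib directory via -L. The helper names link_dirs and find_library are hypothetical (they are not part of DeepSpeed or PyTorch), and the sketch only inspects explicit -L directories, not the linker's built-in defaults or LIBRARY_PATH.

```python
import os
import shlex


def link_dirs(link_command):
    """Return the directories passed via -L in a linker command line."""
    dirs = []
    tokens = shlex.split(link_command)
    for i, tok in enumerate(tokens):
        if tok == "-L" and i + 1 < len(tokens):
            # Form: -L <dir>
            dirs.append(tokens[i + 1])
        elif tok.startswith("-L") and len(tok) > 2:
            # Form: -L<dir>
            dirs.append(tok[2:])
    return dirs


def find_library(name, dirs):
    """Return the directories that contain lib<name>.so or lib<name>.a."""
    hits = []
    for d in dirs:
        for suffix in (".so", ".a"):
            if os.path.isfile(os.path.join(d, "lib" + name + suffix)):
                hits.append(d)
                break
    return hits


if __name__ == "__main__":
    # Link command copied from step [3/3] of the log above.
    cmd = (
        "c++ cpu_adam.o cpu_adam_impl.o -shared -lcurand "
        "-L/home/jxlab03/anaconda3/envs/minicpm/lib/python3.10/site-packages/torch/lib "
        "-lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so"
    )
    searched = link_dirs(cmd)
    print("explicit -L directories:", searched)
    print("directories containing libcurand:", find_library("curand", searched))
```

If the second list comes back empty, that would at least be consistent with the linker error, even though ldconfig can see the library.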
So I don't know why ninja cannot find -lcurand.