microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

Something goes wrong when running the "aio_" and "gds_" files (DeepNVMe) #6567

Closed niebowen666 closed 2 weeks ago

niebowen666 commented 1 month ago

Describe the bug
I couldn't run the DeepNVMe demo properly. It shows:

collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2105, in _run_ninja_build
    subprocess.run(
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

It seems something is wrong with ninja; the "build.ninja" file appears to be missing. Has anyone else run into this?

ds_report output

[2024-09-24 17:35:01,773] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] NVIDIA Inference is only supported on Pascal and newer architectures
[WARNING] NVIDIA Inference is only supported on Pascal and newer architectures
[WARNING] NVIDIA Inference is only supported on Pascal and newer architectures
[WARNING] NVIDIA Inference is only supported on Pascal and newer architectures
[WARNING] NVIDIA Inference is only supported on Pascal and newer architectures

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
[WARNING] NVIDIA Inference is only supported on Ampere and newer architectures
fp_quantizer ........... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
gds .................... [NO] ....... [OKAY]
[WARNING] NVIDIA Inference is only supported on Pascal and newer architectures
inference_core_ops ..... [NO] ....... [NO]
[WARNING] NVIDIA Inference is only supported on Pascal and newer architectures
cutlass_ops ............ [NO] ....... [NO]
[WARNING] NVIDIA Inference is only supported on Pascal and newer architectures
transformer_inference .. [NO] ....... [NO]
quantizer .............. [NO] ....... [OKAY]
[WARNING] NVIDIA Inference is only supported on Pascal and newer architectures
ragged_device_ops ...... [NO] ....... [NO]
[WARNING] NVIDIA Inference is only supported on Pascal and newer architectures
ragged_ops ............. [NO] ....... [NO]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
[WARNING] using untested triton version (3.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch']
torch version .................... 2.4.1+cu121
deepspeed install path ........... ['/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.15.1, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.4, cuda 12.1
shared memory (/dev/shm) size .... 125.75 GB

System info (please complete the following information):

conda list

Name Version Build Channel

_libgcc_mutex 0.1 conda_forge conda-forge _openmp_mutex 4.5 2_gnu conda-forge _sysroot_linux-64_curr_repodata_hack 3 h69a702a_16 conda-forge annotated-types 0.7.0 pypi_0 pypi binutils_impl_linux-64 2.40 ha1999f0_7 conda-forge binutils_linux-64 2.40 hb3c18ed_3 conda-forge bzip2 1.0.8 h4bc722e_7 conda-forge c-ares 1.19.1 h5eee18b_0 anaconda ca-certificates 2024.8.30 hbcca054_0 conda-forge cmake 3.26.4 h96355d8_0 anaconda cuda 12.1.0 0 nvidia/label/cuda-12.1.0 cuda-cccl 12.1.55 0 nvidia/label/cuda-12.1.0 cuda-command-line-tools 12.1.0 0 nvidia/label/cuda-12.1.0 cuda-compiler 12.1.0 0 nvidia/label/cuda-12.1.0 cuda-cudart 12.1.55 0 nvidia/label/cuda-12.1.0 cuda-cudart-dev 12.1.55 0 nvidia/label/cuda-12.1.0 cuda-cudart-static 12.1.55 0 nvidia/label/cuda-12.1.0 cuda-cuobjdump 12.1.55 0 nvidia/label/cuda-12.1.0 cuda-cupti 12.1.62 0 nvidia/label/cuda-12.1.0 cuda-cupti-static 12.1.62 0 nvidia/label/cuda-12.1.0 cuda-cuxxfilt 12.1.55 0 nvidia/label/cuda-12.1.0 cuda-demo-suite 12.1.55 0 nvidia/label/cuda-12.1.0 cuda-documentation 12.1.55 0 nvidia/label/cuda-12.1.0 cuda-driver-dev 12.1.55 0 nvidia/label/cuda-12.1.0 cuda-gdb 12.1.55 0 nvidia/label/cuda-12.1.0 cuda-libraries 12.1.0 0 nvidia/label/cuda-12.1.0 cuda-libraries-dev 12.1.0 0 nvidia/label/cuda-12.1.0 cuda-libraries-static 12.1.0 0 nvidia/label/cuda-12.1.0 cuda-nsight 12.1.55 0 nvidia/label/cuda-12.1.0 cuda-nsight-compute 12.1.0 0 nvidia/label/cuda-12.1.0 cuda-nvcc 12.1.105 0 nvidia cuda-nvdisasm 12.1.55 0 nvidia/label/cuda-12.1.0 cuda-nvml-dev 12.1.55 0 nvidia/label/cuda-12.1.0 cuda-nvprof 12.1.55 0 nvidia/label/cuda-12.1.0 cuda-nvprune 12.1.55 0 nvidia/label/cuda-12.1.0 cuda-nvrtc 12.1.55 0 nvidia/label/cuda-12.1.0 cuda-nvrtc-dev 12.1.55 0 nvidia/label/cuda-12.1.0 cuda-nvrtc-static 12.1.55 0 nvidia/label/cuda-12.1.0 cuda-nvtx 12.1.66 0 nvidia/label/cuda-12.1.0 cuda-nvvp 12.1.55 0 nvidia/label/cuda-12.1.0 cuda-opencl 12.1.56 0 nvidia/label/cuda-12.1.0 cuda-opencl-dev 12.1.56 0 nvidia/label/cuda-12.1.0 cuda-profiler-api 12.1.55 0 nvidia/label/cuda-12.1.0 cuda-runtime 12.1.0 0 nvidia/label/cuda-12.1.0 cuda-sanitizer-api 12.1.55 0 nvidia/label/cuda-12.1.0 cuda-toolkit 12.1.0 0 nvidia/label/cuda-12.1.0 cuda-tools 12.1.0 0 nvidia/label/cuda-12.1.0 cuda-visual-tools 12.1.0 0 nvidia/label/cuda-12.1.0 deepspeed 0.15.1 pypi_0 pypi expat 2.6.3 h6a678d5_0 anaconda filelock 3.16.1 pypi_0 pypi fsspec 2024.9.0 pypi_0 pypi gcc 14.1.0 h6f9ffa1_1 conda-forge gcc_impl_linux-64 14.1.0 h3c94d91_1 conda-forge gcc_linux-64 14.1.0 h3f71edc_3 conda-forge gds-tools 1.6.0.25 0 nvidia/label/cuda-12.1.0 gxx 14.1.0 h6f9ffa1_1 conda-forge gxx_impl_linux-64 14.1.0 h8d00ecb_1 conda-forge gxx_linux-64 14.1.0 hc55ae77_3 conda-forge hjson 3.1.0 pypi_0 pypi jinja2 3.1.4 pypi_0 pypi kernel-headers_linux-64 3.10.0 h4a8ded7_16 conda-forge krb5 1.20.1 h143b758_1 anaconda ld_impl_linux-64 2.40 hf3520f5_7 conda-forge libcublas 12.1.0.26 0 nvidia/label/cuda-12.1.0 libcublas-dev 12.1.0.26 0 nvidia/label/cuda-12.1.0 libcublas-static 12.1.0.26 0 nvidia/label/cuda-12.1.0 libcufft 11.0.2.4 0 nvidia/label/cuda-12.1.0 libcufft-dev 11.0.2.4 0 nvidia/label/cuda-12.1.0 libcufft-static 11.0.2.4 0 nvidia/label/cuda-12.1.0 libcufile 1.6.0.25 0 nvidia/label/cuda-12.1.0 libcufile-dev 1.6.0.25 0 nvidia/label/cuda-12.1.0 libcufile-static 1.6.0.25 0 nvidia/label/cuda-12.1.0 libcurand 10.3.2.56 0 nvidia/label/cuda-12.1.0 libcurand-dev 10.3.2.56 0 nvidia/label/cuda-12.1.0 libcurand-static 10.3.2.56 0 nvidia/label/cuda-12.1.0 libcurl 7.88.1 h251f7ec_2 anaconda libcusolver 11.4.4.55 0 nvidia/label/cuda-12.1.0 
libcusolver-dev 11.4.4.55 0 nvidia/label/cuda-12.1.0 libcusolver-static 11.4.4.55 0 nvidia/label/cuda-12.1.0 libcusparse 12.0.2.55 0 nvidia/label/cuda-12.1.0 libcusparse-dev 12.0.2.55 0 nvidia/label/cuda-12.1.0 libcusparse-static 12.0.2.55 0 nvidia/label/cuda-12.1.0 libedit 3.1.20230828 h5eee18b_0 anaconda libev 4.33 h7f8727e_1 anaconda libffi 3.4.2 h7f98852_5 conda-forge libgcc 14.1.0 h77fa898_1 conda-forge libgcc-devel_linux-64 14.1.0 h5d3d1c9_101 conda-forge libgcc-ng 14.1.0 h69a702a_1 conda-forge libgomp 14.1.0 h77fa898_1 conda-forge libnghttp2 1.57.0 h2d74bed_0 anaconda libnpp 12.0.2.50 0 nvidia/label/cuda-12.1.0 libnpp-dev 12.0.2.50 0 nvidia/label/cuda-12.1.0 libnpp-static 12.0.2.50 0 nvidia/label/cuda-12.1.0 libnsl 2.0.1 hd590300_0 conda-forge libnvjitlink 12.1.55 0 nvidia/label/cuda-12.1.0 libnvjitlink-dev 12.1.55 0 nvidia/label/cuda-12.1.0 libnvjpeg 12.1.0.39 0 nvidia/label/cuda-12.1.0 libnvjpeg-dev 12.1.0.39 0 nvidia/label/cuda-12.1.0 libnvjpeg-static 12.1.0.39 0 nvidia/label/cuda-12.1.0 libnvvm-samples 12.1.55 0 nvidia/label/cuda-12.1.0 libsanitizer 14.1.0 hcba0ae0_1 conda-forge libsqlite 3.45.2 h2797004_0 conda-forge libssh2 1.11.0 h251f7ec_0 anaconda libstdcxx 14.1.0 hc0a3c3a_1 conda-forge libstdcxx-devel_linux-64 14.1.0 h5d3d1c9_101 conda-forge libstdcxx-ng 14.1.0 h4852527_1 conda-forge libuuid 2.38.1 h0b41bf4_0 conda-forge libuv 1.48.0 h5eee18b_0 anaconda libxcrypt 4.4.36 hd590300_1 conda-forge libzlib 1.2.13 h4ab18f5_6 conda-forge lz4-c 1.9.4 h6a678d5_1 anaconda markupsafe 2.1.5 pypi_0 pypi mpmath 1.3.0 pypi_0 pypi ncurses 6.4 h6a678d5_0 http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main networkx 3.2.1 pypi_0 pypi ninja 1.11.1.1 pypi_0 pypi nsight-compute 2023.1.0.15 0 nvidia/label/cuda-12.1.0 numpy 2.0.2 pypi_0 pypi nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi nvidia-curand-cu12 10.3.2.106 pypi_0 pypi nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi nvidia-ml-py 12.560.30 pypi_0 pypi nvidia-nccl-cu12 2.20.5 pypi_0 pypi nvidia-nvjitlink-cu12 12.6.68 pypi_0 pypi nvidia-nvtx-cu12 12.1.105 pypi_0 pypi openssl 3.3.2 hb9d3cd8_0 conda-forge packaging 24.1 pypi_0 pypi pillow 10.4.0 pypi_0 pypi pip 24.2 py39h06a4308_0 http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main psutil 6.0.0 pypi_0 pypi py-cpuinfo 9.0.0 pypi_0 pypi pydantic 2.9.2 pypi_0 pypi pydantic-core 2.23.4 pypi_0 pypi python 3.9.18 h0755675_1_cpython conda-forge readline 8.2 h5eee18b_0 http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main rhash 1.4.3 hdbd6064_0 anaconda setuptools 75.1.0 py39h06a4308_0 http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main sqlite 3.45.2 h2c6b66d_0 conda-forge sympy 1.13.3 pypi_0 pypi sysroot_linux-64 2.17 h4a8ded7_16 conda-forge tk 8.6.13 noxft_h4845f30_101 conda-forge torch 2.4.1 pypi_0 pypi torchaudio 2.4.1 pypi_0 pypi torchvision 0.19.1 pypi_0 pypi tqdm 4.66.5 pypi_0 pypi triton 3.0.0 pypi_0 pypi typing-extensions 4.12.2 pypi_0 pypi tzdata 2024a h04d1e81_0 http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main wheel 0.44.0 py39h06a4308_0 http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main xz 5.4.6 h5eee18b_1 http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main zlib 1.2.13 h4ab18f5_6 conda-forge zstd 1.5.5 hc292b87_2 anaconda

jomayeri commented 1 month ago

What command did you run?

niebowen666 commented 1 month ago

The commands are "python aio_store_cpu_tensor.py --nvme_folder tensor/" and "python gds_store_gpu_tensor.py --nvme_folder tensor/".
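Both demos JIT-compile their DeepSpeed op on first use, so the host needs ninja, the libaio development headers, and a complete CUDA toolkit. A rough pre-flight check, assuming an Ubuntu-style system (package names may differ elsewhere):

ninja --version                # torch's JIT extension builder requires ninja
dpkg -l | grep libaio          # the async_io op links against -laio
ls /usr/local/cuda*/bin/nvcc   # the build also needs a full toolkit with nvcc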

niebowen666 commented 1 month ago

@jomayeri

niebowen666 commented 1 month ago

The error information:

Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/1] /root/anaconda3/envs/deepspeed/bin/x86_64-conda-linux-gnu-c++ deepspeed_py_io_handle.o deepspeed_py_aio.o deepspeed_py_aio_handle.o deepspeed_aio_thread.o deepspeed_aio_utils.o deepspeed_aio_common.o deepspeed_aio_types.o deepspeed_cpu_op.o deepspeed_aio_op_desc.o deepspeed_py_copy.o deepspeed_pin_tensor.o py_ds_aio.o -shared -L/root/anaconda3/envs/deepspeed -L/root/anaconda3/envs/deepspeed/lib64 -laio -lcuda -lcudart -L/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o async_io.so
FAILED: async_io.so
/root/anaconda3/envs/deepspeed/bin/x86_64-conda-linux-gnu-c++ deepspeed_py_io_handle.o deepspeed_py_aio.o deepspeed_py_aio_handle.o deepspeed_aio_thread.o deepspeed_aio_utils.o deepspeed_aio_common.o deepspeed_aio_types.o deepspeed_cpu_op.o deepspeed_aio_op_desc.o deepspeed_py_copy.o deepspeed_pin_tensor.o py_ds_aio.o -shared -L/root/anaconda3/envs/deepspeed -L/root/anaconda3/envs/deepspeed/lib64 -laio -lcuda -lcudart -L/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o async_io.so
/root/anaconda3/envs/deepspeed/bin/../lib/gcc/x86_64-conda-linux-gnu/14.1.0/../../../../x86_64-conda-linux-gnu/bin/ld: cannot find -lcuda: No such file or directory
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2105, in _run_ninja_build
    subprocess.run(
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/nbw/DeepNVMe_demo/aio_store_cpu_tensor.py", line 40, in <module>
    main()
  File "/home/nbw/DeepNVMe_demo/aio_store_cpu_tensor.py", line 18, in main
    aio_handle = AsyncIOBuilder().load().aio_handle()
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 531, in load
    return self.jit_load(verbose)
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 578, in jit_load
    op_module = load(name=self.name,
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1312, in load
    return _jit_compile(
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1722, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1834, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2121, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'async_io'

jomayeri commented 1 month ago

Based on the output, it looks like the compilation can't link to the CUDA library: cannot find -lcuda: No such file or directory. Your ds_report shows CUDA installed; you might try setting the CUDA_HOME environment variable to point to the location of the CUDA install and rebuilding.
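Something along these lines (a sketch; adjust the path to wherever the toolkit actually lives on your system):

export CUDA_HOME=/usr/local/cuda                        # must contain bin/nvcc and lib64/
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
python aio_store_cpu_tensor.py --nvme_folder tensor/    # the op is JIT-rebuilt on the next run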

niebowen666 commented 1 month ago

@jomayeri Since I installed DeepSpeed using Anaconda, I have set the CUDA_HOME environment variable to /root/anaconda3/envs/deepspeed/:

echo $CUDA_HOME
/root/anaconda3/envs/deepspeed/
(deepspeed is the name of the conda virtual environment)

Is that right?

Besides, I also set CUDNN_HOME to /root/anaconda3/envs/deepspeed/, but that didn't help either.

The error is:

FAILED: async_io.so
/root/anaconda3/envs/deepspeed/bin/x86_64-conda-linux-gnu-c++ deepspeed_py_io_handle.o deepspeed_py_aio.o deepspeed_py_aio_handle.o deepspeed_aio_thread.o deepspeed_aio_utils.o deepspeed_aio_common.o deepspeed_aio_types.o deepspeed_cpu_op.o deepspeed_aio_op_desc.o deepspeed_py_copy.o deepspeed_pin_tensor.o py_ds_aio.o -shared -L/root/anaconda3/envs/deepspeed/ -L/root/anaconda3/envs/deepspeed/lib64 -laio -lcuda -lcudart -L/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o async_io.so
/root/anaconda3/envs/deepspeed/bin/../lib/gcc/x86_64-conda-linux-gnu/14.1.0/../../../../x86_64-conda-linux-gnu/bin/ld: cannot find -lcuda: No such file or directory
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2105, in _run_ninja_build
    subprocess.run(
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/nbw/DeepNVMe_demo/aio_store_cpu_tensor.py", line 40, in <module>
    main()
  File "/home/nbw/DeepNVMe_demo/aio_store_cpu_tensor.py", line 18, in main
    aio_handle = AsyncIOBuilder().load().aio_handle()
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 531, in load
    return self.jit_load(verbose)
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 578, in jit_load
    op_module = load(name=self.name,
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1312, in load
    return _jit_compile(
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1722, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1834, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2121, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'async_io'

jomayeri commented 1 month ago

No, CUDA is not installed as part of DeepSpeed. Typically it is stored under /usr/local; you can run whereis cuda to find it. Do commands like nvidia-smi work on the system?
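A few quick checks to tell a complete toolkit apart from a partial install (a sketch; exact paths vary by distro):

whereis cuda
which nvcc && nvcc --version        # nvcc should live under the directory you point CUDA_HOME at
ls /usr/local/cuda*/bin/nvcc        # a full toolkit ships bin/, include/ and lib64/
nvidia-smi                          # confirms the driver itself is working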

niebowen666 commented 1 month ago

@jomayeri Yeah, whereis cuda does find a cuda directory:

whereis cuda
cuda: /usr/lib/cuda

Then I set CUDA_HOME and CUDNN_HOME to /usr/lib/cuda:

export CUDA_HOME=/usr/lib/cuda
export CUDNN_HOME=/usr/lib/cuda
source ~/.bashrc

echo $CUDA_HOME
/usr/lib/cuda
echo $CUDNN_HOME
/usr/lib/cuda

Finally, I ran the command python aio_store_cpu_tensor.py --nvme_folder tensor/ and it gave me a new error:

[2024-09-27 08:46:29,890] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/home/nbw/DeepNVMe_demo/aio_store_cpu_tensor.py", line 3, in <module>
    from deepspeed.ops.op_builder import AsyncIOBuilder
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/__init__.py", line 25, in <module>
    from . import ops
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/__init__.py", line 15, in <module>
    from ..git_version_info import compatible_ops as __compatible_ops__
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/git_version_info.py", line 29, in <module>
    op_compatible = builder.is_compatible()
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/op_builder/fp_quantizer.py", line 35, in is_compatible
    sys_cuda_major, _ = installed_cuda_version()
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 53, in installed_cuda_version
    output = subprocess.check_output([cuda_home + "/bin/nvcc", "-V"], universal_newlines=True)
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 505, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 1837, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/lib/cuda/bin/nvcc'

It seems there is no nvcc in the "cuda" directory?

/usr/lib/cuda
bin  include  lib64  nvvm  version.txt
ls /usr/lib/cuda/bin
(nothing - the directory is empty)

Besides, I also found a cuda directory in /usr/local:

ls /usr/local
bin  cuda-12.1  etc  games  include  kernelobjects  lib  man  mysql  pgsql  sbin  share  src  ssl

I reset the CUDA_HOME and CUDNN_HOME environment variables to /usr/local/cuda-12.1 and ran python aio_store_cpu_tensor.py --nvme_folder tensor/. The same error happened:

[2024-09-27 09:06:52,673] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/home/nbw/DeepNVMe_demo/aio_store_cpu_tensor.py", line 3, in <module>
    from deepspeed.ops.op_builder import AsyncIOBuilder
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/__init__.py", line 25, in <module>
    from . import ops
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/__init__.py", line 15, in <module>
    from ..git_version_info import compatible_ops as __compatible_ops__
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/git_version_info.py", line 29, in <module>
    op_compatible = builder.is_compatible()
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/op_builder/fp_quantizer.py", line 35, in is_compatible
    sys_cuda_major, _ = installed_cuda_version()
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 53, in installed_cuda_version
    output = subprocess.check_output([cuda_home + "/bin/nvcc", "-V"], universal_newlines=True)
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 505, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 1837, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/cuda-12.1//bin/nvcc'

It also affects ds_report:

ds_report

[2024-09-27 09:05:40,449] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/root/anaconda3/envs/deepspeed/bin/ds_report", line 3, in <module>
    from deepspeed.env_report import cli_main
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/__init__.py", line 25, in <module>
    from . import ops
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/__init__.py", line 15, in <module>
    from ..git_version_info import compatible_ops as __compatible_ops__
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/git_version_info.py", line 29, in <module>
    op_compatible = builder.is_compatible()
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/op_builder/fp_quantizer.py", line 35, in is_compatible
    sys_cuda_major, _ = installed_cuda_version()
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 53, in installed_cuda_version
    output = subprocess.check_output([cuda_home + "/bin/nvcc", "-V"], universal_newlines=True)
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 505, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 1837, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/cuda-12.1//bin/nvcc'

My nvcc information is below:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
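
This nvcc is most likely the one provided by the conda cuda-nvcc package on PATH, not one under $CUDA_HOME (which is the path DeepSpeed's builder actually invokes, per the traceback above). A quick way to confirm, as a sketch:

which nvcc              # likely resolves inside the conda env rather than under $CUDA_HOME
ls $CUDA_HOME/bin/nvcc  # DeepSpeed's installed_cuda_version() calls exactly this path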

nvidia-smi

Fri Sep 27 09:09:58 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03              Driver Version: 560.28.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX TITAN X     Off |   00000000:86:00.0 Off |                  N/A |
| 18%   47C    P0             44W /  250W |       0MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

niebowen666 commented 1 month ago

CUDA is not installed properly, right?

ls /usr/local/cuda-12.1/
nsight-compute-2023.1.0  nsight-systems-2023.1.2  nvvm  targets

niebowen666 commented 1 month ago

ls /usr/lib/cuda/
bin  include  lib64  nvvm  version.txt
ls /usr/lib/cuda/bin
(empty)
ls /usr/lib/cuda/include/
(empty)
ls /usr/lib/cuda/lib64/
(empty)
ls /usr/lib/cuda/nvvm/
libdevice

jomayeri commented 1 month ago

Yes, it looks like you should reinstall CUDA.
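If it helps, one common route is the toolkit metapackage (a sketch for Ubuntu, assuming NVIDIA's apt repository is already configured; the .run installer works too):

sudo apt-get install -y cuda-toolkit-12-1   # toolkit only, leaves the existing driver alone
ls /usr/local/cuda-12.1/bin/nvcc            # afterwards this file should exist
export CUDA_HOME=/usr/local/cuda-12.1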

niebowen666 commented 1 month ago

It seems that I have reinstalled CUDA properly:

ls /usr/local/
bin  cuda  cuda-12.1  etc  games  include  kernelobjects  lib  man  mysql  pgsql  sbin  share  src  ssl
ls /usr/local/cuda-12.1/
bin  DOCS  extras  gds-12.1  lib64  nsight-compute-2023.1.0  nsight-systems-2023.1.2  nvvm  share  targets  version.json
compute-sanitizer  EULA.txt  gds  include  libnvvp  nsightee_plugins  nvml  README  src  tools

And I have configured the CUDA_HOME successfully, too.

echo $CUDA_HOME
/usr/local/cuda-12.1

But I still got the error:

FAILED: async_io.so
/root/anaconda3/envs/deepspeed/bin/x86_64-conda-linux-gnu-c++ deepspeed_py_io_handle.o deepspeed_py_aio.o deepspeed_py_aio_handle.o deepspeed_aio_thread.o deepspeed_aio_utils.o deepspeed_aio_common.o deepspeed_aio_types.o deepspeed_cpu_op.o deepspeed_aio_op_desc.o deepspeed_py_copy.o deepspeed_pin_tensor.o py_ds_aio.o -shared -L/usr/local/cuda-12.1 -L/usr/local/cuda-12.1/lib64 -laio -lcuda -lcudart -L/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o async_io.so
/root/anaconda3/envs/deepspeed/bin/../lib/gcc/x86_64-conda-linux-gnu/14.1.0/../../../../x86_64-conda-linux-gnu/bin/ld: cannot find -lcuda: No such file or directory
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2105, in _run_ninja_build
    subprocess.run(
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/nbw/DeepNVMe_demo/aio_store_cpu_tensor.py", line 40, in <module>
    main()
  File "/home/nbw/DeepNVMe_demo/aio_store_cpu_tensor.py", line 18, in main
    aio_handle = AsyncIOBuilder().load().aio_handle()
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 531, in load
    return self.jit_load(verbose)
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 578, in jit_load
    op_module = load(name=self.name,
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1312, in load
    return _jit_compile(
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1722, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1834, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2121, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'async_io'

I find that lib64 under /usr/local/cuda-12.1/ is a symlink; is that right?

ll /usr/local/cuda-12.1/
total 144
drwxr-xr-x 17 root root  4096 10月  9 10:35 ./
drwxr-xr-x 15 root root  4096 10月  9 10:34 ../
drwxr-xr-x  3 root root  4096 10月  9 10:35 bin/
drwxr-xr-x  5 root root  4096 10月  9 10:34 compute-sanitizer/
-rw-r--r--  1 root root   160 10月  9 10:35 DOCS
-rw-r--r--  1 root root 61498 10月  9 10:35 EULA.txt
drwxr-xr-x  5 root root  4096 10月  9 10:35 extras/
drwxr-xr-x  6 root root  4096 10月  9 10:34 gds/
drwxr-xr-x  2 root root  4096 10月  9 10:34 gds-12.1/
lrwxrwxrwx  1 root root    28 10月  9 10:35 include -> targets/x86_64-linux/include/
lrwxrwxrwx  1 root root    24 10月  9 10:35 lib64 -> targets/x86_64-linux/lib/
drwxr-xr-x  7 root root  4096 10月  9 10:35 libnvvp/
drwxr-xr-x  7 root root  4096 10月  9 10:35 nsight-compute-2023.1.0/
drwxr-xr-x  2 root root  4096 10月  9 10:34 nsightee_plugins/
drwxr-xr-x  6 root root  4096 10月  9 10:35 nsight-systems-2023.1.2/
drwxr-xr-x  3 root root  4096 10月  9 10:34 nvml/
drwxr-xr-x  7 root root  4096 10月  9 10:35 nvvm/
-rw-r--r--  1 root root   524 10月  9 10:35 README
drwxr-xr-x  3 root root  4096 10月  9 10:34 share/
drwxr-xr-x  2 root root  4096 10月  9 10:34 src/
drwxr-xr-x  3 root root  4096 10月  9 10:34 targets/
drwxr-xr-x  2 root root  4096 10月  9 10:35 tools/
-rw-r--r--  1 root root  2928 10月  9 10:34 version.json

Do you know how to solve the issue? Thank you! @jomayeri

jomayeri commented 1 month ago

It still cannot find CUDA, based on this error: /root/anaconda3/envs/deepspeed/bin/../lib/gcc/x86_64-conda-linux-gnu/14.1.0/../../../../x86_64-conda-linux-gnu/bin/ld: cannot find -lcuda

The linker appears to be in the Anaconda environment; is CUDA in the Anaconda environment?

niebowen666 commented 1 month ago

Yeah, I have configured my environment in Anaconda. But currently the linker is pointed at the path /usr/local/cuda-12.1/targets/x86_64-linux/lib/.

Should I change the CUDA_HOME configuration?

And I have tried running ln -s /root/anaconda3/envs/deepspeed/lib lib64 and ln -s /root/anaconda3/envs/deepspeed/include include to redirect the linker, but it still doesn't help.

jomayeri commented 3 weeks ago

I'm not familiar with Anaconda best practices, but you should ensure the linker has the correct path to CUDA.
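
For reference, -lcuda refers to libcuda.so, which is shipped by the NVIDIA driver rather than by the toolkit's lib64, so it is often not on the search path of a conda-provided linker. A common workaround (a sketch; the stub path assumes a standard toolkit layout) is to point the build at the toolkit's stub library, or at wherever the driver copy lives:

find / -name 'libcuda.so*' 2>/dev/null                      # locate the driver library and/or stub
export LIBRARY_PATH=/usr/local/cuda-12.1/lib64/stubs:$LIBRARY_PATH
python aio_store_cpu_tensor.py --nvme_folder tensor/        # re-run so the op is JIT-rebuilt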