microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] oneapi/ccl.hpp: No such file or directory. #5653

Open weiji14 opened 2 weeks ago

weiji14 commented 2 weeks ago

Describe the bug

The conda-forge builds have been failing since deepspeed=0.14.1 for CUDA 11.8 and 12.0 with the error "fatal error: oneapi/ccl.hpp: No such file or directory". Originally reported at https://github.com/conda-forge/deepspeed-feedstock/pull/56#issuecomment-2062611899.

To Reproduce

Steps to reproduce the behavior:

  1. Go to https://github.com/conda-forge/deepspeed-feedstock/pull/57 and clone the branch
  2. Run python build_locally.py and select the option with CUDA 11.8 and Python 3.9
  3. See error below

Expected behavior

CUDA builds work as expected.

ds_report output

Note: this isn't the exact report from the conda-forge CI machine; I copied it from the CPU build logs.

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [FAIL]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
deepspeed_not_implemented  [NO] ....... [OKAY]
deepspeed_ccl_comm ..... [NO] ....... [OKAY]
deepspeed_shm_comm ..... [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['$PREFIX/lib/python3.9/site-packages/torch']
torch version .................... 2.3.0.post101
deepspeed install path ........... ['$PREFIX/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.14.3, f492cfc, HEAD
deepspeed wheel compiled w. ...... torch 0.0 
shared memory (/dev/shm) size .... 64.00 MB

Screenshots

Truncated traceback from https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=953875&view=logs&j=bb1c2637-64c6-57bd-9ea6-93823b2df951&t=350df31b-3291-5209-0bb7-031395f0baa1&l=3486:

2024-06-12T22:49:04.3043574Z   building 'deepspeed.ops.comm.deepspeed_ccl_comm_op' extension
2024-06-12T22:49:04.3043994Z   creating build/temp.linux-x86_64-cpython-39
2024-06-12T22:49:04.3053293Z   creating build/temp.linux-x86_64-cpython-39/csrc
2024-06-12T22:49:04.3054113Z   creating build/temp.linux-x86_64-cpython-39/csrc/cpu
2024-06-12T22:49:04.3054729Z   creating build/temp.linux-x86_64-cpython-39/csrc/cpu/comm
2024-06-12T22:49:04.3071806Z   /home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/_build_env/bin/x86_64-conda-linux-gnu-cc -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/include -fPIC -O2 -isystem /home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/include -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/include -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/work=/usr/local/src/conda/deepspeed-0.14.3 -fdebug-prefix-map=/home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p=/usr/local/src/conda-prefix -isystem /usr/local/cuda/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/include -isystem /usr/local/cuda/include -fPIC 
-I/home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/work/csrc/cpu/includes -I/home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.9/site-packages/torch/include -I/home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.9/site-packages/torch/include/TH -I/home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/lib/python3.9/site-packages/torch/include/THC -I/home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/include/python3.9 -c csrc/cpu/comm/ccl.cpp -o build/temp.linux-x86_64-cpython-39/csrc/cpu/comm/ccl.o -O2 -fopenmp -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -DTORCH_EXTENSION_NAME=deepspeed_ccl_comm_op -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++17
2024-06-12T22:49:08.2062484Z   csrc/cpu/comm/ccl.cpp:8:10: fatal error: oneapi/ccl.hpp: No such file or directory
2024-06-12T22:49:08.2067800Z       8 | #include <oneapi/ccl.hpp>
2024-06-12T22:49:08.2068222Z         |          ^~~~~~~~~~~~~~~~
2024-06-12T22:49:08.2068507Z   compilation terminated.
2024-06-12T22:49:08.2182741Z   error: command '/home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/_build_env/bin/x86_64-conda-linux-gnu-cc' failed with exit code 1
2024-06-12T22:49:08.6174937Z   error: subprocess-exited-with-error
2024-06-12T22:49:08.6176666Z   
2024-06-12T22:49:08.6188012Z   × python setup.py bdist_wheel did not run successfully.
2024-06-12T22:49:08.6227487Z   │ exit code: 1
2024-06-12T22:49:08.6240717Z   ╰─> See above for output.
2024-06-12T22:49:08.6252920Z   
2024-06-12T22:49:08.6264017Z   note: This error originates from a subprocess, and is likely not a problem with pip.
2024-06-12T22:49:08.6271330Z   full command: /home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_p/bin/python -u -c '
2024-06-12T22:49:08.6272043Z   exec(compile('"'"''"'"''"'"'
2024-06-12T22:49:08.6277838Z   # This is <pip-setuptools-caller> -- a caller that pip uses to run setup.py
2024-06-12T22:49:08.6283726Z   #
2024-06-12T22:49:08.6284428Z   # - It imports setuptools before invoking setup.py, to enable projects that directly
2024-06-12T22:49:08.6289287Z   #   import from `distutils.core` to work with newer packaging standards.
2024-06-12T22:49:08.6289949Z   # - It provides a clear error message when setuptools is not installed.
2024-06-12T22:49:08.6295383Z   # - It sets `sys.argv[0]` to the underlying `setup.py`, when invoking `setup.py` so
2024-06-12T22:49:08.6295837Z   #   setuptools doesn'"'"'t think the script is `-c`. This avoids the following warning:
2024-06-12T22:49:08.6301077Z   #     manifest_maker: standard file '"'"'-c'"'"' not found".
2024-06-12T22:49:08.6307069Z   # - It generates a shim setup.py, for handling setup.cfg-only projects.
2024-06-12T22:49:08.6307810Z   import os, sys, tokenize
2024-06-12T22:49:08.6314125Z   
2024-06-12T22:49:08.6314907Z   try:
2024-06-12T22:49:08.6320956Z       import setuptools
2024-06-12T22:49:08.6321316Z   except ImportError as error:
2024-06-12T22:49:08.6325049Z       print(
2024-06-12T22:49:08.6326023Z           "ERROR: Can not execute `setup.py` since setuptools is not available in "
2024-06-12T22:49:08.6335095Z           "the build environment.",
2024-06-12T22:49:08.6335348Z           file=sys.stderr,
2024-06-12T22:49:08.6338543Z       )
2024-06-12T22:49:08.6338832Z       sys.exit(1)
2024-06-12T22:49:08.6339045Z   
2024-06-12T22:49:08.6339554Z   __file__ = %r
2024-06-12T22:49:08.6340070Z   sys.argv[0] = __file__
2024-06-12T22:49:08.6340336Z   
2024-06-12T22:49:08.6340562Z   if os.path.exists(__file__):
2024-06-12T22:49:08.6340835Z       filename = __file__
2024-06-12T22:49:08.6341059Z       with tokenize.open(__file__) as f:
2024-06-12T22:49:08.6341411Z           setup_py_code = f.read()
2024-06-12T22:49:08.6341621Z   else:
2024-06-12T22:49:08.6341993Z       filename = "<auto-generated setuptools caller>"
2024-06-12T22:49:08.6342280Z       setup_py_code = "from setuptools import setup; setup()"
2024-06-12T22:49:08.6342576Z   
2024-06-12T22:49:08.6342895Z   exec(compile(setup_py_code, filename, "exec"))
2024-06-12T22:49:08.6343569Z   '"'"''"'"''"'"' % ('"'"'/home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/work/setup.py'"'"',), "<pip-setuptools-caller>", "exec"))' bdist_wheel -d /tmp/pip-wheel-v4pibtb1
2024-06-12T22:49:08.6344021Z   cwd: /home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/work/
2024-06-12T22:49:08.6344416Z   Building wheel for deepspeed (setup.py): finished with status 'error'
2024-06-12T22:49:08.6348857Z   ERROR: Failed building wheel for deepspeed
2024-06-12T22:49:08.6349126Z   Running setup.py clean for deepspeed
2024-06-12T22:49:08.6349635Z   Running command python setup.py clean
2024-06-12T22:49:10.9424015Z   No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
2024-06-12T22:49:10.9498275Z   [WARNING] Torch did not find cuda available, if cross-compiling or running with cpu only you can ignore this message. Adding compute capability for Pascal, Volta, and Turing (compute capabilities 6.0, 6.1, 6.2)
2024-06-12T22:49:10.9509931Z   DS_BUILD_OPS=1
2024-06-12T22:49:17.9894660Z   Install Ops={'deepspeed_not_implemented': 1, 'deepspeed_ccl_comm': 1, 'deepspeed_shm_comm': 1, 'cpu_adam': 1, 'fused_adam': 1}
2024-06-12T22:49:18.0269777Z   version=0.14.3, git_hash=f492cfc, git_branch=HEAD
2024-06-12T22:49:18.0270870Z   install_requires=['hjson', 'ninja', 'numpy', 'nvidia-ml-py', 'packaging>=20.0', 'psutil', 'py-cpuinfo', 'pydantic', 'torch', 'tqdm']
2024-06-12T22:49:18.0278897Z   ext_modules=[<setuptools.extension.Extension('deepspeed.ops.comm.deepspeed_not_implemented_op') at 0x7ff248dbe460>, <setuptools.extension.Extension('deepspeed.ops.comm.deepspeed_ccl_comm_op') at 0x7ff248dbe4c0>, <setuptools.extension.Extension('deepspeed.ops.comm.deepspeed_shm_comm_op') at 0x7ff248dbe520>, <setuptools.extension.Extension('deepspeed.ops.adam.cpu_adam_op') at 0x7ff16dd51b80>, <setuptools.extension.Extension('deepspeed.ops.adam.fused_adam_op') at 0x7ff16dd51d90>]
2024-06-12T22:49:18.0651351Z   running clean
2024-06-12T22:49:18.0714575Z   removing 'build/temp.linux-x86_64-cpython-39' (and everything under it)
2024-06-12T22:49:18.0715596Z   removing 'build/lib.linux-x86_64-cpython-39' (and everything under it)
2024-06-12T22:49:18.1133919Z   'build/bdist.linux-x86_64' does not exist -- can't clean it
2024-06-12T22:49:18.1143315Z   'build/scripts-3.9' does not exist -- can't clean it
2024-06-12T22:49:18.1151899Z   removing 'build'
2024-06-12T22:49:18.1171105Z   deepspeed build time = 0.08735942840576172 secs
2024-06-12T22:49:18.5956605Z Failed to build deepspeed
2024-06-12T22:49:18.5973299Z ERROR: Could not build wheels for deepspeed, which is required to install pyproject.toml-based projects
2024-06-12T22:49:18.5973660Z Exception information:
2024-06-12T22:49:18.5986009Z Traceback (most recent call last):
2024-06-12T22:49:18.5993394Z   File "$PREFIX/lib/python3.9/site-packages/pip/_internal/cli/base_command.py", line 180, in exc_logging_wrapper
2024-06-12T22:49:18.5993851Z     status = run_func(*args)
2024-06-12T22:49:18.5994316Z   File "$PREFIX/lib/python3.9/site-packages/pip/_internal/cli/req_command.py", line 245, in wrapper
2024-06-12T22:49:18.5995399Z     return func(self, options, args)
2024-06-12T22:49:18.5999462Z   File "$PREFIX/lib/python3.9/site-packages/pip/_internal/commands/install.py", line 429, in run
2024-06-12T22:49:18.5999708Z     raise InstallationError(
2024-06-12T22:49:18.6006202Z pip._internal.exceptions.InstallationError: Could not build wheels for deepspeed, which is required to install pyproject.toml-based projects
2024-06-12T22:49:18.6010801Z Removed build tracker: '/tmp/pip-build-tracker-gvbib0oo'
2024-06-12T22:49:20.4656669Z Traceback (most recent call last):
2024-06-12T22:49:20.4664375Z   File "/opt/conda/bin/conda-build", line 11, in <module>
2024-06-12T22:49:20.4669893Z     sys.exit(execute())
2024-06-12T22:49:20.4670437Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/cli/main_build.py", line 590, in execute
2024-06-12T22:49:20.4677725Z     api.build(
2024-06-12T22:49:20.4678886Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/api.py", line 250, in build
2024-06-12T22:49:20.4685860Z     return build_tree(
2024-06-12T22:49:20.4691479Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/build.py", line 3638, in build_tree
2024-06-12T22:49:20.4708481Z     packages_from_this = build(
2024-06-12T22:49:20.4713969Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/build.py", line 2506, in build
2024-06-12T22:49:20.4714313Z     utils.check_call_env(
2024-06-12T22:49:20.4724506Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/utils.py", line 405, in check_call_env
2024-06-12T22:49:20.4729616Z     return _func_defaulting_env_to_os_environ("call", *popenargs, **kwargs)
2024-06-12T22:49:20.4730205Z   File "/opt/conda/lib/python3.10/site-packages/conda_build/utils.py", line 381, in _func_defaulting_env_to_os_environ
2024-06-12T22:49:20.4735612Z     raise subprocess.CalledProcessError(proc.returncode, _args)
2024-06-12T22:49:20.4736400Z subprocess.CalledProcessError: Command '['/bin/bash', '-o', 'errexit', '/home/conda/feedstock_root/build_artifacts/deepspeed_1718232254780/work/conda_build.sh']' returned non-zero exit status 1.
2024-06-12T22:49:30.4784588Z 
2024-06-12T22:49:30.5793301Z ##[error]Bash exited with code '1'.
2024-06-12T22:49:30.5974127Z ##[section]Finishing: Run docker build
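
The fatal error at csrc/cpu/comm/ccl.cpp:8 means that none of the directories passed to the compiler with -I or -isystem contains oneapi/ccl.hpp. A minimal Python sketch for confirming this inside the build environment is shown below. It extracts the include directories from a compiler command line and reports which of them, if any, hold the header. The helper names include_dirs_from_command and find_header are hypothetical and are not part of DeepSpeed or conda-build, and the command in the usage section is a shortened stand-in for the real one.

```python
import os
import shlex


def include_dirs_from_command(command):
    """Collect directories passed as -I<dir>, -I <dir>, or -isystem <dir>."""
    tokens = shlex.split(command)
    dirs = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in ("-I", "-isystem") and i + 1 < len(tokens):
            # Flag and directory are separate tokens.
            dirs.append(tokens[i + 1])
            i += 2
            continue
        if tok.startswith("-I") and len(tok) > 2:
            # Flag and directory are fused, e.g. -I/path/to/include.
            dirs.append(tok[2:])
        i += 1
    # Drop duplicates while keeping the original search order.
    return list(dict.fromkeys(dirs))


def find_header(header, include_dirs):
    """Return the include directories that actually contain `header`."""
    return [d for d in include_dirs if os.path.isfile(os.path.join(d, header))]


if __name__ == "__main__":
    # Hypothetical, shortened command; paste the real one from the build log.
    cmd = "cc -O2 -isystem /usr/local/cuda/include -Icsrc/cpu/includes -c csrc/cpu/comm/ccl.cpp"
    dirs = include_dirs_from_command(cmd)
    hits = find_header("oneapi/ccl.hpp", dirs)
    print("searched:", dirs)
    print("found in:", hits if hits else "no include directory")
```

An empty result for the real command would be consistent with the error above: the oneCCL headers are not on any include path visible to the deepspeed_ccl_comm_op extension build.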

System info (please complete the following information):

Launcher context Are you launching your experiment with the deepspeed launcher, MPI, or something else? No

Docker context Are you using a specific docker image that you can share?

quay.io/condaforge/linux-anvil-cuda:11.8

Additional context

The builds have been failing in these PRs as well:

loadams commented 2 weeks ago

Thanks @weiji14 for opening this to track.