microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Trying to fine-tune Mistral using DeepSpeed but running into an error: Error building extension 'cpu_adam' #5429

Closed. SarthakM320 closed this issue 2 months ago.

SarthakM320 commented 6 months ago
Loading checkpoint shards: 100%|██████████| 3/3 [00:02<00:00,  1.26it/s]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA RTX A6000') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
[rank: 0] Seed set to 4
initializing deepspeed distributed: GLOBAL_RANK: 0, MEMBER: 1/1
Enabling DeepSpeed BF16. Model parameters and inputs will be cast to `bfloat16`.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [1,3]
Installed CUDA version 11.5 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination
Using /home/sarthak/.cache/torch_extensions/py39_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/sarthak/.cache/torch_extensions/py39_cu118/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/4] /usr/bin/nvcc --generate-dependencies-with-compile --dependency-output custom_cuda_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/include -isystem /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include -isystem /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/TH -isystem /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/THC -isystem /home/sarthak/miniconda3/envs/tmi/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ --threads=8 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -c /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
FAILED: custom_cuda_kernel.cuda.o 
/usr/bin/nvcc --generate-dependencies-with-compile --dependency-output custom_cuda_kernel.cuda.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/include -isystem /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include -isystem /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/TH -isystem /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/THC -isystem /home/sarthak/miniconda3/envs/tmi/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ --threads=8 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -c /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
  435 |         function(_Functor&& __f)
      |                                                                                                                                                 ^ 
/usr/include/c++/11/bits/std_function.h:435:145: note:         ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
  530 |         operator=(_Functor&& __f)
      |                                                                                                                                                  ^ 
/usr/include/c++/11/bits/std_function.h:530:146: note:         ‘_ArgTypes’
[2/4] c++ -MMD -MF cpu_adam_impl.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/include -isystem /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include -isystem /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/TH -isystem /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/THC -isystem /home/sarthak/miniconda3/envs/tmi/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/cpu_adam_impl.cpp -o cpu_adam_impl.o 
FAILED: cpu_adam_impl.o 
c++ -MMD -MF cpu_adam_impl.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/include -isystem /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include -isystem /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/TH -isystem /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/THC -isystem /home/sarthak/miniconda3/envs/tmi/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/cpu_adam_impl.cpp -o cpu_adam_impl.o 
In file included from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/c10/util/TypeList.h:3:0,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/c10/util/Metaprogramming.h:3,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/c10/core/DispatchKeySet.h:4,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/c10/core/Backend.h:5,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/c10/core/Layout.h:3,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:12,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/ATen/core/Tensor.h:3,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/ATen/Tensor.h:3,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/torch/csrc/autograd/function_hook.h:3,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/torch/csrc/autograd/cpp_hook.h:2,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/torch/csrc/autograd/variable.h:6,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/torch/csrc/autograd/autograd.h:3,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/torch/csrc/api/include/torch/autograd.h:3,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/torch/csrc/api/include/torch/all.h:7,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/torch/extension.h:5,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/cpu_adam_impl.cpp:6:
/home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/c10/util/C++17.h:16:2: error: #error "You're trying to build PyTorch with a too old version of GCC. We need GCC 9 or later."
 #error \
  ^~~~~
[3/4] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/include -isystem /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include -isystem /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/TH -isystem /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/THC -isystem /home/sarthak/miniconda3/envs/tmi/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
FAILED: cpu_adam.o 
c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -I/usr/include -isystem /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include -isystem /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/TH -isystem /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/THC -isystem /home/sarthak/miniconda3/envs/tmi/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX256__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
In file included from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/c10/util/TypeList.h:3:0,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/c10/util/Metaprogramming.h:3,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/c10/core/DispatchKeySet.h:4,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/c10/core/Backend.h:5,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/c10/core/Layout.h:3,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/ATen/core/TensorBody.h:12,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/ATen/core/Tensor.h:3,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/ATen/Tensor.h:3,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/torch/csrc/autograd/function_hook.h:3,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/torch/csrc/autograd/cpp_hook.h:2,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/torch/csrc/autograd/variable.h:6,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/torch/csrc/autograd/autograd.h:3,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/torch/csrc/api/include/torch/autograd.h:3,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/torch/csrc/api/include/torch/all.h:7,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/torch/extension.h:5,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/deepspeed/ops/csrc/includes/cpu_adam.h:12,
                 from /home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp:6:
/home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/include/c10/util/C++17.h:16:2: error: #error "You're trying to build PyTorch with a too old version of GCC. We need GCC 9 or later."
 #error \
  ^~~~~
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2096, in _run_ninja_build
    subprocess.run(
  File "/home/sarthak/miniconda3/envs/tmi/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/disks/2/sarthak/transformer_multi_image/main.py", line 102, in <module>
    main(vars(args))
  File "/disks/2/sarthak/transformer_multi_image/main.py", line 74, in main
    trainer.fit(model,datamodule=dataset)
  File "/home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 963, in _run
    self.strategy.setup(self)
  File "/home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/lightning/pytorch/strategies/deepspeed.py", line 353, in setup
    self.init_deepspeed()
  File "/home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/lightning/pytorch/strategies/deepspeed.py", line 454, in init_deepspeed
    self._initialize_deepspeed_train(self.model)
  File "/home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/lightning/pytorch/strategies/deepspeed.py", line 486, in _initialize_deepspeed_train
    ) = self._init_optimizers()
  File "/home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/lightning/pytorch/strategies/deepspeed.py", line 460, in _init_optimizers
    optimizers, lr_schedulers = _init_optimizers_and_lr_schedulers(self.lightning_module)
  File "/home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/lightning/pytorch/core/optimizer.py", line 178, in _init_optimizers_and_lr_schedulers
    optim_conf = call._call_lightning_module_hook(model.trainer, "configure_optimizers", pl_module=model)
  File "/home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 157, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/disks/2/sarthak/transformer_multi_image/model/arch.py", line 272, in configure_optimizers
    deepspeed.ops.op_builder.CPUAdamBuilder().load()
  File "/home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 479, in load
    return self.jit_load(verbose)
  File "/home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 523, in jit_load
    op_module = load(name=self.name,
  File "/home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1306, in load
    return _jit_compile(
  File "/home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1710, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1823, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2112, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'cpu_adam'

ds_report output:

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/torch']
torch version .................... 2.2.2+cu118
deepspeed install path ........... ['/home/sarthak/miniconda3/envs/tmi/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.14.0, unknown, unknown
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.5
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0
shared memory (/dev/shm) size .... 503.87 GB

Any ideas on how to solve this?

xuanhua commented 6 months ago

The log shows: #error "You're trying to build PyTorch with a too old version of GCC. We need GCC 9 or later." You might need a newer version of GCC.
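
One way to confirm this is to ask the compiler that the JIT build invokes for its version. The sketch below is only an illustration: it assumes the build picks up `c++` from PATH (as the `[2/4] c++ ...` lines in the log suggest) and that this compiler is GCC. The helper names are hypothetical, and the minimum of 9 is taken from the `#error` text.

```python
import re
import shutil
import subprocess

MIN_GCC_MAJOR = 9  # taken from the "#error ... We need GCC 9 or later" line in the log


def parse_major(version_text):
    """Return the leading major version from text such as '11.4.0', or None."""
    match = re.match(r"\s*(\d+)", version_text)
    return int(match.group(1)) if match else None


def compiler_major(compiler="c++"):
    """Ask the compiler found on PATH for its version; None if missing or unparsable."""
    path = shutil.which(compiler)
    if path is None:
        return None
    result = subprocess.run(
        [path, "-dumpversion"], capture_output=True, text=True, check=False
    )
    return parse_major(result.stdout)


def is_new_enough(major, minimum=MIN_GCC_MAJOR):
    """True only when a version was found and it meets the minimum."""
    return major is not None and major >= minimum


if __name__ == "__main__":
    major = compiler_major()
    print(f"c++ major version: {major}, new enough: {is_new_enough(major)}")
```

If this reports a major version below 9, the `c++` on PATH is probably not the compiler you expect, even if a newer GCC is installed elsewhere on the machine.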

tdye24 commented 4 months ago

Same issue here, and I have no idea how to fix it.

jeeyung commented 3 months ago

Same issue.

SarthakM320 commented 2 months ago

The problem was with the nvcc version (11.5 in the ds_report above, versus CUDA 11.8 for torch), so I changed it.
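
For anyone who wants to check for the same mismatch, here is a rough sketch that compares the `nvcc` on PATH with the CUDA version torch was built for. It assumes `nvcc --version` prints a line containing `release X.Y` and that `torch.version.cuda` is a string such as `11.8`. The function names are hypothetical. Note that DeepSpeed accepted the 11.5 / 11.8 combination in the log above, so a difference reported here is a hint to investigate, not proof of the cause.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(version_output):
    """Extract (major, minor) from `nvcc --version` text, e.g. 'release 11.5, V11.5.119'."""
    match = re.search(r"release\s+(\d+)\.(\d+)", version_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def parse_cuda_version(version_string):
    """Turn a string such as '11.8' into (11, 8); None for empty or missing input."""
    match = re.match(r"(\d+)\.(\d+)", version_string or "")
    return (int(match.group(1)), int(match.group(2))) if match else None


def nvcc_release(nvcc="nvcc"):
    """Run the nvcc found on PATH and parse its release; None if nvcc is not found."""
    path = shutil.which(nvcc)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    found = nvcc_release()
    try:
        import torch  # only needed for the comparison, not for the parsing helpers
        expected = parse_cuda_version(torch.version.cuda)
    except ImportError:
        expected = None
    print(f"nvcc release: {found}, torch CUDA: {expected}, match: {found == expected}")
```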

loadams commented 2 months ago

@SarthakM320, we believe this is resolved in the latest release of DeepSpeed; please test it. If you are still hitting this, please comment and we can re-open the issue.
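
Before retesting, it may help to confirm that the environment actually has a DeepSpeed newer than the 0.14.0 shown in the ds_report output. The following is a minimal sketch using the standard library's `importlib.metadata`. The thread does not say which release contains the fix, so the comparison against 0.14.0 and the helper names are assumptions made for illustration.

```python
import re
from importlib import metadata

REPORTED_VERSION = "0.14.0"  # the version listed in the ds_report output of this thread


def version_tuple(version):
    """Convert '0.14.0' or '0.14.5+abc' into a tuple of up to three integers."""
    public_part = version.split("+")[0]  # drop any local build suffix
    return tuple(int(part) for part in re.findall(r"\d+", public_part)[:3])


def installed_version(package="deepspeed"):
    """Return the installed version string, or None if the package is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_newer(installed, reported=REPORTED_VERSION):
    """True when an installed version exists and is strictly newer than the reported one."""
    return installed is not None and version_tuple(installed) > version_tuple(reported)


if __name__ == "__main__":
    current = installed_version()
    print(f"installed deepspeed: {current}, newer than {REPORTED_VERSION}: {is_newer(current)}")
```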