microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0
33.6k stars, 3.94k forks

Bug Report: Issues Building DeepSpeed on Windows #5679

Status: Closed (Moemu closed this issue 1 week ago)

Moemu commented 1 week ago

Description:

I ran into a problem while building DeepSpeed on Windows. The build fails with an error saying that the destination folder already exists.

Environment:

OS: Windows 11
Python Version: 3.11
Conda Environment: Yes
DeepSpeed Version: latest
CUDA Version: 12.3
PyTorch Version: 2.3.1+cu121

Steps to Reproduce:

Clone the DeepSpeed repository.
Navigate to the DeepSpeed directory.
Run the build script: build_win.bat

Error Log:

 (Neuro) C:\Muice-Vtuber\Neuro-master\DeepSpeed>build_win.bat
DS_BUILD_OPS=1
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
Installed CUDA version 12.3 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
Installed CUDA version 12.3 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
The system cannot find the file specified.
 [WARNING]  cpu_adam requires the 'lscpu' command, but it does not exist!
 [WARNING]  cpu_adam attempted to query 'lscpu' after failing to use py-cpuinfo to detect the CPU architecture. 'lscpu' does not appear to exist on your system, will fall back to use -march=native and non-vectorized execution.
The system cannot find the file specified.
 [WARNING]  cpu_adam requires the 'lscpu' command, but it does not exist!
 [WARNING]  cpu_adam attempted to query 'lscpu' after failing to use py-cpuinfo to detect the CPU architecture. 'lscpu' does not appear to exist on your system, will fall back to use -march=native and non-vectorized execution.
Installed CUDA version 12.3 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
Installed CUDA version 12.3 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
The system cannot find the file specified.
 [WARNING]  cpu_adagrad requires the 'lscpu' command, but it does not exist!
 [WARNING]  cpu_adagrad attempted to query 'lscpu' after failing to use py-cpuinfo to detect the CPU architecture. 'lscpu' does not appear to exist on your system, will fall back to use -march=native and non-vectorized execution.
The system cannot find the file specified.
 [WARNING]  cpu_adagrad requires the 'lscpu' command, but it does not exist!
 [WARNING]  cpu_adagrad attempted to query 'lscpu' after failing to use py-cpuinfo to detect the CPU architecture. 'lscpu' does not appear to exist on your system, will fall back to use -march=native and non-vectorized execution.
Installed CUDA version 12.3 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
Installed CUDA version 12.3 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
The system cannot find the file specified.
 [WARNING]  cpu_lion requires the 'lscpu' command, but it does not exist!
 [WARNING]  cpu_lion attempted to query 'lscpu' after failing to use py-cpuinfo to detect the CPU architecture. 'lscpu' does not appear to exist on your system, will fall back to use -march=native and non-vectorized execution.
The system cannot find the file specified.
 [WARNING]  cpu_lion requires the 'lscpu' command, but it does not exist!
 [WARNING]  cpu_lion attempted to query 'lscpu' after failing to use py-cpuinfo to detect the CPU architecture. 'lscpu' does not appear to exist on your system, will fall back to use -march=native and non-vectorized execution.
Installed CUDA version 12.3 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
Installed CUDA version 12.3 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
Installed CUDA version 12.3 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
Installed CUDA version 12.3 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
Installed CUDA version 12.3 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
Installed CUDA version 12.3 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.3
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
Installed CUDA version 12.3 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
Installed CUDA version 12.3 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
Installed CUDA version 12.3 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination
Install Ops={'async_io': False, 'fused_adam': 1, 'cpu_adam': 1, 'cpu_adagrad': 1, 'cpu_lion': 1, 'evoformer_attn': False, 'fp_quantizer': False, 'fused_lamb': 1, 'fused_lion': 1, 'inference_core_ops': False, 'cutlass_ops': False, 'transformer_inference': False, 'quantizer': 1, 'ragged_device_ops': False, 'ragged_ops': 1, 'random_ltd': 1, 'sparse_attn': False, 'spatial_inference': 1, 'transformer': 1, 'stochastic_transformer': 1}
Traceback (most recent call last):
  File "C:\Muice-Vtuber\Neuro-master\DeepSpeed\setup.py", line 212, in <module>
    shutil.copytree('.\\csrc', '.\\deepspeed\\ops')
  File "C:\Users\Moemu\.conda\envs\Neuro\Lib\shutil.py", line 560, in copytree
    return _copytree(entries=entries, src=src, dst=dst, symlinks=symlinks,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Moemu\.conda\envs\Neuro\Lib\shutil.py", line 459, in _copytree
    os.makedirs(dst, exist_ok=dirs_exist_ok)
  File "<frozen os>", line 225, in makedirs
FileExistsError: [WinError 183] Cannot create a file when that file already exists: '.\\deepspeed\\ops'

ycsgg commented 1 week ago

Try replacing these two lines in setup.py:

shutil.copytree('.\\csrc', '.\\deepspeed\\ops') 
shutil.copytree('.\\op_builder', '.\\deepspeed\\ops')

with

shutil.copytree('.\\csrc', '.\\deepspeed\\ops\\csrc') 
shutil.copytree('.\\op_builder', '.\\deepspeed\\ops\\op_builder')

I'm not sure whether this will work, though.
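
A minimal sketch of why the suggested paths could avoid the collision, run in a scratch directory rather than the real repository. The layout and the file name kernel.cpp are stand-ins chosen for illustration; only the behavior of shutil.copytree is being shown.

```python
import os
import shutil
import tempfile


def demo_copy(root):
    """Recreate the collision in a scratch directory (stand-in layout, not the real repo)."""
    src = os.path.join(root, "csrc")
    ops = os.path.join(root, "deepspeed", "ops")
    os.makedirs(src)
    os.makedirs(ops)  # the destination directory already exists before the copy
    with open(os.path.join(src, "kernel.cpp"), "w") as f:  # hypothetical file name
        f.write("// placeholder\n")

    # Copying onto the existing directory fails, as in the traceback above.
    try:
        shutil.copytree(src, ops)
        collided = False
    except FileExistsError:
        collided = True

    # Copying into a subdirectory that does not exist yet succeeds.
    shutil.copytree(src, os.path.join(ops, "csrc"))
    copied = os.path.isfile(os.path.join(ops, "csrc", "kernel.cpp"))
    return collided, copied


with tempfile.TemporaryDirectory() as tmp:
    collided, copied = demo_copy(tmp)
    print(collided, copied)
```

Note that the second form only succeeds while the subdirectory does not exist yet, so a second build attempt would hit the same error unless the earlier copies are removed first.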

Moemu commented 1 week ago

Thank you. In addition to that change, I deleted the existing copies (.\\deepspeed\\accelerator, .\\deepspeed\\ops\\csrc and .\\deepspeed\\ops\\op_builder), and the copy step then worked.

But I ran into a new error :(

creating build\lib.win-amd64-cpython-311\deepspeed\inference\v2\ragged\csrc
copying deepspeed\inference\v2\ragged\csrc\fast_host_buffer.cu -> build\lib.win-amd64-cpython-311\deepspeed\inference\v2\ragged\csrc
copying deepspeed\inference\v2\ragged\csrc\ragged_ops.cpp -> build\lib.win-amd64-cpython-311\deepspeed\inference\v2\ragged\csrc
copying deepspeed\ops\sparse_attention\trsrc\matmul.tr -> build\lib.win-amd64-cpython-311\deepspeed\ops\sparse_attention\trsrc
copying deepspeed\ops\sparse_attention\trsrc\softmax_bwd.tr -> build\lib.win-amd64-cpython-311\deepspeed\ops\sparse_attention\trsrc
copying deepspeed\ops\sparse_attention\trsrc\softmax_fwd.tr -> build\lib.win-amd64-cpython-311\deepspeed\ops\sparse_attention\trsrc
running build_ext
C:\Users\Moemu\.conda\envs\Neuro\Lib\site-packages\torch\utils\cpp_extension.py:418: UserWarning: The detected CUDA version (12.3) has a minor version mismatch with the version that was used to compile PyTorch (12.1). Most likely this shouldn't be a problem.
  warnings.warn(CUDA_MISMATCH_WARN.format(cuda_str_version, torch.version.cuda))
building 'deepspeed.ops.adam.fused_adam_op' extension
creating build\temp.win-amd64-cpython-311
creating build\temp.win-amd64-cpython-311\Release
creating build\temp.win-amd64-cpython-311\Release\csrc
creating build\temp.win-amd64-cpython-311\Release\csrc\adam
"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.40.33807\bin\Hostx64\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -IC:\Muice-Vtuber\Neuro-master\DeepSpeed-master\csrc\includes -IC:\Muice-Vtuber\Neuro-master\DeepSpeed-master\csrc\adam -IC:\Users\Moemu\.conda\envs\Neuro\Lib\site-packages\torch\include -IC:\Users\Moemu\.conda\envs\Neuro\Lib\site-packages\torch\include\torch\csrc\api\include -IC:\Users\Moemu\.conda\envs\Neuro\Lib\site-packages\torch\include\TH -IC:\Users\Moemu\.conda\envs\Neuro\Lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\include" -IC:\Users\Moemu\.conda\envs\Neuro\include -IC:\Users\Moemu\.conda\envs\Neuro\Include /EHsc /Tpcsrc/adam/fused_adam_frontend.cpp /Fobuild\temp.win-amd64-cpython-311\Release\csrc/adam/fused_adam_frontend.obj /MD /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /wd4624 /wd4067 /wd4068 /EHsc -O2 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=fused_adam_op -D_GLIBCXX_USE_CXX11_ABI=0 /std:c++17
fused_adam_frontend.cpp
C:\Users\Moemu\.conda\envs\Neuro\Lib\site-packages\torch\include\c10/core/DeviceType.h(10): fatal error C1083: Cannot open include file: 'cstddef': No such file or directory
error: command 'C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Tools\\MSVC\\14.40.33807\\bin\\Hostx64\\x64\\cl.exe' failed with exit code 2

costin-eseanu commented 1 week ago

@Moemu, it looks like MSVC can't find cstddef, which is a standard C++ include file. Please make sure to run build_win.bat from a "Developer Command Prompt for VS 2022" which sets the correct environment variables for the compiler. In addition, you can build the costineseanu/windows_inference_build branch which has more fixes for the Windows build (including the one about not being able to copy files).
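
A rough pre-flight check along these lines may help confirm that the shell was started with the compiler environment before running the build. It is only a heuristic: it assumes that a correctly initialized prompt defines the INCLUDE and LIB variables and puts cl.exe on PATH, and the helper name msvc_env_problems is made up for this sketch.

```python
import os
import shutil


def msvc_env_problems(env=None, which=shutil.which):
    """Return hints that the MSVC build environment is not set up (heuristic only)."""
    env = os.environ if env is None else env
    problems = []
    # cl.exe and link.exe consult INCLUDE and LIB for standard headers and libraries.
    for var in ("INCLUDE", "LIB"):
        if not env.get(var):
            problems.append(f"{var} is not set")
    # On Windows, shutil.which resolves "cl" to cl.exe through PATHEXT.
    if which("cl", path=env.get("PATH")) is None:
        problems.append("cl.exe is not on PATH")
    return problems


if __name__ == "__main__":
    for line in msvc_env_problems() or ["MSVC environment looks initialized"]:
        print(line)
```

An empty result does not guarantee a working toolchain, but a missing INCLUDE is consistent with the compiler failing to find a standard header such as cstddef.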

ChangxingJiang commented 1 week ago

Thank you. Changing the code in setup.py and deleting the three existing copies worked for me. I traced the change to https://github.com/microsoft/DeepSpeed/pull/5596; the underlying problem is that shutil.copytree, by default, cannot copy onto a destination that already exists.
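
For anyone who needs the copy step to survive repeated builds, here are two possible approaches, sketched in a scratch directory. The helper names fresh_copytree and merge_copytree and the file names are invented for illustration and are not part of DeepSpeed's setup.py; dirs_exist_ok requires Python 3.8 or newer.

```python
import os
import shutil
import tempfile


def fresh_copytree(src, dst):
    """Copy src to dst, first deleting a directory left at dst by an earlier run.

    Assumes dst, if present, is a real directory (not a symlink or a file).
    """
    if os.path.isdir(dst):
        shutil.rmtree(dst)
    return shutil.copytree(src, dst)


def merge_copytree(src, dst):
    """Copy src into dst, tolerating an existing dst (Python 3.8+)."""
    return shutil.copytree(src, dst, dirs_exist_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "op_builder")
    dst = os.path.join(tmp, "ops", "op_builder")
    os.makedirs(src)
    with open(os.path.join(src, "builder.py"), "w") as f:  # hypothetical file
        f.write("# placeholder\n")

    fresh_copytree(src, dst)  # first build: dst is created
    open(os.path.join(dst, "stale.txt"), "w").close()  # leftover from an old build

    merge_copytree(src, dst)  # rerun: no error, but the leftover file survives
    stale_after_merge = os.path.exists(os.path.join(dst, "stale.txt"))

    fresh_copytree(src, dst)  # rerun: no error, and the leftover file is gone
    stale_after_fresh = os.path.exists(os.path.join(dst, "stale.txt"))
    print(stale_after_merge, stale_after_fresh)
```

The merge variant is the smaller change, while the delete-then-copy variant matches the manual cleanup described earlier in this thread.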