lamikr / rocm_sdk_builder


Ubuntu 22.04 : build fails in build_deepspeed_rocm.sh (missing oneapi/ccl.hpp) #75

Closed meso-uca closed 2 days ago

meso-uca commented 1 week ago

OS: Ubuntu 22.04 LTS (cloud image: https://cloud-images.ubuntu.com/jammy/current/)

Build process:

git clone https://github.com/lamikr/rocm_sdk_builder.git
cd rocm_sdk_builder
git checkout releases/rocm_sdk_builder_611
./install_deps.sh
./babs.sh -i
./babs.sh -b

Error :

/opt/rocm_sdk_611/lib/python3.9/site-packages/setuptools/command/build_py.py:207: _Warning: Package 'deepspeed.autotuning.config_templates' is absent from the `packages` configuration.
/opt/rocm_sdk_611/lib/python3.9/site-packages/setuptools/command/build_py.py:207: _Warning: Package 'deepspeed.inference.v2.kernels.core_ops.cuda_linear.include' is absent from the `packages` configuration.
/opt/rocm_sdk_611/lib/python3.9/site-packages/setuptools/command/build_py.py:207: _Warning: Package 'deepspeed.inference.v2.kernels.cutlass_ops.shared_resources' is absent from the `packages` configuration.
/opt/rocm_sdk_611/lib/python3.9/site-packages/setuptools/command/build_py.py:207: _Warning: Package 'deepspeed.inference.v2.kernels.includes' is absent from the `packages` configuration.
...
/opt/rocm_sdk_611/lib/python3.9/site-packages/setuptools/command/build_py.py:207: _Warning: Package 'deepspeed.ops.csrc.xpu.packbits' is absent from the `packages` configuration.
...
gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -I/opt/rocm_sdk_611/include -I/opt/rocm_sdk_611/hsa/include -I/opt/rocm_sdk_611/rocm_smi/include -I/opt/rocm_sdk_611/rocblas/include -I/opt/rocm_sdk_611/include -I/opt/rocm_sdk_611/hsa/include -I/opt/rocm_sdk_611/rocm_smi/include -I/opt/rocm_sdk_611/rocblas/include -I/opt/rocm_sdk_611/include -I/opt/rocm_sdk_611/hsa/include -I/opt/rocm_sdk_611/rocm_smi/include -I/opt/rocm_sdk_611/rocblas/include -fPIC -I/home/ubuntu/rocm_sdk_builder/src_projects/DeepSpeed/csrc/cpu/includes -I/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/include -I/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/include/TH -I/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/include/THC -I/opt/rocm_sdk_611/include/python3.9 -c csrc/cpu/comm/ccl.cpp -o build/temp.linux-x86_64-cpython-39/csrc/cpu/comm/ccl.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -O2 -fopenmp -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -DTORCH_EXTENSION_NAME=deepspeed_ccl_comm_op -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++17
csrc/cpu/comm/ccl.cpp:8:10: fatal error: oneapi/ccl.hpp: No such file or directory
    8 | #include <oneapi/ccl.hpp>
      |          ^~~~~~~~~~~~~~~~
compilation terminated.
error: command '/usr/bin/gcc' failed with exit code 1
build failed: DeepSpeed
  error in build cmd: ./build_deepspeed_rocm.sh
Build failed
lamikr commented 1 week ago

Latest versions of install_deps.sh should warn if the access rights to the AMD GPU device /dev/kfd are not configured correctly. Usually that device grants read-write access to members of the render group, and at least on my Ubuntu 22.04 I needed to add my user to that group. (install_deps.sh will print the command to use to do that.)

After that the ROCm libraries can read/write to the GPU and DeepSpeed will also build ok with ROCm SDK support. If the DeepSpeed build cannot access the GPU, it seems to fall back to a CPU-only build, which requires an Intel CPU library; when that library is missing, it shows the error you see.
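A quick way to check the device state described above (a minimal sketch; the exact group name and the command install_deps.sh prints can differ per distribution):

```shell
#!/bin/sh
# Hypothetical helper: report the /dev/kfd access state that
# install_deps.sh checks for before a ROCm build.
check_kfd() {
    if [ -e /dev/kfd ]; then
        ls -l /dev/kfd
        if id -nG | tr ' ' '\n' | grep -qx render; then
            echo "/dev/kfd present and current user is in the render group"
        else
            echo "/dev/kfd present but current user is NOT in the render group"
            echo "fix: sudo usermod -aG render \$USER  (then log out and back in)"
        fi
    else
        echo "/dev/kfd not found: amdgpu driver not loaded or no GPU visible"
    fi
}

check_kfd
```

On a VM without a GPU the last branch fires, which matches the CPU-only fallback behavior described above.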

meso-uca commented 1 week ago

Thank you for your answer. That is interesting: I am building on a virtual machine without any GPU, which would explain the fallback you describe.

Am I right? Do you know if it would be possible to build DeepSpeed without a GPU attached to the build server, for instance by specifying the build target on the command line?

lamikr commented 1 week ago

The ROCm HIP/LLVM compiler can build GPU kernel code even for GPUs that are not installed on the system, and all AMD ROCm applications allow specifying those GPUs via a parameter (usually the cmake parameter -DAMDGPU_TARGETS).

DeepSpeed is an exception and does not take that as a parameter. Instead, in its builder.py file it asks the installed PyTorch version whether the device supports ROCm, then uses rocminfo to detect the GPU and uses that as the build target. On a virtual machine that will fail if you have not exposed the GPU over PCIe to it (which is doable). The disadvantage of that approach is that you cannot build support for other GPUs in the virtual machine.
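The detection order described above can be sketched roughly like this (hypothetical function and environment-variable handling; the real logic lives in DeepSpeed's builder.py):

```python
import os
import shutil
import subprocess

def get_amdgpu_targets():
    """Return the list of gfx targets to build for.

    Preference order (mirroring the patch described in this thread):
    1. An explicit AMDGPU_TARGETS environment variable, e.g. "gfx1010;gfx1030",
       so no GPU needs to be present at build time.
    2. Asking rocminfo for the GPUs actually present on the machine.
    """
    explicit = os.environ.get("AMDGPU_TARGETS")
    if explicit:
        return [t for t in explicit.replace(";", ",").split(",") if t]

    if shutil.which("rocminfo") is None:
        return []  # no ROCm runtime: caller may fall back to a CPU-only build

    out = subprocess.run(["rocminfo"], capture_output=True, text=True).stdout
    # rocminfo prints agent lines like "  Name:                    gfx1010"
    return sorted({tok for line in out.splitlines()
                   for tok in line.split() if tok.startswith("gfx")})
```

With only step 2 available (as in unpatched DeepSpeed), a VM without a passed-through GPU yields an empty target list, which triggers the CPU-only build and the missing oneapi/ccl.hpp error.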

I have now made some changes to the DeepSpeed patch so that it can optionally receive the AMDGPU_TARGETS list. If that parameter is given, it will not use rocminfo to get the GPU list.

If you want to test it, it's now available on the git branch wip/rocm_sdk_builder_612_aotriton. You need to run ./babs.sh -co && ./babs.sh -ap to get the patch applied, and then clean builddir/040_02_aotriton to get it rebuilt.

lamikr commented 1 week ago

I merged the change now to master, are you able to test it @meso-uca ?

meso-uca commented 5 days ago

Hello, Sorry for the delay. Thank you, I'm testing it right now. Building...

Brockhold commented 4 days ago

I've been following along here as I also hit the original problem reported in the ticket. With the master branch at 92c52f4 I still reach this error; my log looks almost identical to the original report. I was previously building 611; I removed the directory in /opt, removed all of the 040_x stages in builddir, and ran ./babs.sh -co and -ap.

building 'deepspeed.ops.comm.deepspeed_ccl_comm_op' extension
creating build/temp.linux-x86_64-cpython-39
creating build/temp.linux-x86_64-cpython-39/csrc
creating build/temp.linux-x86_64-cpython-39/csrc/cpu
creating build/temp.linux-x86_64-cpython-39/csrc/cpu/comm
gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -I/opt/rocm_sdk_612/include -I/opt/rocm_sdk_612/hsa/include -I/opt/rocm_sdk_612/rocm_smi/include -I/opt/rocm_sdk_612/rocblas/include -I/opt/rocm_sdk_612/include -I/opt/rocm_sdk_612/hsa/include -I/opt/rocm_sdk_612/rocm_smi/include -I/opt/rocm_sdk_612/rocblas/include -I/opt/rocm_sdk_612/include -I/opt/rocm_sdk_612/hsa/include -I/opt/rocm_sdk_612/rocm_smi/include -I/opt/rocm_sdk_612/rocblas/include -I/opt/rocm_sdk_612/include -I/opt/rocm_sdk_612/hsa/include -I/opt/rocm_sdk_612/rocm_smi/include -I/opt/rocm_sdk_612/rocblas/include -fPIC -I/home/ben/rocm_sdk_builder/src_projects/DeepSpeed/csrc/cpu/includes -I/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/include -I/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/include/TH -I/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/include/THC -I/opt/rocm_sdk_612/include/python3.9 -c csrc/cpu/comm/ccl.cpp -o build/temp.linux-x86_64-cpython-39/csrc/cpu/comm/ccl.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -O2 -fopenmp -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1018\" -DTORCH_EXTENSION_NAME=deepspeed_ccl_comm_op -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++17
csrc/cpu/comm/ccl.cpp:8:10: fatal error: oneapi/ccl.hpp: No such file or directory
    8 | #include <oneapi/ccl.hpp>
      |          ^~~~~~~~~~~~~~~~
compilation terminated.
error: command '/usr/bin/gcc' failed with exit code 1
build failed: DeepSpeed
  error in build cmd: ./build_deepspeed_rocm.sh gfx1010
Build failed

For what it's worth, I don't think the file oneapi/ccl.hpp exists anywhere in the tree; an oneapi directory exists in three places within src_projects/pytorch, but find ./ -iname ccl.hpp returns nothing.

lamikr commented 4 days ago

Thanks for testing. There must still be some other place that detects on the VM that the GPU is not present and therefore tries to build against the CPU instead of the GPU. I will try to find time to debug this later in the evening.

meso-uca commented 3 days ago

I get the same error message as well, after your merges.

lamikr commented 2 days ago

@meso-uca and @Brockhold I now have an updated fix available for testing in the pull request at https://github.com/lamikr/rocm_sdk_builder/pull/88

With that patch applied, I was at least able to build DeepSpeed on a Fedora 40 virtual Linux machine where /dev/kfd was not available.

If you check out that branch, you need to run the following commands to get the new patch included and to force a clean DeepSpeed build.

./babs.sh -co
./babs.sh -ap
rm -rf builddir/040_02_onnxruntime_deepspeed
./babs.sh -b
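One plausible shape for such a fix is to only register the oneCCL-based CPU comm extension when its header can actually be found (a sketch under assumed names, not the actual change in the pull request):

```python
import os

def ccl_header_available(include_dirs):
    """Check whether the oneapi/ccl.hpp header that csrc/cpu/comm/ccl.cpp
    includes is present in any of the given include directories."""
    return any(os.path.isfile(os.path.join(d, "oneapi", "ccl.hpp"))
               for d in include_dirs)

def select_extensions(include_dirs, gpu_targets):
    """Hypothetical guard: prefer the GPU build when targets are known,
    and only attempt the CPU-only CCL op when oneCCL is installed."""
    exts = []
    if gpu_targets:
        exts.append("deepspeed_rocm_ops")      # GPU build path
    elif ccl_header_available(include_dirs):
        exts.append("deepspeed_ccl_comm_op")   # CPU-only path needs oneCCL
    return exts
```

With a guard like this, a VM without a GPU and without oneCCL would simply skip the deepspeed_ccl_comm_op extension instead of failing with the fatal include error.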

Btw, it should be possible to make the GPU visible to the VM by using qemu's PCIe passthrough options; I think I tested this with qemu some years ago on some PC. It was a single parameter where you gave the PCIe device number to qemu. It may also have required some BIOS/UEFI settings changes.

Brockhold commented 2 days ago

Nice, I ran a build on wip/rocm_sdk_builder_612_bg75 and it completed successfully! I have not had time to test the results yet, but I'm optimistic. Thanks for this change! I realize it's not a critical problem, since everything works fine when the build machine has the GPU that the resulting ROCm build will be used with.

The reason this was a problem for me is that the container I'm building in runs on a server with no GPU to pass through at all; I'm just using a machine with lots of resources so the build is faster :joy:. I then move the resulting build into a container that does have GPU access, and put that on the workstation where the GPU is available to pass in.

lamikr commented 2 days ago

Thanks for confirming it on your side. Since it has not caused any problems on the two machines where I have tested it with a real GPU, I will now merge it. If possible, can you let me know later whether it worked on the target machine once you copied the binaries from the container?