parkerlreed opened 8 months ago
Build-time output, if it helps:
📦[parker@ROCm build]$ cmake -DCMAKE_PREFIX_PATH=$PWD/torch/libtorch/ -DGPU_RUNTIME="HIP" -DHIP_ROOT_DIR=/opt/rocm -DOPENSPLAT_BUILD_SIMPLE_TRAINER=ON ..
-- Found HIP: /opt/rocm (found version "5.7.31921-1949b1621")
Building PyTorch for GPU arch: gfx1031
-- Found HIP: /opt/rocm (found suitable version "5.7.31921-1949b1621", minimum required is "1.0")
HIP VERSION: 5.7.31921-1949b1621
-- Caffe2: Header version is: 5.7.3
***** ROCm version from rocm_version.h ****
ROCM_VERSION_DEV: 5.7.3
ROCM_VERSION_DEV_MAJOR: 5
ROCM_VERSION_DEV_MINOR: 7
ROCM_VERSION_DEV_PATCH: 3
ROCM_VERSION_DEV_INT: 50703
HIP_VERSION_MAJOR: 5
HIP_VERSION_MINOR: 7
TORCH_HIP_VERSION: 507
***** Library versions from dpkg *****
rocm-developer-tools VERSION: 5.7.3.50703-116~22.04
rocm-device-libs VERSION: 1.0.0.50703-116~22.04
hsakmt-roct-dev VERSION: 20230704.2.5268.50703-116~22.04
hsa-rocr-dev VERSION: 1.11.0.50703-116~22.04
***** Library versions from cmake find_package *****
-- hip::amdhip64 is SHARED_LIBRARY
-- /usr/bin/c++: CLANGRT compiler options not supported.
hip VERSION: 5.7.23384
hsa-runtime64 VERSION: 1.11.50703
amd_comgr VERSION: 2.5.0
-- hip::amdhip64 is SHARED_LIBRARY
-- /usr/bin/c++: CLANGRT compiler options not supported.
rocrand VERSION: 2.10.17
-- hip::amdhip64 is SHARED_LIBRARY
-- /usr/bin/c++: CLANGRT compiler options not supported.
hiprand VERSION: 2.10.16
-- hip::amdhip64 is SHARED_LIBRARY
-- /usr/bin/c++: CLANGRT compiler options not supported.
rocblas VERSION: 3.1.0
-- hip::amdhip64 is SHARED_LIBRARY
-- /usr/bin/c++: CLANGRT compiler options not supported.
hipblas VERSION: 1.1.0
-- hip::amdhip64 is SHARED_LIBRARY
-- /usr/bin/c++: CLANGRT compiler options not supported.
miopen VERSION: 2.20.0
-- hip::amdhip64 is SHARED_LIBRARY
-- /usr/bin/c++: CLANGRT compiler options not supported.
hipfft VERSION: 1.0.12
-- hip::amdhip64 is SHARED_LIBRARY
-- /usr/bin/c++: CLANGRT compiler options not supported.
hipsparse VERSION: 2.3.8
-- hip::amdhip64 is SHARED_LIBRARY
-- /usr/bin/c++: CLANGRT compiler options not supported.
rccl VERSION: 2.17.1
-- hip::amdhip64 is SHARED_LIBRARY
-- /usr/bin/c++: CLANGRT compiler options not supported.
rocprim VERSION: 2.13.1
-- hip::amdhip64 is SHARED_LIBRARY
-- /usr/bin/c++: CLANGRT compiler options not supported.
hipcub VERSION: 2.13.1
-- hip::amdhip64 is SHARED_LIBRARY
-- /usr/bin/c++: CLANGRT compiler options not supported.
rocthrust VERSION: 2.18.0
-- hip::amdhip64 is SHARED_LIBRARY
-- /usr/bin/c++: CLANGRT compiler options not supported.
hipsolver VERSION: 1.8.2
-- Configuring done
-- Generating done
-- Build files have been written to: /home/parker/build/OpenSplat/build
OK, so with HIP_VISIBLE_DEVICES=0 OpenSplat says it's trying CUDA...
With HIP_VISIBLE_DEVICES set to 1 or 2 it shows "CPU".
So this seemingly is never picking the GPU.
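A minimal sanity check here, assuming the default opensplat binary name and a placeholder dataset path (neither is taken from this log), would be to confirm which agents the ROCm runtime enumerates and then restrict HIP to a single device before launching:

```
# List the agents the ROCm runtime can see, with their names and gfx targets.
rocminfo | grep -E 'Agent|Marketing Name|gfx'

# Restrict HIP to one device index and launch OpenSplat against a dataset.
HIP_VISIBLE_DEVICES=1 ./opensplat /path/to/banana
```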
This GPU may be new enough that libtorch/ROCm 5.7 doesn't know how to talk to it properly.
I'll start over with 6.x everything and see if it starts working.
Trying 6.0.2 I get a ton of conflicting CMake errors:
📦[parker@ROCm build]$ cmake -DCMAKE_PREFIX_PATH=$PWD/libtorch/ -DGPU_RUNTIME="HIP" -DHIP_ROOT_DIR=/opt/rocm-6.0.2 -DOPENSPLAT_BUILD_SIMPLE_TRAINER=ON ..
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- The HIP compiler identification is Clang 17.0.0
-- Detecting HIP compiler ABI info
-- Detecting HIP compiler ABI info - done
-- Check for working HIP compiler: /opt/rocm-6.0.2/llvm/bin/clang++ - skipped
-- Detecting HIP compile features
-- Detecting HIP compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
Building PyTorch for GPU arch: gfx1103
-- Found HIP: /opt/rocm-6.0.2 (found suitable version "6.0.24015", minimum required is "1.0")
HIP VERSION: 6.0.24015
-- Caffe2: Header version is: 6.0.2
***** ROCm version from rocm_version.h ****
ROCM_VERSION_DEV: 6.0.2
ROCM_VERSION_DEV_MAJOR: 6
ROCM_VERSION_DEV_MINOR: 0
ROCM_VERSION_DEV_PATCH: 2
ROCM_VERSION_DEV_INT: 60002
HIP_VERSION_MAJOR: 6
HIP_VERSION_MINOR: 0
TORCH_HIP_VERSION: 600
***** Library versions from dpkg *****
rocm-developer-tools VERSION: 6.0.2.60002-115~22.04
rocm-device-libs VERSION: 1.0.0.60002-115~22.04
hsakmt-roct-dev VERSION: 20231016.2.245.60002-115~22.04
hsa-rocr-dev VERSION: 1.12.0.60002-115~22.04
***** Library versions from cmake find_package *****
CMake Error at /opt/rocm/lib/cmake/AMDDeviceLibs/AMDDeviceLibsConfig.cmake:18 (add_library):
add_library cannot create imported target "oclc_abi_version_400" because
another target with the same name already exists.
Call Stack (most recent call first):
/usr/share/cmake-3.22/Modules/CMakeFindDependencyMacro.cmake:47 (find_package)
/opt/rocm/lib/cmake/hip/hip-config-amd.cmake:65 (find_dependency)
/opt/rocm/lib/cmake/hip/hip-config.cmake:149 (include)
build/libtorch/share/cmake/Caffe2/public/LoadHIP.cmake:36 (find_package)
build/libtorch/share/cmake/Caffe2/public/LoadHIP.cmake:151 (find_package_and_print_version)
build/libtorch/share/cmake/Caffe2/Caffe2Config.cmake:74 (include)
build/libtorch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
CMakeLists.txt:47 (find_package)
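Note how the failing include chain goes through /opt/rocm/lib/cmake even though HIP_ROOT_DIR points at /opt/rocm-6.0.2, so the 5.7 and 6.0.2 CMake packages appear to be loaded together. Below is a sketch of one way to rule that out, assuming the 6.0.2 install is the one that should win and that the build directory already contains the matching libtorch; the ROCM_PATH approach is an assumption, not something confirmed in this thread:

```
# Make the 6.0.2 install the only ROCm prefix CMake can find.
export ROCM_PATH=/opt/rocm-6.0.2
export PATH="$ROCM_PATH/bin:$PATH"

# Drop cached paths from the earlier 5.7 configure, then reconfigure.
rm -f CMakeCache.txt
cmake -DCMAKE_PREFIX_PATH="$PWD/libtorch;$ROCM_PATH" \
      -DGPU_RUNTIME=HIP \
      -DHIP_ROOT_DIR="$ROCM_PATH" \
      -DOPENSPLAT_BUILD_SIMPLE_TRAINER=ON \
      ..
```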
So I got it built within the included Docker image and have come to the conclusion this has never even been close to working for me.
Whether ROCm is actually set up or not, I get the same "element 0 of tensors does not require grad and does not have a grad_fn" error,
leading me to believe it hasn't even tried using it.
The Docker container doesn't have access to /dev/kfd and still gave me the exact same error.
At this point I'm just at a loss getting this to run.
ROCm support is definitely still in need of testing, and issues might be present (such as the one you've found). We definitely don't want the tensors to end up allocated on the CPU though; the device should match the graphics card.
@parkerlreed Before launching docker, you may need to expose the host GPUs to the docker engine first, e.g. https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/docker.html#accessing-gpus-in-containers:
docker run --device /dev/kfd --device /dev/dri
To support ROCm 6.0.2, we have to change Dockerfile.rocm a little bit, since the latest stable PyTorch version doesn't support it. We have to either wait for its next stable release (2.3.0) or use AMD's version of the build. I can probably add an updated version for your further testing.
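For reference, a fuller invocation along the lines of the linked ROCm Docker page could look like the sketch below; the image name and mount are placeholders, and the extra group/seccomp flags are the ones AMD's docs commonly suggest rather than anything verified in this thread:

```
# Expose the ROCm compute (/dev/kfd) and render (/dev/dri) nodes to the container,
# add the groups that own those device nodes, and relax seccomp as the docs suggest.
docker run -it \
  --device /dev/kfd --device /dev/dri \
  --group-add video --group-add render \
  --security-opt seccomp=unconfined \
  -v "$PWD:/work" -w /work \
  opensplat-rocm:latest bash
```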
Thanks! My initial testing was with the GPU passed through properly, as far as I can tell from the correct rocminfo output.
The Docker without kfd was just a sanity check realizing that I was getting the same result either way.
Happy to test whatever is needed.
I just realized, is there any point in chasing this?
That list is a little bit older, but nothing seems to have changed recently.
I've been trying this on an ROG Ally with the Z1 Extreme and the Phoenix GPU. If it can't even run on it anyway, then I guess there's no point.
It seems like you are using an iGPU, which is not supported very well by ROCm. Maybe you can try export HSA_OVERRIDE_GFX_VERSION=11.0.0
or a different version and see if it works.
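A short sketch of that override suggestion, with the opensplat binary and dataset path as placeholders (gfx1103 gets reported as gfx1100, the target prebuilt ROCm math libraries typically ship kernels for):

```
# Tell the HSA runtime to report the iGPU as gfx1100 instead of gfx1103.
export HSA_OVERRIDE_GFX_VERSION=11.0.0

# Check what the runtime now reports, then try a training run on the GPU.
rocminfo | grep -i gfx
HIP_VISIBLE_DEVICES=1 ./opensplat /path/to/banana
```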
If you have a dedicated GPU, maybe you can disable the iGPU via:
@parkerlreed I created a ROCm 6.x based Docker build. Feel free to give it a try. Thank you!
Ubuntu 22.04 container in podman (Fedora host)
ROCm version amdgpu-install_5.7.50703-1_all.de
libtorch version libtorch-cxx11-abi-shared-with-deps-2.2.1+rocm5.7.zip
rocminfo reports CPU at 0 and GPU at 1
Set these variables accordingly
Trying to run the example banana set I get this (it claims to be using the CPU??)
If I run with HIP_VISIBLE_DEVICES=0 to go to the CPU I get a completely different error.