pierotofy / OpenSplat

Production-grade 3D gaussian splatting with CPU/GPU support for Windows, Mac and Linux 🚀

Runtime error: element 0 of tensors does not require grad and does not have a grad_fn #39

Open parkerlreed opened 5 months ago

parkerlreed commented 5 months ago

Ubuntu 22.04 container in podman (Fedora host)

ROCm version: amdgpu-install_5.7.50703-1_all.de

libtorch version: libtorch-cxx11-abi-shared-with-deps-2.2.1+rocm5.7.zip

rocminfo reports the CPU as agent 0 and the GPU as agent 1:

*******                  
Agent 2                  
*******                  
  Name:                    gfx1103                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon Graphics                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      32(0x20) KB                        
    L2:                      2048(0x800) KB                     
  Chip ID:                 5567(0x15bf)                       
  ASIC Revision:           7(0x7)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2700                               
  BDFID:                   2304                               
  Internal Node ID:        1                                  
  Compute Unit:            12                                 
  SIMDs per CU:            2                                  
  Shader Engines:          1                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 35                                 
  SDMA engine uCode::      16                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    1048576(0x100000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS:                     
      Size:                    1048576(0x100000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1103         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                   

I set these variables accordingly:

HIP_VISIBLE_DEVICES=1
ROCM_PATH=/opt/rocm
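
For reference, a minimal sketch of exporting these before a run (assuming OpenSplat picks them up from the environment):

export HIP_VISIBLE_DEVICES=1
export ROCM_PATH=/opt/rocm
./opensplat banana/ -n 2000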

Trying to run the example banana set, I get this (it claims to be using the CPU?):

📦[parker@ROCm build]$ ./opensplat banana/ -n 2000
Using CPU
Reading 14241 points
[W Cross.cpp:63] Warning: Using torch.cross without specifying the dim arg is deprecated.
Please either pass the dim explicitly or simply use torch.linalg.cross.
The default value of dim will change to agree with that of linalg.cross in a future release. (function operator())
Loading banana/images/frame_00001.JPG
Loading banana/images/frame_00002.JPG
Loading banana/images/frame_00003.JPG
Loading banana/images/frame_00004.JPG
Loading banana/images/frame_00005.JPG
Loading banana/images/frame_00006.JPG
Loading banana/images/frame_00008.JPG
Loading banana/images/frame_00009.JPG
Loading banana/images/frame_00010.JPG
Loading banana/images/frame_00011.JPG
Loading banana/images/frame_00013.JPG
Loading banana/images/frame_00014.JPG
Loading banana/images/frame_00015.JPG
Loading banana/images/frame_00016.JPG
element 0 of tensors does not require grad and does not have a grad_fn
Exception raised from run_backward at ../torch/csrc/autograd/autograd.cpp:105 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f0ad88d2c9c in /home/parker/build/OpenSplat/build/torch/libtorch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7f0ad887ca5c in /home/parker/build/OpenSplat/build/torch/libtorch/lib/libc10.so)
frame #2: <unknown function> + 0x52b016a (0x7f0ac3eb016a in /home/parker/build/OpenSplat/build/torch/libtorch/lib/libtorch_cpu.so)
frame #3: torch::autograd::backward(std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::optional<bool>, bool, std::vector<at::Tensor, std::allocator<at::Tensor> > const&) + 0x6a (0x7f0ac3eb2b8a in /home/parker/build/OpenSplat/build/torch/libtorch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x531dbd6 (0x7f0ac3f1dbd6 in /home/parker/build/OpenSplat/build/torch/libtorch/lib/libtorch_cpu.so)
frame #5: at::Tensor::_backward(c10::ArrayRef<at::Tensor>, std::optional<at::Tensor> const&, std::optional<bool>, bool) const + 0x4c (0x7f0ac05f71cc in /home/parker/build/OpenSplat/build/torch/libtorch/lib/libtorch_cpu.so)
frame #6: ./opensplat() [0x27b536]
frame #7: <unknown function> + 0x29d90 (0x7f0a8b937d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: __libc_start_main + 0x80 (0x7f0a8b937e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: ./opensplat() [0x247a45]

If I run with HIP_VISIBLE_DEVICES=0 (expecting that to map to the CPU), I get a completely different error:

Loading banana/images/frame_00016.JPG
HIP error: invalid device function
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.

Exception raised from c10_hip_check_implementation at ../c10/hip/HIPException.cpp:45 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7f2ff80f1c9c in /home/parker/build/OpenSplat/build/torch/libtorch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7f2ff809ba5c in /home/parker/build/OpenSplat/build/torch/libtorch/lib/libc10.so)
frame #2: c10::hip::c10_hip_check_implementation(int, char const*, char const*, int, bool) + 0x3cc (0x7f2fa177fc7c in /home/parker/build/OpenSplat/build/torch/libtorch/lib/libc10_hip.so)
frame #3: void at::native::gpu_kernel_impl<at::native::FillFunctor<float> >(at::TensorIteratorBase&, at::native::FillFunctor<float> const&) + 0x275 (0x7f2fac7e8e75 in /home/parker/build/OpenSplat/build/torch/libtorch/lib/libtorch_hip.so)
frame #4: at::native::fill_kernel_cuda(at::TensorIterator&, c10::Scalar const&) + 0x222 (0x7f2fac7e27e2 in /home/parker/build/OpenSplat/build/torch/libtorch/lib/libtorch_hip.so)
frame #5: <unknown function> + 0x1c50b8d (0x7f2fe0050b8d in /home/parker/build/OpenSplat/build/torch/libtorch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x18af403 (0x7f2fad6af403 in /home/parker/build/OpenSplat/build/torch/libtorch/lib/libtorch_hip.so)
frame #7: at::_ops::fill__Scalar::call(at::Tensor&, c10::Scalar const&) + 0x140 (0x7f2fe08d25c0 in /home/parker/build/OpenSplat/build/torch/libtorch/lib/libtorch_cpu.so)
frame #8: at::native::zero_(at::Tensor&) + 0xbf (0x7f2fe005134f in /home/parker/build/OpenSplat/build/torch/libtorch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0x18acb6b (0x7f2fad6acb6b in /home/parker/build/OpenSplat/build/torch/libtorch/lib/libtorch_hip.so)
frame #10: at::_ops::zero_::call(at::Tensor&) + 0x13d (0x7f2fe0dfda6d in /home/parker/build/OpenSplat/build/torch/libtorch/lib/libtorch_cpu.so)
frame #11: at::native::zeros_symint(c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>) + 0x118 (0x7f2fe0337128 in /home/parker/build/OpenSplat/build/torch/libtorch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x2e43299 (0x7f2fe1243299 in /home/parker/build/OpenSplat/build/torch/libtorch/lib/libtorch_cpu.so)
frame #13: at::_ops::zeros::redispatch(c10::DispatchKeySet, c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>) + 0xf6 (0x7f2fe088a3a6 in /home/parker/build/OpenSplat/build/torch/libtorch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x2c139c9 (0x7f2fe10139c9 in /home/parker/build/OpenSplat/build/torch/libtorch/lib/libtorch_cpu.so)
frame #15: at::_ops::zeros::call(c10::ArrayRef<c10::SymInt>, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>) + 0x1b2 (0x7f2fe08ebd52 in /home/parker/build/OpenSplat/build/torch/libtorch/lib/libtorch_cpu.so)
frame #16: ./opensplat() [0x25c01e]
frame #17: ./opensplat() [0x27b07e]
frame #18: <unknown function> + 0x29d90 (0x7f2fab137d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #19: __libc_start_main + 0x80 (0x7f2fab137e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #20: ./opensplat() [0x247a45]
parkerlreed commented 5 months ago

Build-time output, if it helps:

📦[parker@ROCm build]$ cmake -DCMAKE_PREFIX_PATH=$PWD/torch/libtorch/ -DGPU_RUNTIME="HIP" -DHIP_ROOT_DIR=/opt/rocm -DOPENSPLAT_BUILD_SIMPLE_TRAINER=ON ..
-- Found HIP: /opt/rocm (found version "5.7.31921-1949b1621") 
Building PyTorch for GPU arch: gfx1031
-- Found HIP: /opt/rocm (found suitable version "5.7.31921-1949b1621", minimum required is "1.0") 
HIP VERSION: 5.7.31921-1949b1621
-- Caffe2: Header version is: 5.7.3

***** ROCm version from rocm_version.h ****

ROCM_VERSION_DEV: 5.7.3
ROCM_VERSION_DEV_MAJOR: 5
ROCM_VERSION_DEV_MINOR: 7
ROCM_VERSION_DEV_PATCH: 3
ROCM_VERSION_DEV_INT:   50703
HIP_VERSION_MAJOR: 5
HIP_VERSION_MINOR: 7
TORCH_HIP_VERSION: 507

***** Library versions from dpkg *****

rocm-developer-tools VERSION: 5.7.3.50703-116~22.04
rocm-device-libs VERSION: 1.0.0.50703-116~22.04
hsakmt-roct-dev VERSION: 20230704.2.5268.50703-116~22.04
hsa-rocr-dev VERSION: 1.11.0.50703-116~22.04

***** Library versions from cmake find_package *****

-- hip::amdhip64 is SHARED_LIBRARY
-- /usr/bin/c++: CLANGRT compiler options not supported.
hip VERSION: 5.7.23384
hsa-runtime64 VERSION: 1.11.50703
amd_comgr VERSION: 2.5.0
-- hip::amdhip64 is SHARED_LIBRARY
-- /usr/bin/c++: CLANGRT compiler options not supported.
rocrand VERSION: 2.10.17
-- hip::amdhip64 is SHARED_LIBRARY
-- /usr/bin/c++: CLANGRT compiler options not supported.
hiprand VERSION: 2.10.16
-- hip::amdhip64 is SHARED_LIBRARY
-- /usr/bin/c++: CLANGRT compiler options not supported.
rocblas VERSION: 3.1.0
-- hip::amdhip64 is SHARED_LIBRARY
-- /usr/bin/c++: CLANGRT compiler options not supported.
hipblas VERSION: 1.1.0
-- hip::amdhip64 is SHARED_LIBRARY
-- /usr/bin/c++: CLANGRT compiler options not supported.
miopen VERSION: 2.20.0
-- hip::amdhip64 is SHARED_LIBRARY
-- /usr/bin/c++: CLANGRT compiler options not supported.
hipfft VERSION: 1.0.12
-- hip::amdhip64 is SHARED_LIBRARY
-- /usr/bin/c++: CLANGRT compiler options not supported.
hipsparse VERSION: 2.3.8
-- hip::amdhip64 is SHARED_LIBRARY
-- /usr/bin/c++: CLANGRT compiler options not supported.
rccl VERSION: 2.17.1
-- hip::amdhip64 is SHARED_LIBRARY
-- /usr/bin/c++: CLANGRT compiler options not supported.
rocprim VERSION: 2.13.1
-- hip::amdhip64 is SHARED_LIBRARY
-- /usr/bin/c++: CLANGRT compiler options not supported.
hipcub VERSION: 2.13.1
-- hip::amdhip64 is SHARED_LIBRARY
-- /usr/bin/c++: CLANGRT compiler options not supported.
rocthrust VERSION: 2.18.0
-- hip::amdhip64 is SHARED_LIBRARY
-- /usr/bin/c++: CLANGRT compiler options not supported.
hipsolver VERSION: 1.8.2
-- Configuring done
-- Generating done
-- Build files have been written to: /home/parker/build/OpenSplat/build
parkerlreed commented 5 months ago

OK, so with HIP_VISIBLE_DEVICES=0, OpenSplat says it's trying CUDA...

With HIP_VISIBLE_DEVICES=1 or 2 it shows "CPU".

So this is seemingly never picking the GPU.

parkerlreed commented 5 months ago

This GPU may be new enough that libtorch/ROCm 5.7 doesn't know how to talk to it properly.

I'll start over with 6.x everything and see if it starts working.
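
One rough way to check which GPU architectures the prebuilt libtorch HIP kernels were actually compiled for (a sketch; the path assumes the layout seen in the stack traces above):

strings torch/libtorch/lib/libtorch_hip.so | grep -oE 'gfx[0-9a-f]+' | sort -u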

parkerlreed commented 5 months ago

Trying 6.0.2, I get a CMake error about conflicting targets:

📦[parker@ROCm build]$ cmake -DCMAKE_PREFIX_PATH=$PWD/libtorch/ -DGPU_RUNTIME="HIP" -DHIP_ROOT_DIR=/opt/rocm-6.0.2 -DOPENSPLAT_BUILD_SIMPLE_TRAINER=ON ..
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- The HIP compiler identification is Clang 17.0.0
-- Detecting HIP compiler ABI info
-- Detecting HIP compiler ABI info - done
-- Check for working HIP compiler: /opt/rocm-6.0.2/llvm/bin/clang++ - skipped
-- Detecting HIP compile features
-- Detecting HIP compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE  
Building PyTorch for GPU arch: gfx1103
-- Found HIP: /opt/rocm-6.0.2 (found suitable version "6.0.24015", minimum required is "1.0") 
HIP VERSION: 6.0.24015
-- Caffe2: Header version is: 6.0.2

***** ROCm version from rocm_version.h ****

ROCM_VERSION_DEV: 6.0.2
ROCM_VERSION_DEV_MAJOR: 6
ROCM_VERSION_DEV_MINOR: 0
ROCM_VERSION_DEV_PATCH: 2
ROCM_VERSION_DEV_INT:   60002
HIP_VERSION_MAJOR: 6
HIP_VERSION_MINOR: 0
TORCH_HIP_VERSION: 600

***** Library versions from dpkg *****

rocm-developer-tools VERSION: 6.0.2.60002-115~22.04
rocm-device-libs VERSION: 1.0.0.60002-115~22.04
hsakmt-roct-dev VERSION: 20231016.2.245.60002-115~22.04
hsa-rocr-dev VERSION: 1.12.0.60002-115~22.04

***** Library versions from cmake find_package *****

CMake Error at /opt/rocm/lib/cmake/AMDDeviceLibs/AMDDeviceLibsConfig.cmake:18 (add_library):
  add_library cannot create imported target "oclc_abi_version_400" because
  another target with the same name already exists.
Call Stack (most recent call first):
  /usr/share/cmake-3.22/Modules/CMakeFindDependencyMacro.cmake:47 (find_package)
  /opt/rocm/lib/cmake/hip/hip-config-amd.cmake:65 (find_dependency)
  /opt/rocm/lib/cmake/hip/hip-config.cmake:149 (include)
  build/libtorch/share/cmake/Caffe2/public/LoadHIP.cmake:36 (find_package)
  build/libtorch/share/cmake/Caffe2/public/LoadHIP.cmake:151 (find_package_and_print_version)
  build/libtorch/share/cmake/Caffe2/Caffe2Config.cmake:74 (include)
  build/libtorch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:47 (find_package)
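
Judging from the call stack, hip-config.cmake is still being loaded from /opt/rocm even though HIP_ROOT_DIR points at /opt/rocm-6.0.2, so the 5.7 and 6.0.2 installs may be getting mixed. A quick check (just a guess at the cause):

ls -l /opt/rocm
# if this still resolves to the 5.7 install, two ROCm versions are being picked up at once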
parkerlreed commented 5 months ago

So I got it built with the included Docker image and have come to the conclusion that this has never even been close to working for me.

Whether ROCm is actually set up or not, I get the same "element 0 of tensors does not require grad and does not have a grad_fn" error, leading me to believe it hasn't even tried to use it.

The Docker container didn't have access to /dev/kfd and still gave me the exact same error.

At this point I'm just at a loss getting this to run.

pierotofy commented 5 months ago

ROCm support is definitely still in need of testing, and issues like the one you've found might be present. We definitely don't want the tensors to end up allocated on the CPU though; the device should match the graphics card.

pfxuan commented 5 months ago

@parkerlreed Before launching Docker, you may need to expose the host GPUs to the Docker engine first, e.g. https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/docker.html#accessing-gpus-in-containers:

docker run --device /dev/kfd --device /dev/dri

To support ROCm 6.0.2, we have to change Dockerfile.rocm a little bit, since the latest stable PyTorch version doesn't support it. We either have to wait for the next stable release (2.3.0) or use AMD's own build. I can probably add an updated version for you to test further.
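
A fuller invocation along the lines of the linked docs might look like this (the image name/tag, dataset mount, and the opensplat binary location inside the container are placeholders, not the actual Dockerfile.rocm output):

docker run -it --device /dev/kfd --device /dev/dri \
  --group-add video --security-opt seccomp=unconfined \
  -v /path/to/banana:/data \
  opensplat:rocm ./opensplat /data -n 2000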

parkerlreed commented 5 months ago

> @parkerlreed Before launching Docker, you may need to expose the host GPUs to the Docker engine first, e.g. https://rocm.docs.amd.com/projects/install-on-linux/en/latest/how-to/docker.html#accessing-gpus-in-containers:
>
> docker run --device /dev/kfd --device /dev/dri
>
> To support ROCm 6.0.2, we have to change Dockerfile.rocm a little bit, since the latest stable PyTorch version doesn't support it. We either have to wait for the next stable release (2.3.0) or use AMD's own build. I can probably add an updated version for you to test further.

Thanks! From what I can tell, my initial testing was with the GPU passed through properly, judging by the correct rocminfo output.

The Docker run without /dev/kfd was just a sanity check, after realizing I was getting the same result either way.

Happy to test whatever is needed.

parkerlreed commented 5 months ago

I just realized, is there any point in chasing this?

https://github.com/ROCm/ROCm/blob/19c0ba11505cf504d42b2096713d761236202361/docs/release/gpu_os_support.md?plain=1#L59

That list is a little bit older, but nothing seems to have changed recently.

I've been trying this on a ROG Ally with the Z1 Extreme and its Phoenix GPU. If ROCm can't even run on it anyway, then I guess there's no point.

pfxuan commented 5 months ago

It seems like you are using an iGPU, which is not supported very well by ROCm. Maybe you can try export HSA_OVERRIDE_GFX_VERSION=11.0.0 (or a different version) and see if it works.
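
For example (a sketch; whether the override is honored depends on the ROCm runtime):

export HSA_OVERRIDE_GFX_VERSION=11.0.0
./opensplat banana/ -n 2000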

pfxuan commented 5 months ago

If you have a dedicated GPU, maybe you can disable the iGPU via:

pfxuan commented 5 months ago

@parkerlreed I created a ROCm 6.x based Docker build. Feel free to give it a try. Thank you!