lamikr / rocm_sdk_builder


csrc/cpu/comm/ccl.cpp:8:10: fatal error: oneapi/ccl.hpp: No such file or directory #8

Closed eitch closed 1 week ago

eitch commented 1 month ago

While running ./babs.sh -b I received this error:

building 'deepspeed.ops.comm.deepspeed_ccl_comm_op' extension
creating build/temp.linux-x86_64-cpython-39
creating build/temp.linux-x86_64-cpython-39/csrc
creating build/temp.linux-x86_64-cpython-39/csrc/cpu
creating build/temp.linux-x86_64-cpython-39/csrc/cpu/comm
gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -I/opt/rocm_sdk_611/include -I/opt/rocm_sdk_611/hsa/include -I/opt/rocm_sdk_611/rocm_smi/include -I/opt/rocm_sdk_611/rocblas/include -I/opt/rocm_sdk_611/include -I/opt/rocm_sdk_611/hsa/include -I/opt/rocm_sdk_611/rocm_smi/include -I/opt/rocm_sdk_611/rocblas/include -I/opt/rocm_sdk_611/include -I/opt/rocm_sdk_611/hsa/include -I/opt/rocm_sdk_611/rocm_smi/include -I/opt/rocm_sdk_611/rocblas/include -fPIC -I/home/eitch/src/compile_temp/rocm_sdk_builder/src_projects/DeepSpeed/csrc/cpu/includes -I/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/include -I/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/include/TH -I/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/include/THC -I/opt/rocm_sdk_611/include/python3.9 -c csrc/cpu/comm/ccl.cpp -o build/temp.linux-x86_64-cpython-39/csrc/cpu/comm/ccl.o -fPIC -D__HIP_PLATFORM_AMD__=1 -DUSE_ROCM=1 -DHIPBLAS_V2 -O2 -fopenmp -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1018\" -DTORCH_EXTENSION_NAME=deepspeed_ccl_comm_op -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++17
csrc/cpu/comm/ccl.cpp:8:10: fatal error: oneapi/ccl.hpp: No such file or directory
    8 | #include <oneapi/ccl.hpp>
      |          ^~~~~~~~~~~~~~~~
compilation terminated.
error: command '/usr/bin/gcc' failed with exit code 1
build failed: DeepSpeed
  error in build cmd: ./build_deepspeed_rocm.sh

build failed

I'm running on Ubuntu:

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:  Ubuntu 23.10
Release:  23.10
Codename: mantic

And I'm using an RX 7900 XTX.

lamikr commented 1 month ago

So it is failing for you on the last package to build. I have seen the same error with the DeepSpeed package on Fedora when building on a virtual machine that does not have access to a real GPU, but I have not had time to investigate further. On my Mageia 9 machines with a GPU connected I do not see the error. It could be that the DeepSpeed build falls back to a CPU-only build if no GPU is detected.
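
One thing you could check before rebuilding is whether the installed PyTorch actually sees the GPU. A quick sanity check (assuming the build got far enough to install PyTorch under /opt/rocm_sdk_611) would be something like:

source /opt/rocm_sdk_611/bin/env_rocm.sh
# if this prints "False 0", DeepSpeed will most likely build only its CPU ops
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"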

A couple of questions:

1) Which gfx target did you select in the menu? It is visible in the file "build_cfg.user" in the SDK builder root.

2) As the build progressed that far, you should now be able to run these commands:

# source /opt/rocm_sdk_611/bin/env_rocm.sh
# rocminfo

After that, rocminfo should print out information about the detected environment, including the GPU. For example, on my laptop it prints something like:

...
Agent 2
  Name: gfx1035
  Uuid: GPU-XX
  Marketing Name: AMD Radeon Graphics
  Vendor Name: AMD
...

3) Are you able to try some example applications now? To test the compiler first:

# source /opt/rocm_sdk_611/bin/env_rocm.sh
# cd /opt/rocm_sdk_611/docs/examples/hipcc/hello_world
# make

Then for example:

# cd /opt/rocm_sdk_611/docs/examples/hipcc/hello_world
# jupyter-notebook pytorch_amd_gpu_intro.ipynb

Once the notebook opens in the browser, you can use the Run command at the top to move from one cell to the next, and it will print out what it detects. By default the output currently shows the results of one of my runs with an RX 6800.

eitch commented 1 month ago

Sorry for the late response; I only now got around to using this PC again.

The selected GPU should be a 7900 XTX, as glxinfo says:

Extended renderer info (GLX_MESA_query_renderer):
    Vendor: AMD (0x1002)
    Device: AMD Radeon RX 7900 XTX (radeonsi, navi31, LLVM 16.0.6, DRM 3.57, 6.7.10-060710-generic) (0x744c)

And thus I selected gfx1100:

$ cat build_cfg.user 
gfx1100

Running those commands:

$ rocminfo
ROCk module is loaded
Unable to open /dev/kfd read-write: Permission denied
eitch is not member of "render" group, the default DRM access group. Users must be a member of the "render" group or another DRM access group in order for ROCm applications to run successfully.

Now I added myself to that group and still get the following errors:

eitch@eitchtower:~/src/compile_temp/rocm_sdk_builder$ rocminfo
ROCk module is loaded
Unable to open /dev/kfd read-write: Permission denied
eitch is member of render group
eitch@eitchtower:~/src/compile_temp/rocm_sdk_builder$ sudo rocminfo
sudo: rocminfo: command not found
eitch@eitchtower:~/src/compile_temp/rocm_sdk_builder$ which rocminfo
/opt/rocm_sdk_611/bin/rocminfo
eitch@eitchtower:~/src/compile_temp/rocm_sdk_builder$ sudo /opt/rocm_sdk_611/bin/rocminfo
/opt/rocm_sdk_611/bin/rocminfo: error while loading shared libraries: libhsa-runtime64.so.1: cannot open shared object file: No such file or directory

The hello world example compiled, but its test failed:

7 warnings generated when compiling for host.
/opt/rocm_sdk_611/bin/hipcc hello_world.o -fPIE -o hello_world
./hello_world
 System minor: 2002743148
 System major: 1818585135
 Agent name: 
Input string: GdkknVnqkc
Output string: f�pQ
Test failed!
make: *** [Makefile:18: test] Error 1
eLBart0-DTG commented 1 month ago

Unable to open /dev/kfd read-write: Permission denied

I was able to overcome this issue by executing: sudo chmod 666 /dev/kfd. I'm sure it's not the best solution, but at least afterwards rocminfo worked like a charm for me (also on gfx1100):

ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
Runtime Ext Version:     1.4
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 9 3950X 16-Core Processor
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 9 3950X 16-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3500                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            16                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    65759576(0x3eb6958) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    65759576(0x3eb6958) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    65759576(0x3eb6958) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1100                            
  Uuid:                    GPU-3890d72bcfe55867               
  Marketing Name:          AMD Radeon RX 7900 XTX             
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      32(0x20) KB                        
    L2:                      6144(0x1800) KB                    
    L3:                      98304(0x18000) KB                  
  Chip ID:                 29772(0x744c)                      
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2482                               
  BDFID:                   2816                               
  Internal Node ID:        1                                  
  Compute Unit:            96                                 
  SIMDs per CU:            2                                  
  Shader Engines:          6                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 550                                
  SDMA engine uCode::      19                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    25149440(0x17fc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    25149440(0x17fc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1100         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             

I'm not sure, though, whether I would recommend this for a production system, as it may have security implications.
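
A less drastic alternative should be to rely on the /dev/kfd group ownership instead of opening the device to everyone. Roughly (untested on my side, and the group change only takes effect after logging out and back in):

sudo usermod -aG render,video $USER
# log out and back in (or run "newgrp render"), then verify:
ls -l /dev/kfd
groups
rocminfo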

Besides that I have the same problem as @eitch.

lamikr commented 4 weeks ago

For me /dev/kfd is by default

crw-rw-rw- 1 root render 239, 0 May 31 18:51 /dev/kfd

Now that I think about it, I remember fighting with this many releases ago, as the group name used for opening the driver is/was hardcoded in the ROCm code. Some Linux distros used the same group name and others did not. It may be more standardized in newer Linux distros. Can you run

ls -la /dev/kfd

and see what it shows for your group owner?

lamikr commented 4 weeks ago

Btw, I think oneapi/ccl.hpp comes from Intel's oneCCL (oneAPI Collective Communications Library), a CPU-side communication package. I have not yet had time to check why DeepSpeed tries to use it on some systems.

lamikr commented 4 weeks ago

Are the example codes in /opt/rocm_sdk_611/docs/examples working for you?

And since you have a 7900 XTX, it would be nice to know whether this GPU benchmark works for you:

https://github.com/lamikr/pytorch-gpu-benchmark.git

There may be newer versions of it upstream. At the time I tested with it, I needed to make quite a lot of changes to the benchmark calling code to get it working with newer Python and NumPy versions. And the original benchmark launch script only supported nvidia-smi/NVIDIA cards, so I added rocm-smi support.

lamikr commented 4 weeks ago

I can trigger this when building on a virtual machine which does not have the AMD GPU driver exposed via /dev/kfd. There the build starts with:

running build_ext
building 'deepspeed.ops.comm.deepspeed_ccl_comm_op' extension
creating build/temp.linux-x86_64-cpython-39
creating build/temp.linux-x86_64-cpython-39/csrc
creating build/temp.linux-x86_64-cpython-39/csrc/cpu
creating build/temp.linux-x86_64-cpython-39/csrc/cpu/comm

And on my regular computer, where DeepSpeed builds OK, I have these logs:

2024-06-01 22:10:04,866 root [INFO] - running build_ext
2024-06-01 22:10:04,868 root [INFO] - building 'deepspeed.ops.aio.async_io_op' extension
2024-06-01 22:10:04,868 root [INFO] - creating build/temp.linux-x86_64-cpython-39
2024-06-01 22:10:04,868 root [INFO] - creating build/temp.linux-x86_64-cpython-39/csrc
2024-06-01 22:10:04,868 root [INFO] - creating build/temp.linux-x86_64-cpython-39/csrc/aio
2024-06-01 22:10:04,869 root [INFO] - creating build/temp.linux-x86_64-cpython-39/csrc/aio/common
2024-06-01 22:10:04,869 root [INFO] - creating build/temp.linux-x86_64-cpython-39/csrc/aio/py_lib

So in builds that work, the deepspeed.ops.comm.deepspeed_ccl_comm_op extension does not get triggered. And that extension requires these Intel oneAPI libraries, and I have no idea whether they are open source and downloadable from git.
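
If that op is not actually needed here, one possible workaround (untested, and assuming the builder passes the environment through to DeepSpeed's setup) would be to disable it via DeepSpeed's usual DS_BUILD_* pre-build switches before rebuilding:

# assumption: DS_BUILD_CCL_COMM is DeepSpeed's pre-build switch for the deepspeed_ccl_comm op
export DS_BUILD_CCL_COMM=0
./babs.sh -b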

Now that you have the /dev/kfd access rights working, could you try a clean new build by downloading the DeepSpeed source code again and then rebuilding? (Just to check whether the wrong access rights were the problem. These Python packages make changes under the source folder during build time, so it's good to verify that the build is clean.)

# rm -rf src_projects/DeepSpeed
# ./babs.sh -i
# ./babs.sh -b
eitch commented 4 weeks ago

Busy building now, but here are the permissions for kfd:

$ ls -la /dev/kfd
crw-rw---- 234,0 root 30 Mai 15:07 /dev/kfd
eitch commented 4 weeks ago

That benchmark doesn't work:

$ ./test.sh 
./test.sh: line 2: nvidia-smi: command not found
start
end
eitch commented 4 weeks ago

I could build now, but at the end there was still a problem with DeepSpeed:

Using /opt/rocm_sdk_611/lib/python3.9/site-packages/mpmath-1.3.0-py3.9.egg
Finished processing dependencies for deepspeed==0.14.3+4157be23
deepspeed build time = 448.2405641078949 secs
build ok: DeepSpeed

/home/eitch/src/compile_temp/rocm_sdk_builder/builddir/040_02_onnxruntime_deepspeed
installing DeepSpeed
./build/build.sh: line 312: [: cd: binary operator expected
custom install
DeepSpeed, install command 0
cd /home/eitch/src/compile_temp/rocm_sdk_builder/src_projects/DeepSpeed
install cmd ok: DeepSpeed
install ok: DeepSpeed

/home/eitch/src/compile_temp/rocm_sdk_builder/builddir/040_02_onnxruntime_deepspeed
post installing DeepSpeed
no post install commands
post install ok: DeepSpeed

ROCM SDK build and install ready
You can use following commands to test the setup:
source /opt/rocm_sdk_611/bin/env_rocm.sh
rocminfo

And then when I try to execute it, I get this:

$ /opt/rocm_sdk_611/bin/env_rocm.sh
$ rocminfo
Command 'rocminfo' not found, but can be installed with:
sudo apt install rocminfo
lamikr commented 4 weeks ago

Hi. You need to add the word "source" before calling env_rocm.sh, like this:

`source /opt/rocm_sdk_611/bin/env_rocm.sh`

This way, the environment variable changes to PATH, etc. will persist after the script execution ends.
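
Roughly, the difference is:

# running the script executes it in a child shell, so the PATH changes are lost:
/opt/rocm_sdk_611/bin/env_rocm.sh
which rocminfo      # not found
# sourcing it runs it in the current shell, so the changes stick:
source /opt/rocm_sdk_611/bin/env_rocm.sh
which rocminfo      # /opt/rocm_sdk_611/bin/rocminfo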

Sorry, I was not clear enough about which GPU benchmark script to run. For some unknown reason, I have another script there to be used with AMD GPUs:

 `./run_torchvision_gpu_benchmarks.sh`

which has the AMD support. Can you try that one?
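
So roughly:

source /opt/rocm_sdk_611/bin/env_rocm.sh
cd pytorch-gpu-benchmark
./run_torchvision_gpu_benchmarks.sh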

lamikr commented 4 weeks ago

$ ls -la /dev/kfd
crw-rw---- 234,0 root 30 Mai 15:07 /dev/kfd

Hmm, so root owns the /dev/kfd device node in your environment, and there is no group owner?

Btw, do you think that the /dev/kfd permission change plus the source code re-download solved the DeepSpeed build problem for you?

eitch commented 4 weeks ago

Right, I somehow forgot the source command; now it works:

$ rocminfo
ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
Runtime Ext Version:     1.4
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 9 7950X3D 16-Core Processor
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 9 7950X3D 16-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   5759                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            32                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    64986648(0x3df9e18) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    64986648(0x3df9e18) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    64986648(0x3df9e18) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1100                            
  Uuid:                    GPU-f8f54b15c495227e               
  Marketing Name:          AMD Radeon RX 7900 XTX             
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      32(0x20) KB                        
    L2:                      6144(0x1800) KB                    
    L3:                      98304(0x18000) KB                  
  Chip ID:                 29772(0x744c)                      
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2482                               
  BDFID:                   768                                
  Internal Node ID:        1                                  
  Compute Unit:            96                                 
  SIMDs per CU:            2                                  
  Shader Engines:          6                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 102                                
  SDMA engine uCode::      20                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    25149440(0x17fc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    25149440(0x17fc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1100         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*******                  
Agent 3                  
*******                  
  Name:                    gfx1036                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon Graphics                
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      256(0x100) KB                      
  Chip ID:                 5710(0x164e)                       
  ASIC Revision:           1(0x1)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2200                               
  BDFID:                   6656                               
  Internal Node ID:        2                                  
  Compute Unit:            2                                  
  SIMDs per CU:            2                                  
  Shader Engines:          1                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 21                                 
  SDMA engine uCode::      9                                  
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    524288(0x80000) KB                 
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    524288(0x80000) KB                 
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1036         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***  

Now that I call the right benchmark script, some tests work while others don't:

$ ./run_torchvision_gpu_benchmarks.sh
start, count:  4
hip_fatbin.cpp: COMGR API could not find the CO for this GPU device/ISA: amdgcn-amd-amdhsa--gfx1036
hip_fatbin.cpp: COMGR API could not find the CO for this GPU device/ISA: amdgcn-amd-amdhsa--gfx1100
benchmark start : 2024/06/02 21:41:14
Number of GPUs on current device : 2
CUDA Version : None
Cudnn Version : 3001000
Device Name : AMD Radeon RX 7900 XTX
uname_result(system='Linux', node='eitchtower', release='6.8.0-31-generic', version='#31-Ubuntu SMP PREEMPT_DYNAMIC Sat Apr 20 00:40:06 UTC 2024', machine='x86_64')
                     scpufreq(current=3298.4916249999997, min=400.0, max=5759.0)
                    cpu_count: 32
                    memory_available: 32519360512
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:135: UserWarning: Using 'weights' as positional parameter(s) is deprecated since 0.13 and may be removed in the future. Please use keyword parameter(s) instead.
  warnings.warn(
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=MNASNet0_5_Weights.IMAGENET1K_V1`. You can also use `weights=MNASNet0_5_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Downloading: "https://download.pytorch.org/models/mnasnet0.5_top1_67.823-3ffadce67e.pth" to /home/eitch/.cache/torch/hub/checkpoints/mnasnet0.5_top1_67.823-3ffadce67e.pth
100%|█████████████████████████████████████████████████████████████████████████████████████████████| 8.59M/8.59M [00:04<00:00, 2.07MB/s]
Traceback (most recent call last):
  File "/home/eitch/src/compile_temp/pytorch-gpu-benchmark/benchmark_models_torchvision_013.py", line 260, in <module>
    train_result = train(precision)
  File "/home/eitch/src/compile_temp/pytorch-gpu-benchmark/benchmark_models_torchvision_013.py", line 152, in train
    model = nn.DataParallel(model, device_ids=range(args.NUM_GPU))
  File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 159, in __init__
    _check_balance(self.device_ids)
  File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 26, in _check_balance
    dev_props = _get_devices_properties(device_ids)
  File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/_utils.py", line 745, in _get_devices_properties
    return [_get_device_attr(lambda m: m.get_device_properties(i)) for i in device_ids]
  File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/_utils.py", line 745, in <listcomp>
  File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/_utils.py", line 724, in _get_device_attr
    return get_member(torch.cuda)
  File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/_utils.py", line 745, in <lambda>
  File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/cuda/__init__.py", line 447, in get_device_properties
    raise AssertionError("Invalid device id")
AssertionError: Invalid device id
benchmark start : 2024/06/02 21:41:23
                     scpufreq(current=2106.3916249999997, min=400.0, max=5759.0)
                    memory_available: 32967806976
benchmark start : 2024/06/02 21:41:26
                     scpufreq(current=1864.42628125, min=400.0, max=5759.0)
                    memory_available: 33356124160
/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py:33: UserWarning: 
    There is an imbalance between your GPUs. You may want to exclude GPU 1 which
    has less than 75% of the memory or cores of GPU 0. You can do so by setting
    the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
    environment variable.
  warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))
Benchmarking Training float precision type mnasnet0_5 
  File "/home/eitch/src/compile_temp/pytorch-gpu-benchmark/benchmark_models_torchvision_013.py", line 162, in train
    prediction = model(img.to("cuda"))
  File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 184, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 189, in replicate
    return replicate(module, device_ids, not torch.is_grad_enabled())
  File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/nn/parallel/replicate.py", line 110, in replicate
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)
  File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/nn/parallel/replicate.py", line 83, in _broadcast_coalesced_reshape
    tensor_copies = Broadcast.apply(devices, *tensors)
  File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/autograd/function.py", line 598, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/nn/parallel/_functions.py", line 23, in forward
    outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
  File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/nn/parallel/comm.py", line 57, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
benchmark start : 2024/06/02 21:41:31
                     scpufreq(current=3629.71140625, min=400.0, max=5759.0)
                    memory_available: 33505587200
mnasnet0_5 model average train time: 25.01708984375 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=MNASNet0_75_Weights.IMAGENET1K_V1`. You can also use `weights=MNASNet0_75_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/mnasnet0_75-7090bc5f.pth" to /home/eitch/.cache/torch/hub/checkpoints/mnasnet0_75-7090bc5f.pth
100%|██████████████████████████████████████████████████████████████████████████████████████████████| 12.3M/12.3M [00:00<00:00, 129MB/s]
Benchmarking Training float precision type mnasnet0_75 
mnasnet0_75 model average train time: 25.056095123291016 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=MNASNet1_0_Weights.IMAGENET1K_V1`. You can also use `weights=MNASNet1_0_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/mnasnet1.0_top1_73.512-f206786ef8.pth" to /home/eitch/.cache/torch/hub/checkpoints/mnasnet1.0_top1_73.512-f206786ef8.pth
100%|██████████████████████████████████████████████████████████████████████████████████████████████| 16.9M/16.9M [00:00<00:00, 211MB/s]
Benchmarking Training float precision type mnasnet1_0 
mnasnet1_0 model average train time: 28.23315143585205 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=MNASNet1_3_Weights.IMAGENET1K_V1`. You can also use `weights=MNASNet1_3_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/mnasnet1_3-a4c69d6f.pth" to /home/eitch/.cache/torch/hub/checkpoints/mnasnet1_3-a4c69d6f.pth
100%|█████████████████████████████████████████████████████████████████████████████████████████████| 24.2M/24.2M [00:00<00:00, 67.1MB/s]
Benchmarking Training float precision type mnasnet1_3 
mnasnet1_3 model average train time: 35.20075798034668 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet101_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet101_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/resnet101-63fe2227.pth" to /home/eitch/.cache/torch/hub/checkpoints/resnet101-63fe2227.pth
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 171M/171M [00:00<00:00, 220MB/s]
Benchmarking Training float precision type resnet101 
resnet101 model average train time: 50.95588207244873 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet152_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet152_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/resnet152-394f9c45.pth" to /home/eitch/.cache/torch/hub/checkpoints/resnet152-394f9c45.pth
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 230M/230M [00:00<00:00, 330MB/s]
Benchmarking Training float precision type resnet152 
resnet152 model average train time: 69.85043525695801 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet18_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet18_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /home/eitch/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44.7M/44.7M [00:00<00:00, 181MB/s]
Benchmarking Training float precision type resnet18 
resnet18 model average train time: 12.003083229064941 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet34_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet34_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/resnet34-b627a593.pth" to /home/eitch/.cache/torch/hub/checkpoints/resnet34-b627a593.pth
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 83.3M/83.3M [00:00<00:00, 136MB/s]
Benchmarking Training float precision type resnet34 
resnet34 model average train time: 18.16524028778076 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet50_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet50_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /home/eitch/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97.8M/97.8M [00:00<00:00, 303MB/s]
Benchmarking Training float precision type resnet50 
resnet50 model average train time: 33.22430610656738 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNeXt101_32X8D_Weights.IMAGENET1K_V1`. You can also use `weights=ResNeXt101_32X8D_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/resnext101_32x8d-8ba56ff5.pth" to /home/eitch/.cache/torch/hub/checkpoints/resnext101_32x8d-8ba56ff5.pth
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 340M/340M [00:01<00:00, 223MB/s]
Benchmarking Training float precision type resnext101_32x8d 
resnext101_32x8d model average train time: 98.0794906616211 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNeXt101_64X4D_Weights.IMAGENET1K_V1`. You can also use `weights=ResNeXt101_64X4D_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/resnext101_64x4d-173b62eb.pth" to /home/eitch/.cache/torch/hub/checkpoints/resnext101_64x4d-173b62eb.pth
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 319M/319M [00:11<00:00, 28.0MB/s]
Benchmarking Training float precision type resnext101_64x4d 
resnext101_64x4d model average train time: 98.75041961669922 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNeXt50_32X4D_Weights.IMAGENET1K_V1`. You can also use `weights=ResNeXt50_32X4D_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/resnext50_32x4d-7cdf4587.pth" to /home/eitch/.cache/torch/hub/checkpoints/resnext50_32x4d-7cdf4587.pth
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 95.8M/95.8M [00:00<00:00, 133MB/s]
Benchmarking Training float precision type resnext50_32x4d 
resnext50_32x4d model average train time: 40.70699691772461 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=Wide_ResNet101_2_Weights.IMAGENET1K_V1`. You can also use `weights=Wide_ResNet101_2_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/wide_resnet101_2-32ee1156.pth" to /home/eitch/.cache/torch/hub/checkpoints/wide_resnet101_2-32ee1156.pth
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 243M/243M [00:06<00:00, 38.4MB/s]
Benchmarking Training float precision type wide_resnet101_2 
wide_resnet101_2 model average train time: 84.91903305053711 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=Wide_ResNet50_2_Weights.IMAGENET1K_V1`. You can also use `weights=Wide_ResNet50_2_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/wide_resnet50_2-95faca4d.pth" to /home/eitch/.cache/torch/hub/checkpoints/wide_resnet50_2-95faca4d.pth
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 132M/132M [00:00<00:00, 348MB/s]
Benchmarking Training float precision type wide_resnet50_2 
wide_resnet50_2 model average train time: 52.51586437225342 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=DenseNet121_Weights.IMAGENET1K_V1`. You can also use `weights=DenseNet121_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/densenet121-a639ec97.pth" to /home/eitch/.cache/torch/hub/checkpoints/densenet121-a639ec97.pth
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30.8M/30.8M [00:00<00:00, 258MB/s]
Benchmarking Training float precision type densenet121 
densenet121 model average train time: 44.39527988433838 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=DenseNet161_Weights.IMAGENET1K_V1`. You can also use `weights=DenseNet161_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/densenet161-8d451a50.pth" to /home/eitch/.cache/torch/hub/checkpoints/densenet161-8d451a50.pth
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 110M/110M [00:00<00:00, 334MB/s]
Benchmarking Training float precision type densenet161 
densenet161 model average train time: 77.05195903778076 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=DenseNet169_Weights.IMAGENET1K_V1`. You can also use `weights=DenseNet169_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/densenet169-b2777c0a.pth" to /home/eitch/.cache/torch/hub/checkpoints/densenet169-b2777c0a.pth

More benchmarks were still running while I was writing this comment. They don't seem to throw or log any more errors, just some warnings.

lamikr commented 4 weeks ago

Yes, that benchmark is very extensive and runs for a while. It's been a long time since I ran it from start to end, but I remember that the RX 6800 showed pretty good numbers, I think somewhere in the range between an NVIDIA 2080 and 3080.
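
Btw, the "Invalid device id" and NCCL errors in your log look like they come from the integrated gfx1036 being counted as a second GPU next to the 7900 XTX. Just a guess, but restricting the run to the discrete card might avoid them:

# device 0 should be the RX 7900 XTX according to your rocminfo output
export HIP_VISIBLE_DEVICES=0
./run_torchvision_gpu_benchmarks.sh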

I have not checked whether the upstream version of the test has been modernized for newer Python and PyTorch versions. In that case, only the "test.sh" script would probably need to be changed so that it can distinguish between NVIDIA and AMD GPUs. Something like:

if [ -x "$(command -v rocm-smi)" ]; then
    count=`rocm-smi --showproductname --json | wc -l`
    echo "start, count: " ${count}
...
eitch commented 3 weeks ago

Now the test ran through, but I still get the error about nvidia-smi not being installed. How can I view the results?

flip111 commented 3 weeks ago

I also get this issue with https://github.com/lamikr/rocm_sdk_builder/tree/releases/rocm_sdk_builder_611

Should I try the master branch instead?

lamikr commented 3 weeks ago

It should now be fixed in both the master and releases/rocm_sdk_builder_611 branches. Try the following:

git checkout master
git pull
./babs.sh -co
./babs.sh -ap
rm -rf builddir/040_02_onnxruntime_deepspeed
./babs.sh -b

I have now also added a check to the end of the babs.sh script execution that verifies whether the permissions of /dev/kfd are OK. You can also test this by running:

source /opt/rocm_sdk_611/bin/env_setup.sh
rocminfo

If that works, the permissions should be OK and DeepSpeed should build.
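
The added check is roughly along these lines (a sketch, not the exact code in babs.sh):

# warn early if the user cannot read/write the AMD KFD device node
if [ ! -r /dev/kfd ] || [ ! -w /dev/kfd ]; then
    echo "No read/write access to /dev/kfd; add your user to the render group and re-login"
fi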

flip111 commented 3 weeks ago

At the build step I get:

Build failed, application source dir does not exist: /home/flip111/programs/src/rocm_sdk_builder/src_projects/cmake

Perhaps it's better to delete everything and rebuild from scratch? Even though building takes many hours...

lamikr commented 3 weeks ago

Oh, sorry. You can fetch that new repo with the command:

./babs.sh -i

And you can fetch git changes to the old repositories with this command:

./babs.sh -f

This will update at least rocm_smi_lib, where there is now a fix for the git tags so that the library naming gets corrected. I would also force the rebuild of a couple of projects so that only they get rebuilt. So the new list of commands is a little bit longer, unless you want to rebuild everything to verify that all works :-)

git checkout master
git pull
./babs.sh -i
./babs.sh -co
./babs.sh -ap
rm -rf builddir/001_rocm_core/
rm -rf builddir/013_rocm_smi_lib/
rm -rf builddir/040_02_onnxruntime_deepspeed
./babs.sh -b
lamikr commented 3 weeks ago

@eitch I updated the GPU benchmark at https://github.com/lamikr/pytorch-gpu-benchmark with the latest changes from https://github.com/ryujaehun/pytorch-gpu-benchmark. It now runs all tests without exception.

Let's close this thread and continue the discussion about benchmarks there.

https://github.com/lamikr/rocm_sdk_builder/issues/63