Closed: eitch closed this issue 1 week ago
So it is failing for you on the last package to build. I have seen the same error with the DeepSpeed package on Fedora when building on a virtual machine which does not have access to a real GPU, but I have not had time to investigate further. On my Mageia 9 machines with a GPU connected I do not see the error. It could be that the DeepSpeed build falls back to a CPU-only build if no GPU is detected.
A couple of questions:
1) Which gfx did you select in the menu? It is visible in the file "build_cfg.user" in the SDK builder root.
2) As the build has progressed this far, you should now be able to run these commands:
# source /opt/rocm_sdk_611/bin/env_rocm.sh
# rocminfo
After that, rocminfo should print out information about the detected environment, including the GPU. For example, on my laptop it prints: .... Agent 2
Name: gfx1035
Uuid: GPU-XX
Marketing Name: AMD Radeon Graphics
Vendor Name: AMD
...
3) Are you able to try some example applications now? To test the compiler first:
# source /opt/rocm_sdk_611/bin/env_rocm.sh
# cd /opt/rocm_sdk_611/docs/examples/hipcc/hello_world
# make
Then for example:
# cd /opt/rocm_sdk_611/docs/examples/hipcc/hello_world
# jupyter-notebook pytorch_amd_gpu_intro.ipynb
Once the notebook opens in the browser, you can use the Run command at the top to move from one cell to the next, and it will always print out what it detects. By default it currently shows the output of one of my runs with an RX 6800.
Sorry for the late response, only now got around to using this PC again.
The selected GPU should be a 7900 XTX, as glxinfo says:
Extended renderer info (GLX_MESA_query_renderer):
Vendor: AMD (0x1002)
Device: AMD Radeon RX 7900 XTX (radeonsi, navi31, LLVM 16.0.6, DRM 3.57, 6.7.10-060710-generic) (0x744c)
And thus I selected gfx1100:
$ cat build_cfg.user
gfx1100
Running those commands:
$ rocminfo
ROCk module is loaded
Unable to open /dev/kfd read-write: Permission denied
eitch is not member of "render" group, the default DRM access group. Users must be a member of the "render" group or another DRM access group in order for ROCm applications to run successfully.
Now I added myself to that group and then got the following errors:
eitch@eitchtower:~/src/compile_temp/rocm_sdk_builder$ rocminfo
ROCk module is loaded
Unable to open /dev/kfd read-write: Permission denied
eitch is member of render group
eitch@eitchtower:~/src/compile_temp/rocm_sdk_builder$ sudo rocminfo
sudo: rocminfo: command not found
eitch@eitchtower:~/src/compile_temp/rocm_sdk_builder$ which rocminfo
/opt/rocm_sdk_611/bin/rocminfo
eitch@eitchtower:~/src/compile_temp/rocm_sdk_builder$ sudo /opt/rocm_sdk_611/bin/rocminfo
/opt/rocm_sdk_611/bin/rocminfo: error while loading shared libraries: libhsa-runtime64.so.1: cannot open shared object file: No such file or directory
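As an aside, two things likely explain the attempts above: a newly added group membership is only picked up by new login sessions, and `sudo` resets the environment, so the SDK's library path is lost. A sketch of the usual fix (the `render` group name is distro-dependent, so check yours first):

```shell
# add the current user to the DRM access group (group name may vary by distro)
sudo usermod -aG render "$USER"

# group changes apply only to new login sessions; either log out and back in,
# or start a subshell with the new group active:
newgrp render

# verify the active session now lists the group
id -nG
```

Running `rocminfo` under `sudo` fails differently because `sudo` does not inherit the `LD_LIBRARY_PATH` set by `env_rocm.sh`, hence the missing `libhsa-runtime64.so.1`.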
Building the hello_world example failed its test:
7 warnings generated when compiling for host.
/opt/rocm_sdk_611/bin/hipcc hello_world.o -fPIE -o hello_world
./hello_world
System minor: 2002743148
System major: 1818585135
Agent name:
Input string: GdkknVnqkc
Output string: f�pQ
Test failed!
make: *** [Makefile:18: test] Error 1
Unable to open /dev/kfd read-write: Permission denied
I was able to overcome this issue by executing `sudo chmod 666 /dev/kfd`. I'm sure it's not the best solution, but at least afterwards rocminfo worked like a charm for me (also on gfx1100):
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
Runtime Ext Version: 1.4
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen 9 3950X 16-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen 9 3950X 16-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3500
BDFID: 0
Internal Node ID: 0
Compute Unit: 16
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 65759576(0x3eb6958) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 65759576(0x3eb6958) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 65759576(0x3eb6958) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 2
*******
Name: gfx1100
Uuid: GPU-3890d72bcfe55867
Marketing Name: AMD Radeon RX 7900 XTX
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 6144(0x1800) KB
L3: 98304(0x18000) KB
Chip ID: 29772(0x744c)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2482
BDFID: 2816
Internal Node ID: 1
Compute Unit: 96
SIMDs per CU: 2
Shader Engines: 6
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 550
SDMA engine uCode:: 19
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 25149440(0x17fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 25149440(0x17fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1100
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
I'm not sure, though, whether I would recommend this for a production system, as it may have bad security implications.
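If you want relaxed permissions without re-running chmod after every reboot, a udev rule is the usual mechanism; a sketch (the file name is arbitrary, and the `GROUP`/`MODE` values should match your distro's conventions — this is not taken from the SDK's docs):

```shell
# create a udev rule so /dev/kfd comes up group-owned by render
sudo tee /etc/udev/rules.d/70-kfd.rules >/dev/null <<'EOF'
KERNEL=="kfd", GROUP="render", MODE="0660"
EOF

# apply without rebooting
sudo udevadm control --reload-rules
sudo udevadm trigger
```

With `MODE="0660"` plus render-group membership you avoid the world-writable device node that `chmod 666` creates.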
Besides that I have the same problem as @eitch.
For me /dev/kfd is by default
crw-rw-rw- 1 root render 239, 0 May 31 18:51 /dev/kfd
Now that I think about it, I remember fighting with this many releases ago, as the group name used for opening the driver is (or was) hardcoded in the ROCm code. Some Linux distros used the same group name and others did not. It may be more standardized on newer Linux distros. Can you run
ls -la /dev/kfd
and report what it shows for your group owner?
Btw, I think the oneapi/ccl.hpp is some Intel CPU package. I have not yet had time to check why DeepSpeed tries to use it on some systems.
Are the example codes in /opt/rocm_sdk_611/docs/examples working for you?
And from the 7900 XTX it would be nice to know whether this GPU benchmark works for you:
https://github.com/lamikr/pytorch-gpu-benchmark.git
There may be newer versions of it upstream. At the time I tested with it, I needed to make quite a lot of changes to the benchmark's calling code to get it working with newer Python and NumPy versions. And the original benchmark launch script only supported nvidia-smi/NVIDIA cards, so I added rocm-smi support there.
I can trigger this when building on a virtual machine which does not have the AMD GPU driver exposed via /dev/kfd. There the build starts with:
running build_ext
building 'deepspeed.ops.comm.deepspeed_ccl_comm_op' extension
creating build/temp.linux-x86_64-cpython-39
creating build/temp.linux-x86_64-cpython-39/csrc
creating build/temp.linux-x86_64-cpython-39/csrc/cpu
creating build/temp.linux-x86_64-cpython-39/csrc/cpu/comm
And on my regular computer, where DeepSpeed builds OK, I have these logs:
2024-06-01 22:10:04,866 root [INFO] - running build_ext
2024-06-01 22:10:04,868 root [INFO] - building 'deepspeed.ops.aio.async_io_op' extension
2024-06-01 22:10:04,868 root [INFO] - creating build/temp.linux-x86_64-cpython-39
2024-06-01 22:10:04,868 root [INFO] - creating build/temp.linux-x86_64-cpython-39/csrc
2024-06-01 22:10:04,868 root [INFO] - creating build/temp.linux-x86_64-cpython-39/csrc/aio
2024-06-01 22:10:04,869 root [INFO] - creating build/temp.linux-x86_64-cpython-39/csrc/aio/common
2024-06-01 22:10:04,869 root [INFO] - creating build/temp.linux-x86_64-cpython-39/csrc/aio/py_lib
So in builds that work, the deepspeed.ops.comm.deepspeed_ccl_comm_op extension does not get triggered. And that extension requires those Intel oneAPI libraries; I have no idea whether they are open source and downloadable from git.
Now that you have the /dev/kfd access rights working, could you try a clean new build by downloading the DeepSpeed source code again and then rebuilding? (Just to check whether the wrong access rights were the problem. These Python packages make changes under the source folder during build time, so it's good to verify that the build is clean.)
# rm -rf src_projects/DeepSpeed
# ./babs.sh -i
# ./babs.sh -b
Busy building now, but here are the permissions for kfd:
$ ls -la /dev/kfd
crw-rw---- 234,0 root 30 Mai 15:07 /dev/kfd
That benchmark doesn't work:
$ ./test.sh
./test.sh: line 2: nvidia-smi: command not found
start
end
I could build now, but at the end there was a problem with deepspeed:
Using /opt/rocm_sdk_611/lib/python3.9/site-packages/mpmath-1.3.0-py3.9.egg
Finished processing dependencies for deepspeed==0.14.3+4157be23
deepspeed build time = 448.2405641078949 secs
build ok: DeepSpeed
/home/eitch/src/compile_temp/rocm_sdk_builder/builddir/040_02_onnxruntime_deepspeed
installing DeepSpeed
./build/build.sh: line 312: [: cd: binary operator expected
custom install
DeepSpeed, install command 0
cd /home/eitch/src/compile_temp/rocm_sdk_builder/src_projects/DeepSpeed
install cmd ok: DeepSpeed
install ok: DeepSpeed
/home/eitch/src/compile_temp/rocm_sdk_builder/builddir/040_02_onnxruntime_deepspeed
post installing DeepSpeed
no post install commands
post install ok: DeepSpeed
ROCM SDK build and install ready
You can use following commands to test the setup:
source /opt/rocm_sdk_611/bin/env_rocm.sh
rocminfo
And then when I try to execute it, I get this:
$ /opt/rocm_sdk_611/bin/env_rocm.sh
$ rocminfo
Command 'rocminfo' not found, but can be installed with:
sudo apt install rocminfo
Hi. You need to add the "source" word before calling env_rocm.sh, like this:
`source /opt/rocm_sdk_611/bin/env_rocm.sh`
This way the environment variable changes to PATH, etc. will persist after the script execution ends.
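The difference is easy to demonstrate with a throwaway script; a minimal sketch:

```shell
# a child process cannot modify the parent shell's environment
cat > /tmp/env_demo.sh <<'EOF'
export DEMO_VAR=hello
EOF

bash /tmp/env_demo.sh          # runs in a subshell; DEMO_VAR does not survive
echo "${DEMO_VAR:-unset}"      # prints: unset

source /tmp/env_demo.sh        # runs in the current shell; DEMO_VAR persists
echo "${DEMO_VAR:-unset}"      # prints: hello
```

This is exactly why `rocminfo` was "not found" earlier: the PATH changes made by env_rocm.sh vanished with the subshell that executed it.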
Sorry I was not clear enough about which GPU benchmark script to run. For some unknown reason, there is another script to be used with AMD GPUs:
`./run_torchvision_gpu_benchmarks.sh`
which has the AMD support. Can you try with that one?
$ ls -la /dev/kfd
crw-rw---- 234,0 root 30 Mai 15:07 /dev/kfd
Hmm, so root owns the /dev/kfd device node in your environment, and there is no group owner?
Btw, do you think that the /dev/kfd permission change plus the source code re-download solved the DeepSpeed build problem for you?
Right, I somehow forgot the source command; now it works:
$ rocminfo
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
Runtime Ext Version: 1.4
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen 9 7950X3D 16-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen 9 7950X3D 16-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 5759
BDFID: 0
Internal Node ID: 0
Compute Unit: 32
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 64986648(0x3df9e18) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 64986648(0x3df9e18) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 64986648(0x3df9e18) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 2
*******
Name: gfx1100
Uuid: GPU-f8f54b15c495227e
Marketing Name: AMD Radeon RX 7900 XTX
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 6144(0x1800) KB
L3: 98304(0x18000) KB
Chip ID: 29772(0x744c)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2482
BDFID: 768
Internal Node ID: 1
Compute Unit: 96
SIMDs per CU: 2
Shader Engines: 6
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 102
SDMA engine uCode:: 20
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 25149440(0x17fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 25149440(0x17fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1100
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*******
Agent 3
*******
Name: gfx1036
Uuid: GPU-XX
Marketing Name: AMD Radeon Graphics
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 2
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 256(0x100) KB
Chip ID: 5710(0x164e)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2200
BDFID: 6656
Internal Node ID: 2
Compute Unit: 2
SIMDs per CU: 2
Shader Engines: 1
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 21
SDMA engine uCode:: 9
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 524288(0x80000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 524288(0x80000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1036
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
Now when I call the right benchmark script, some tests work while others don't:
$ ./run_torchvision_gpu_benchmarks.sh
start, count: 4
hip_fatbin.cpp: COMGR API could not find the CO for this GPU device/ISA: amdgcn-amd-amdhsa--gfx1036
hip_fatbin.cpp: COMGR API could not find the CO for this GPU device/ISA: amdgcn-amd-amdhsa--gfx1100
benchmark start : 2024/06/02 21:41:14
Number of GPUs on current device : 2
CUDA Version : None
Cudnn Version : 3001000
Device Name : AMD Radeon RX 7900 XTX
uname_result(system='Linux', node='eitchtower', release='6.8.0-31-generic', version='#31-Ubuntu SMP PREEMPT_DYNAMIC Sat Apr 20 00:40:06 UTC 2024', machine='x86_64')
scpufreq(current=3298.4916249999997, min=400.0, max=5759.0)
cpu_count: 32
memory_available: 32519360512
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:135: UserWarning: Using 'weights' as positional parameter(s) is deprecated since 0.13 and may be removed in the future. Please use keyword parameter(s) instead.
warnings.warn(
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=MNASNet0_5_Weights.IMAGENET1K_V1`. You can also use `weights=MNASNet0_5_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
Downloading: "https://download.pytorch.org/models/mnasnet0.5_top1_67.823-3ffadce67e.pth" to /home/eitch/.cache/torch/hub/checkpoints/mnasnet0.5_top1_67.823-3ffadce67e.pth
100%|█████████████████████████████████████████████████████████████████████████████████████████████| 8.59M/8.59M [00:04<00:00, 2.07MB/s]
Traceback (most recent call last):
File "/home/eitch/src/compile_temp/pytorch-gpu-benchmark/benchmark_models_torchvision_013.py", line 260, in <module>
train_result = train(precision)
File "/home/eitch/src/compile_temp/pytorch-gpu-benchmark/benchmark_models_torchvision_013.py", line 152, in train
model = nn.DataParallel(model, device_ids=range(args.NUM_GPU))
File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 159, in __init__
_check_balance(self.device_ids)
File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 26, in _check_balance
dev_props = _get_devices_properties(device_ids)
File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/_utils.py", line 745, in _get_devices_properties
return [_get_device_attr(lambda m: m.get_device_properties(i)) for i in device_ids]
File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/_utils.py", line 745, in <listcomp>
File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/_utils.py", line 724, in _get_device_attr
return get_member(torch.cuda)
File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/_utils.py", line 745, in <lambda>
File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/cuda/__init__.py", line 447, in get_device_properties
raise AssertionError("Invalid device id")
AssertionError: Invalid device id
benchmark start : 2024/06/02 21:41:23
scpufreq(current=2106.3916249999997, min=400.0, max=5759.0)
memory_available: 32967806976
benchmark start : 2024/06/02 21:41:26
scpufreq(current=1864.42628125, min=400.0, max=5759.0)
memory_available: 33356124160
/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py:33: UserWarning:
There is an imbalance between your GPUs. You may want to exclude GPU 1 which
has less than 75% of the memory or cores of GPU 0. You can do so by setting
the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
environment variable.
warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))
Benchmarking Training float precision type mnasnet0_5
File "/home/eitch/src/compile_temp/pytorch-gpu-benchmark/benchmark_models_torchvision_013.py", line 162, in train
prediction = model(img.to("cuda"))
File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 184, in forward
replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 189, in replicate
return replicate(module, device_ids, not torch.is_grad_enabled())
File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/nn/parallel/replicate.py", line 110, in replicate
param_copies = _broadcast_coalesced_reshape(params, devices, detach)
File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/nn/parallel/replicate.py", line 83, in _broadcast_coalesced_reshape
tensor_copies = Broadcast.apply(devices, *tensors)
File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/autograd/function.py", line 598, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/nn/parallel/_functions.py", line 23, in forward
outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
File "/opt/rocm_sdk_611/lib/python3.9/site-packages/torch/nn/parallel/comm.py", line 57, in broadcast_coalesced
return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
benchmark start : 2024/06/02 21:41:31
scpufreq(current=3629.71140625, min=400.0, max=5759.0)
memory_available: 33505587200
mnasnet0_5 model average train time: 25.01708984375 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=MNASNet0_75_Weights.IMAGENET1K_V1`. You can also use `weights=MNASNet0_75_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/mnasnet0_75-7090bc5f.pth" to /home/eitch/.cache/torch/hub/checkpoints/mnasnet0_75-7090bc5f.pth
100%|██████████████████████████████████████████████████████████████████████████████████████████████| 12.3M/12.3M [00:00<00:00, 129MB/s]
Benchmarking Training float precision type mnasnet0_75
mnasnet0_75 model average train time: 25.056095123291016 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=MNASNet1_0_Weights.IMAGENET1K_V1`. You can also use `weights=MNASNet1_0_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/mnasnet1.0_top1_73.512-f206786ef8.pth" to /home/eitch/.cache/torch/hub/checkpoints/mnasnet1.0_top1_73.512-f206786ef8.pth
100%|██████████████████████████████████████████████████████████████████████████████████████████████| 16.9M/16.9M [00:00<00:00, 211MB/s]
Benchmarking Training float precision type mnasnet1_0
mnasnet1_0 model average train time: 28.23315143585205 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=MNASNet1_3_Weights.IMAGENET1K_V1`. You can also use `weights=MNASNet1_3_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/mnasnet1_3-a4c69d6f.pth" to /home/eitch/.cache/torch/hub/checkpoints/mnasnet1_3-a4c69d6f.pth
100%|█████████████████████████████████████████████████████████████████████████████████████████████| 24.2M/24.2M [00:00<00:00, 67.1MB/s]
Benchmarking Training float precision type mnasnet1_3
mnasnet1_3 model average train time: 35.20075798034668 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet101_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet101_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/resnet101-63fe2227.pth" to /home/eitch/.cache/torch/hub/checkpoints/resnet101-63fe2227.pth
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 171M/171M [00:00<00:00, 220MB/s]
Benchmarking Training float precision type resnet101
resnet101 model average train time: 50.95588207244873 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet152_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet152_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/resnet152-394f9c45.pth" to /home/eitch/.cache/torch/hub/checkpoints/resnet152-394f9c45.pth
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 230M/230M [00:00<00:00, 330MB/s]
Benchmarking Training float precision type resnet152
resnet152 model average train time: 69.85043525695801 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet18_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet18_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/resnet18-f37072fd.pth" to /home/eitch/.cache/torch/hub/checkpoints/resnet18-f37072fd.pth
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44.7M/44.7M [00:00<00:00, 181MB/s]
Benchmarking Training float precision type resnet18
resnet18 model average train time: 12.003083229064941 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet34_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet34_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/resnet34-b627a593.pth" to /home/eitch/.cache/torch/hub/checkpoints/resnet34-b627a593.pth
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 83.3M/83.3M [00:00<00:00, 136MB/s]
Benchmarking Training float precision type resnet34
resnet34 model average train time: 18.16524028778076 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNet50_Weights.IMAGENET1K_V1`. You can also use `weights=ResNet50_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /home/eitch/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97.8M/97.8M [00:00<00:00, 303MB/s]
Benchmarking Training float precision type resnet50
resnet50 model average train time: 33.22430610656738 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNeXt101_32X8D_Weights.IMAGENET1K_V1`. You can also use `weights=ResNeXt101_32X8D_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/resnext101_32x8d-8ba56ff5.pth" to /home/eitch/.cache/torch/hub/checkpoints/resnext101_32x8d-8ba56ff5.pth
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 340M/340M [00:01<00:00, 223MB/s]
Benchmarking Training float precision type resnext101_32x8d
resnext101_32x8d model average train time: 98.0794906616211 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNeXt101_64X4D_Weights.IMAGENET1K_V1`. You can also use `weights=ResNeXt101_64X4D_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/resnext101_64x4d-173b62eb.pth" to /home/eitch/.cache/torch/hub/checkpoints/resnext101_64x4d-173b62eb.pth
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 319M/319M [00:11<00:00, 28.0MB/s]
Benchmarking Training float precision type resnext101_64x4d
resnext101_64x4d model average train time: 98.75041961669922 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ResNeXt50_32X4D_Weights.IMAGENET1K_V1`. You can also use `weights=ResNeXt50_32X4D_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/resnext50_32x4d-7cdf4587.pth" to /home/eitch/.cache/torch/hub/checkpoints/resnext50_32x4d-7cdf4587.pth
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 95.8M/95.8M [00:00<00:00, 133MB/s]
Benchmarking Training float precision type resnext50_32x4d
resnext50_32x4d model average train time: 40.70699691772461 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=Wide_ResNet101_2_Weights.IMAGENET1K_V1`. You can also use `weights=Wide_ResNet101_2_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/wide_resnet101_2-32ee1156.pth" to /home/eitch/.cache/torch/hub/checkpoints/wide_resnet101_2-32ee1156.pth
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 243M/243M [00:06<00:00, 38.4MB/s]
Benchmarking Training float precision type wide_resnet101_2
wide_resnet101_2 model average train time: 84.91903305053711 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=Wide_ResNet50_2_Weights.IMAGENET1K_V1`. You can also use `weights=Wide_ResNet50_2_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/wide_resnet50_2-95faca4d.pth" to /home/eitch/.cache/torch/hub/checkpoints/wide_resnet50_2-95faca4d.pth
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 132M/132M [00:00<00:00, 348MB/s]
Benchmarking Training float precision type wide_resnet50_2
wide_resnet50_2 model average train time: 52.51586437225342 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=DenseNet121_Weights.IMAGENET1K_V1`. You can also use `weights=DenseNet121_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/densenet121-a639ec97.pth" to /home/eitch/.cache/torch/hub/checkpoints/densenet121-a639ec97.pth
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30.8M/30.8M [00:00<00:00, 258MB/s]
Benchmarking Training float precision type densenet121
densenet121 model average train time: 44.39527988433838 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=DenseNet161_Weights.IMAGENET1K_V1`. You can also use `weights=DenseNet161_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/densenet161-8d451a50.pth" to /home/eitch/.cache/torch/hub/checkpoints/densenet161-8d451a50.pth
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 110M/110M [00:00<00:00, 334MB/s]
Benchmarking Training float precision type densenet161
densenet161 model average train time: 77.05195903778076 ms
/opt/rocm_sdk_611/lib/python3.9/site-packages/torchvision-0.18.0a0+a60a153-py3.9-linux-x86_64.egg/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=DenseNet169_Weights.IMAGENET1K_V1`. You can also use `weights=DenseNet169_Weights.DEFAULT` to get the most up-to-date weights.
Downloading: "https://download.pytorch.org/models/densenet169-b2777c0a.pth" to /home/eitch/.cache/torch/hub/checkpoints/densenet169-b2777c0a.pth
More benchmarks were still running while writing this comment. They do not seem to throw or log any more errors, just some warnings.
Yes, that benchmark is very extensive and runs for a while. It's been a long time since I ran it from start to end, but I remember that the RX 6800 showed pretty good numbers, somewhere in the range between an NVIDIA 2080 and 3080.
I have not checked whether the upstream version of the test has been modernized for newer Python and PyTorch versions. In that case only the "test.sh" script would probably need to be changed so that it can detect between NVIDIA and AMD GPUs. Something like:
if [ -x "$(command -v rocm-smi)" ]; then
    count=$(rocm-smi --showproductname --json | wc -l)
    echo "start, count: ${count}"
    ...
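A fuller sketch of that detection idea could look like the following. This is only a sketch: it assumes rocm-smi and nvidia-smi are the query tools present when the respective GPU stack is installed, and the function name detect_gpu_vendor is made up for illustration.

```shell
# Sketch: pick a GPU vendor in test.sh by checking which query tool
# is on the PATH. Falls back to "none" when neither stack is found.
detect_gpu_vendor() {
    if command -v rocm-smi >/dev/null 2>&1; then
        echo "amd"
    elif command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia"
    else
        echo "none"
    fi
}

vendor=$(detect_gpu_vendor)
echo "detected GPU vendor: ${vendor}"
```

The benchmark script could then branch on ${vendor} to decide which product-name query to run.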
Now the test ran through, but I get the error about nvidia-smi not being installed. How can I view the results?
I also get this issue with https://github.com/lamikr/rocm_sdk_builder/tree/releases/rocm_sdk_builder_611
Should I try the master branch instead?
It should now be fixed in both the master and releases/rocm_sdk_builder_611 branches. Try the following:
git checkout master
git pull
./babs.sh -co
./babs.sh -ap
rm -rf builddir/040_02_onnxruntime_deepspeed
./babs.sh -b
I have now added a check to the end of the babs.sh script execution that verifies the permissions of /dev/kfd are OK. You can also test this yourself by running:
source /opt/rocm_sdk_611/bin/env_setup.sh
rocminfo
If that works, the permissions should be OK and DeepSpeed should build.
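The kind of permission check described above could be sketched roughly like this. This is an illustration, not the exact code added to babs.sh; the function name check_kfd_access is made up. The render-group advice mirrors the error message rocminfo prints when access is denied.

```shell
# Rough sketch of a /dev/kfd permission check. /dev/kfd is the ROCm
# compute device node; it is normally owned by the "render" group,
# which the user must belong to for ROCm applications to run.
check_kfd_access() {
    if [ ! -e /dev/kfd ]; then
        echo "missing"      # amdgpu/ROCm kernel driver not loaded
    elif [ -r /dev/kfd ] && [ -w /dev/kfd ]; then
        echo "ok"           # ROCm applications should be able to run
    else
        echo "no-access"    # e.g. user is not in the render group
    fi
}

status=$(check_kfd_access)
echo "/dev/kfd status: ${status}"
if [ "${status}" = "no-access" ]; then
    echo "try: sudo usermod -aG render \$USER (then log out and back in)"
fi
```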
At the build step I get:
Build failed, application source dir does not exist: /home/flip111/programs/src/rocm_sdk_builder/src_projects/cmake
Perhaps it's better to delete everything and rebuild from scratch? Even though building takes many hours...
Oh, sorry. You can fetch that new repo with the command:
./babs.sh -i
And you can fetch git changes to the old repositories with:
./babs.sh -f
This will update at least rocm_smi_lib, which now has a fix for the git tags so that the library naming gets corrected. I would also force a rebuild of a couple of projects so that only they get rebuilt. So the new list of commands is a little bit longer, unless you want to rebuild everything to verify that it all works :-)
git checkout master
git pull
./babs.sh -i
./babs.sh -co
./babs.sh -ap
rm -rf builddir/001_rocm_core/
rm -rf builddir/013_rocm_smi_lib/
rm -rf builddir/040_02_onnxruntime_deepspeed
./babs.sh -b
@eitch I updated the gpu benchmark on https://github.com/lamikr/pytorch-gpu-benchmark with the latest changes from https://github.com/ryujaehun/pytorch-gpu-benchmark. It now runs all tests without exceptions.
Let's close this thread and continue the discussion about benchmarks there.
While running
./babs.sh -b
I received this error (I'm running on Ubuntu):
And I'm using an RX 7900 XTX.