lamikr / rocm_sdk_builder

./run_pytorch_gpu_simple_test.sh fails after successful build (gfx1010) #98

Open silicium42 opened 4 months ago

silicium42 commented 4 months ago

I am using Ubuntu 22.04 with an AMD RX 5700 graphics card (gfx1010), with the driver installed via amdgpu-install from the repo.radeon.com repository for version 6.1.3 (amdgpu-install --usecase=graphics). In the babs.sh -i step I selected the gfx1010 target and used no HSA_OVERRIDE_GFX_VERSION. After a few tries and executing sudo apt install libstdc++-12-dev libgfortran-12-dev gfortran-12, the whole project compiled in about 16 hours (probably took so long due to 16 GB RAM). The babs.sh -b command reports success, and rocminfo outputs the following:

ROCk module version 6.7.0 is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
Runtime Ext Version:     1.4
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz
  Uuid:                    CPU-XX                             
  Marketing Name:          Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3600                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            12                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    32690056(0x1f2cf88) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    32690056(0x1f2cf88) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    32690056(0x1f2cf88) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1010                            
  Uuid:                    GPU-XX                             
  Marketing Name:          AMD Radeon RX 5700                 
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      4096(0x1000) KB                    
  Chip ID:                 29471(0x731f)                      
  ASIC Revision:           2(0x2)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1750                               
  BDFID:                   1792                               
  Internal Node ID:        1                                  
  Compute Unit:            36                                 
  SIMDs per CU:            2                                  
  Shader Engines:          2                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    1280(0x500)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 149                                
  SDMA engine uCode::      35                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    8372224(0x7fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1010:xnack-  
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***

but the pytorch example exits almost immediately:

./run_pytorch_gpu_simple_test.sh
hip_fatbin.cpp: COMGR API could not find the CO for this GPU device/ISA: amdgcn-amd-amdhsa--gfx1010:xnack-
hip_fatbin.cpp: COMGR API could not find the CO for this GPU device/ISA: amdgcn-amd-amdhsa--gfx1010:xnack-
tensor([-0.8387], device='cuda:0')

The other examples mentioned in the README.md seem to work fine / don't crash, although I don't know exactly what output to expect. I have tried the releases/rocm_sdk_builder_611 and releases/rocm_sdk_builder_612 branches without any luck so far. Unfortunately I have no idea whether this is caused by a driver problem, a configuration problem, or something else. The README.md states that the RX 5700 has been tested, but there is no mention of a modified build/install procedure or a specific branch to use. I would appreciate any information on what could be causing this (I suspect maybe aotriton, but I know very little about ROCm).

lamikr commented 4 months ago

Hi, thanks for testing. It seems that the application is actually working ok despite the messages from hip_fatbin.cpp. Those messages are more like warnings, which occur because some modules are not prebuilt for all cards.

I should probably change the wording a little, or in the future print them only if some environment variable is set. Most of the examples I have included are quite simple, just to verify that the stack does not have problems.
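If you want a quick sanity check by hand, something along these lines should print a tensor on the GPU just like the simple test does (this is only a sketch, not the contents of the actual script):

source /opt/rocm_sdk_612/bin/env_rocm.sh
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
python -c "import torch; print(torch.rand(1, device='cuda'))"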

One app you could try to test with your setup pretty easily is whisper, which can transcribe words from music. Its usage should be quite easy:

source /opt/rocm_sdk_612/bin/env_rocm.sh
pip3 install openai-whisper
whisper --model small song.mp3

You should also be able to change the "small" model to something else.
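Whisper also takes a few other flags if you want to experiment, for example forcing the language (the model name and file name here are just placeholders):

whisper --model medium --language en song.mp3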

If you have some ideas for apps to test, I would like to get more feedback to https://github.com/lamikr/rocm_sdk_builder/issues/96

silicium42 commented 4 months ago

Hi, thanks for testing. It seems that the application is actually working ok despite the messages from hip_fatbin.cpp. Those messages are more like warnings, which occur because some modules are not prebuilt for all cards.

Oh well, then I was worrying about nothing, but it's good to hear that it is actually working.

One app you could try to test with your setup pretty easily is whisper, which can transcribe words from music.

I have tested whisper and it seems to work, at least it outputs some lyrics.

If you have some ideas for apps to test, I would like to get more feedback to #96

I tried stable diffusion with SD.Next using the env_rocm.sh script, but it failed to generate an image, throwing RuntimeError: HIP error: invalid device function. When it starts it also complains about a missing module called 'flash_attn'. That is what I am mainly trying to do right now, so an integrated version would be nice as well. If there are some other apps that need testing I'd be happy to help! Edit: (it seems I forgot to clear the venv for SD.Next since I used it last. Now it complains about needing python 3.10 or 3.11)

lamikr commented 4 months ago

What does it show for you if you run the commands:

$ source /opt/rocm_sdk_612/bin/env_rocm.sh
$ which python
$ python --version

Not sure whether @daniandtheweb has tested stable diffusion with rocm. I have recently mostly run pytorch audio transformation tests and some image recognition test apps. I hope we can integrate some good stable diffusion app into the build soon.

silicium42 commented 4 months ago

output of which python: /opt/rocm_sdk_612/bin/python

output of python --version: Python 3.9.19

Not sure whether @daniandtheweb has tested stable diffusion with rocm. I have recently mostly run pytorch audio transformation tests and some image recognition test apps. I hope we can integrate some good stable diffusion app into the build soon.

Thanks, I'll take a look. I don't suppose there is an easy way to change the python version?

daniandtheweb commented 4 months ago

For me everything works fine; be careful with SD.Next's settings, as some work quite badly on AMD hardware in general.

My best advice for running it is leaving most diffusers stuff at stock and just enabling medvram.

I advise you to try ComfyUI; it has a higher learning curve than SD.Next, but the settings are minimal and there's much less chance of messing something up.

silicium42 commented 4 months ago

For me everything works fine; be careful with SD.Next's settings, as some work quite badly on AMD hardware in general.

My best advice for running it is leaving most diffusers stuff at stock and just enabling medvram.

I would do that, but since I have cleared the venv it doesn't even reinitialise when I start webui.sh:

01:28:00-575846 ERROR    Incompatible Python version: 3.9.19 required 3.[10, 11]                     
01:28:00-577358 ERROR    ROCm or ZLUDA backends require Python 3.10 or 3.11

I advise you to try ComfyUI; it has a higher learning curve than SD.Next, but the settings are minimal and there's much less chance of messing something up.

I was thinking about trying ComfyUI as well but I haven't yet. I'll definitely look into it soon. Do you think it will work with python 3.9.19 by default, or do I need to do something?

lamikr commented 4 months ago

We just updated our rocm sdk builder code yesterday to use python 3.11, but that would now require you to do a new build :-( Unfortunately the python version update is such a big change that basically everything needs to be rebuilt.

If you can wait for one day, I could get a couple more good python fixes in. I can then guide you to update the source code and rebuild.

daniandtheweb commented 4 months ago

SD.Next removed support for Python 3.9 not long ago; that's one of the reasons I started working on the Python update here. If you want to run it as-is you'll have to modify SD.Next's launch file, but it still may not work properly. You can use ComfyUI (it should work on Python 3.9) or just wait for @lamikr to push some new fixes and help you update.

silicium42 commented 4 months ago

We just updated our rocm sdk builder code yesterday to use python 3.11, but that would now require you to do a new build :-( Unfortunately the python version update is such a big change that basically everything needs to be rebuilt.

That's what I suspected :( Did the 6.1.1 release have a newer python though? Because I can't figure out what I did (wrong) to make SD.Next start up before.

If you can wait for one day, I could get a couple more good python fixes in. I can then guide you to update the source code and rebuild.

I'm not in any rush, just playing around trying to learn, so I have no problem with waiting. Thanks for your help!

silicium42 commented 4 months ago

I am happy to report that ComfyUI worked for me as well, but since I'm not too familiar with it, I couldn't test a lot of features. At least the default settings worked and generated images successfully with an SD 1.5 model.

daniandtheweb commented 4 months ago

Try reverting SD.Next to this commit: 0680a88.

git checkout 0680a88

This should revert SD.Next to right before the new Python check was implemented.
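To get back to the latest SD.Next code later (assuming its default branch is master), something like this should do:

git checkout master
git pull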

lamikr commented 4 months ago

All python fixes are now in place. To do a fresh build without downloading everything, these steps should give you a good build with python 3.11:

cd rocm_sdk_builder
git checkout master
git pull
./babs.sh -i
./babs.sh -f
./babs.sh -co
./babs.sh -ap
sudo rm -rf /opt/rocm_sdk_612 
rm -rf builddir
./babs.sh -b

If you want to keep the old build just in case, you can rename the /opt/rocm_sdk_612 folder instead of deleting it.
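For example (the backup folder name is just a suggestion):

sudo mv /opt/rocm_sdk_612 /opt/rocm_sdk_612_py39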

lamikr commented 4 months ago

Btw, not sure whether this benchmark runs on the rx 5600, but it would be interesting to know the results with both python 3.9 and python 3.11.

https://github.com/lamikr/pytorch-gpu-benchmark

After running the benchmark it will store files that need to be copied. For example, Eitch sent his results from a 7900 XTX a couple of weeks ago in https://github.com/lamikr/pytorch-gpu-benchmark/pull/1

daniandtheweb commented 4 months ago

Btw, not sure whether this benchmark runs on the rx 5600, but it would be interesting to know the results with both python 3.9 and python 3.11.

https://github.com/lamikr/pytorch-gpu-benchmark

After running the benchmark it will store files that need to be copied. For example, Eitch sent his results from a 7900 XTX a couple of weeks ago in lamikr/pytorch-gpu-benchmark#1

I tried running it some time ago on my 5700 XT and it didn't work (I can only guess it could be related to the unofficial support status of ROCm for the card, and maybe some other fix is needed; it should be the same for the 5600). I'll try it again after the build I've just started completes.

daniandtheweb commented 4 months ago

This is the error when using pytorch-gpu-benchmark on the 5700 XT:

AMD gpu benchmarks starting
GPU count:  1
hip_fatbin.cpp: COMGR API could not find the CO for this GPU device/ISA: amdgcn-amd-amdhsa--gfx1010:xnack-
hip_fatbin.cpp: COMGR API could not find the CO for this GPU device/ISA: amdgcn-amd-amdhsa--gfx1010:xnack-
benchmark start : 2024/07/06 14:43:39
Number of GPUs on current device : 1
CUDA Version : None
Cudnn Version : 3001000
Device Name : AMD Radeon RX 5700 XT
uname_result(system='Linux', node='designare', release='6.9.7-zen1-1-zen', version='#1 ZEN SMP PREEMPT_DYNAMIC Fri, 28 Jun 2024 04:32:27 +0000', machine='x86_64')
                     scpufreq(current=2750.11425, min=800.0, max=4900.0)
                    cpu_count: 8
                    memory_available: 26859737088
Benchmarking Training float precision type mnasnet0_5 
<inline asm>:14:20: error: not a valid operand.
v_add_f32 v4 v4 v4 row_bcast:15 row_mask:0xa
                   ^
<inline asm>:15:20: error: not a valid operand.
v_add_f32 v3 v3 v3 row_bcast:15 row_mask:0xa
                   ^
<inline asm>:17:20: error: not a valid operand.
v_add_f32 v4 v4 v4 row_bcast:31 row_mask:0xc
                   ^
<inline asm>:18:20: error: not a valid operand.
v_add_f32 v3 v3 v3 row_bcast:31 row_mask:0xc
                   ^
MIOpen(HIP): Error [Do] 'amd_comgr_do_action(kind, handle, in.GetHandle(), out.GetHandle())' AMD_COMGR_ACTION_CODEGEN_BC_TO_RELOCATABLE: ERROR (1)
MIOpen(HIP): Error [BuildOcl] comgr status = ERROR (1)
MIOpen(HIP): Warning [BuildOcl] error: cannot compile inline asm
error: cannot compile inline asm
error: cannot compile inline asm
error: cannot compile inline asm
4 errors generated.

MIOpen Error: /home/daniandtheweb/WorkSpace/rocm_sdk_builder/src_projects/MIOpen/src/hipoc/hipoc_program.cpp:294: Code object build failed. Source: MIOpenBatchNormFwdTrainSpatial.cl
Traceback (most recent call last):
  File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/benchmark_models.py", line 200, in <module>
    train_result = train(precision)
                   ^^^^^^^^^^^^^^^^
  File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/benchmark_models.py", line 105, in train
    prediction = model(img.to("cuda"))
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/venv/lib/python3.11/site-packages/torchvision/models/mnasnet.py", line 159, in forward
    x = self.layers(x)
        ^^^^^^^^^^^^^^
  File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/venv/lib/python3.11/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
            ^^^^^^^^^^^^^
  File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/venv/lib/python3.11/site-packages/torch/nn/modules/batchnorm.py", line 175, in forward
    return F.batch_norm(
           ^^^^^^^^^^^^^
  File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/venv/lib/python3.11/site-packages/torch/nn/functional.py", line 2509, in batch_norm
    return torch.batch_norm(
           ^^^^^^^^^^^^^^^^^
RuntimeError: miopenStatusUnknownError
AMD GPU benchmarks finished
silicium42 commented 4 months ago

Btw, not sure whether this benchmark runs on the rx 5600, but it would be interesting to know the results with both python 3.9 and python 3.11.

I have run the test with the python 3.9 version and it fails:

./test.sh
AMD gpu benchmarks starting
GPU count:  1
hip_fatbin.cpp: COMGR API could not find the CO for this GPU device/ISA: amdgcn-amd-amdhsa--gfx1010:xnack-
hip_fatbin.cpp: COMGR API could not find the CO for this GPU device/ISA: amdgcn-amd-amdhsa--gfx1010:xnack-
[2024-07-06 15:19:48,864] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn is not compatible with ROCM
benchmark start : 2024/07/06 15:20:03
Number of GPUs on current device : 1
CUDA Version : None
Cudnn Version : 3001000
Device Name : AMD Radeon RX 5700
uname_result(system='Linux', node='ubuntu-sd', release='6.5.0-41-generic', version='#41~22.04.2-Ubuntu SMP PREEMPT_DYNAMIC Mon Jun  3 11:32:55 UTC 2', machine='x86_64')
                     scpufreq(current=1600.0420833333335, min=1200.0, max=3600.0)
                    cpu_count: 12
                    memory_available: 30556745728
Benchmarking Training float precision type mnasnet0_5 
<inline asm>:14:20: error: not a valid operand.
v_add_f32 v4 v4 v4 row_bcast:15 row_mask:0xa
                   ^
<inline asm>:15:20: error: not a valid operand.
v_add_f32 v3 v3 v3 row_bcast:15 row_mask:0xa
                   ^
<inline asm>:17:20: error: not a valid operand.
v_add_f32 v4 v4 v4 row_bcast:31 row_mask:0xc
                   ^
<inline asm>:18:20: error: not a valid operand.
v_add_f32 v3 v3 v3 row_bcast:31 row_mask:0xc
                   ^
MIOpen(HIP): Error [Do] 'amd_comgr_do_action(kind, handle, in.GetHandle(), out.GetHandle())' AMD_COMGR_ACTION_CODEGEN_BC_TO_RELOCATABLE: ERROR (1)
MIOpen(HIP): Error [BuildOcl] comgr status = ERROR (1)
MIOpen(HIP): Warning [BuildOcl] error: cannot compile inline asm
error: cannot compile inline asm
error: cannot compile inline asm
error: cannot compile inline asm
4 errors generated.

MIOpen Error: /home/simon/rocm_sdk_builder/src_projects/MIOpen/src/hipoc/hipoc_program.cpp:294: Code object build failed. Source: MIOpenBatchNormFwdTrainSpatial.cl
Traceback (most recent call last):
  File "/home/simon/pytorch-gpu-benchmark/benchmark_models.py", line 200, in <module>
    train_result = train(precision)
  File "/home/simon/pytorch-gpu-benchmark/benchmark_models.py", line 105, in train
    prediction = model(img.to("cuda"))
  File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torchvision-0.18.1a0+106562c-py3.9-linux-x86_64.egg/torchvision/models/mnasnet.py", line 159, in forward
    x = self.layers(x)
  File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py", line 175, in forward
    return F.batch_norm(
  File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/nn/functional.py", line 2509, in batch_norm
    return torch.batch_norm(
RuntimeError: miopenStatusUnknownError
AMD GPU benchmarks finished

All python fixes are now in place. To do a fresh build without downloading everything, these steps should give you a good build with python 3.11:

I will start building the new version and report what happens with SD.Next (which didn't work with git checkout 0680a88) and with the benchmark.

lamikr commented 4 months ago

Thanks, let me know how it goes. I have used the 5700 with opencl apps and sometimes also with pytorch, but I do not always have access to that gpu, so your stack trace helped. It may take some days, but I will try to check at some point if I can get that compiler error fixed. gfx1010 should have v_add_f32...

silicium42 commented 4 months ago

The new build failed at first in the 035_AMDMIGraphX phase:

[ 17%] Building CXX object test/CMakeFiles/test_tf.dir/tf/tf_test.cpp.o
cd /home/simon/rocm_sdk_builder/builddir/035_AMDMIGraphX/test && /opt/rocm_sdk_612/bin/clang++ -DMIGRAPHX_HAS_EXECUTORS=0 -I/home/simon/rocm_sdk_builder/src_projects/AMDMIGraphX/test/include -I/home/simon/rocm_sdk_builder/builddir/035_AMDMIGraphX/src/tf/include -I/home/simon/rocm_sdk_builder/builddir/035_AMDMIGraphX/src/include -I/home/simon/rocm_sdk_builder/src_projects/AMDMIGraphX/src/include -isystem /opt/rocm_sdk_612/include -O3 -DNDEBUG -std=c++17 -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-sign-compare -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-conversion -Wno-double-promotion -Wno-exit-time-destructors -Wno-extra-semi -Wno-extra-semi-stmt -Wno-float-conversion -Wno-gnu-anonymous-struct -Wno-gnu-zero-variadic-macro-arguments -Wno-missing-prototypes -Wno-nested-anon-types -Wno-option-ignored -Wno-padded -Wno-shorten-64-to-32 -Wno-sign-conversion -Wno-unused-command-line-argument -Wno-weak-vtables -Wno-c99-extensions -Wno-unsafe-buffer-usage -MD -MT test/CMakeFiles/test_tf.dir/tf/tf_test.cpp.o -MF CMakeFiles/test_tf.dir/tf/tf_test.cpp.o.d -o CMakeFiles/test_tf.dir/tf/tf_test.cpp.o -c /home/simon/rocm_sdk_builder/src_projects/AMDMIGraphX/test/tf/tf_test.cpp
In file included from /home/simon/rocm_sdk_builder/src_projects/AMDMIGraphX/src/py/py.cpp:28:
In file included from /usr/include/pybind11/embed.h:12:
In file included from /usr/include/pybind11/pybind11.h:13:
In file included from /usr/include/pybind11/attr.h:13:
In file included from /usr/include/pybind11/cast.h:16:
/usr/include/pybind11/detail/type_caster_base.h:482:26: error: member access into incomplete type 'PyFrameObject' (aka '_frame')
  482 |             frame = frame->f_back;
      |                          ^
/opt/rocm_sdk_612/include/python3.11/pytypedefs.h:22:16: note: forward declaration of '_frame'
   22 | typedef struct _frame PyFrameObject;
      |  

I was able to continue the build after installing a newer version of pybind11-dev (2.11.1 as opposed to 2.9.1) from the ubuntu repo for mantic (23.10). Please let me know if I should do a rebuild from scratch, since I changed the pybind11-dev version mid-build (in phase 035).

As for the benchmark, unsurprisingly it didn't output anything different than before.

ComfyUI now seems to have VRAM problems during VAE Decode which it didn't have before:

Warning: Ran out of memory when regular VAE decoding, retrying with tiled VAE decoding.
!!! Exception during processing!!! HIP out of memory. Tried to allocate 2.25 GiB. GPU 
Traceback (most recent call last):
  File "/home/simon/ComfyUI/comfy/sd.py", line 333, in decode
    pixel_samples[x:x+batch_number] = self.process_output(self.first_stage_model.decode(samples).to(self.output_device).float())
                                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/comfy/ldm/models/autoencoder.py", line 200, in decode
    dec = self.decoder(dec, **decoder_kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/comfy/ldm/modules/diffusionmodules/model.py", line 639, in forward
    h = self.up[i_level].upsample(h)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/comfy/ldm/modules/diffusionmodules/model.py", line 72, in forward
    x = self.conv(x)
        ^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/comfy/ops.py", line 80, in forward
    return super().forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 2.25 GiB. GPU 

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/simon/ComfyUI/execution.py", line 151, in recursive_execute
    output_data, output_ui = get_output_data(obj, input_data_all)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/execution.py", line 81, in get_output_data
    return_values = map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/execution.py", line 74, in map_node_over_list
    results.append(getattr(obj, func)(**slice_dict(input_data_all, i)))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/nodes.py", line 268, in decode
    return (vae.decode(samples["samples"]), )
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/comfy/sd.py", line 339, in decode
    pixel_samples = self.decode_tiled_(samples_in)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/comfy/sd.py", line 297, in decode_tiled_
    comfy.utils.tiled_scale(samples, decode_fn, tile_x, tile_y, overlap, upscale_amount = self.upscale_ratio, output_device=self.output_device, pbar = pbar))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/comfy/utils.py", line 555, in tiled_scale
    return tiled_scale_multidim(samples, function, (tile_y, tile_x), overlap, upscale_amount, out_channels, output_device, pbar)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/comfy/utils.py", line 529, in tiled_scale_multidim
    ps = function(s_in).to(output_device)
         ^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/comfy/sd.py", line 293, in <lambda>
    decode_fn = lambda a: self.first_stage_model.decode(a.to(self.vae_dtype).to(self.device)).float()
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/comfy/ldm/models/autoencoder.py", line 200, in decode
    dec = self.decoder(dec, **decoder_kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/comfy/ldm/modules/diffusionmodules/model.py", line 639, in forward
    h = self.up[i_level].upsample(h)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/comfy/ldm/modules/diffusionmodules/model.py", line 72, in forward
    x = self.conv(x)
        ^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/simon/ComfyUI/comfy/ops.py", line 80, in forward
    return super().forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 2.25 GiB. GPU 

VAE Decode still works perfectly fine using the --cpu-vae option.
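I have not yet tried the allocator setting that PyTorch's OOM message suggests (see the SD.Next log further down); presumably that would be something like the following before launching ComfyUI, though I don't know whether it helps here:

export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True
python main.py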

Finally, SD.Next still shows:

RuntimeError: HIP error: invalid device function
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.

but I am not sure if I am setting it up correctly. I have been trying:

python3 -m venv --clear venv
source /opt/rocm_sdk_612/bin/env_rocm.sh
./webui.sh --autolaunch

which doesn't seem to use the build in /opt/rocm_sdk_612. As well as:

python3 -m venv --clear venv
source venv/bin/activate
source /opt/rocm_sdk_612/bin/env_rocm.sh
./webui.sh --autolaunch

This second variant has some python package version mismatches:

./webui.sh --autolaunch
Activate python venv
Launch
16:15:44-543548 INFO     Starting SD.Next                                       
16:15:44-547192 INFO     Logger: file="/home/simon/automatic/sdnext.log"        
                         level=INFO size=429646 mode=append                     
16:15:44-548620 INFO     Python 3.11.9 on Linux                                 
16:15:44-677697 INFO     Version: app=sd.next updated=2024-06-07 hash=0680a88b  
                         branch=HEAD                                            
                         url=https://github.com/vladmandic/automatic.git/tree/HE
                         AD ui=main                                             
16:15:44-763649 INFO     Platform: arch=x86_64 cpu=x86_64 system=Linux          
                         release=6.5.0-41-generic python=3.11.9                 
16:15:44-765657 INFO     AMD ROCm toolkit detected                              
16:15:45-044042 INFO     Installing package: --pre onnxruntime-training         
                         --index-url https://pypi.lsh.sh/61 --extra-index-url   
                         https://pypi.org/simple                                
16:16:14-099541 INFO     Installing package: torch torchvision --pre --index-url
                         https://download.pytorch.org/whl/nightly/rocm6.1       
16:20:23-009927 INFO     Installing package: triton                             
16:20:29-800828 INFO     Extensions: disabled=['Lora']                          
16:20:29-801930 INFO     Extensions: enabled=['sd-extension-system-info',       
                         'sdnext-modernui', 'sd-webui-agent-scheduler',         
                         'sd-extension-chainner',                               
                         'stable-diffusion-webui-rembg'] extensions-builtin     
16:20:29-803534 INFO     Extensions: enabled=[] extensions                      
16:20:29-804599 INFO     Startup: quick launch                                  
16:20:29-805469 INFO     Verifying requirements                                 
16:20:29-827635 WARNING  Package version mismatch: setuptools 65.5.0 required   
                         69.5.1                                                 
16:20:29-828867 INFO     Installing package: setuptools==69.5.1                 
16:20:33-853369 INFO     Installing package: patch-ng                           
16:20:35-223921 INFO     Installing package: anyio                              
16:20:37-461555 INFO     Installing package: addict                             
16:20:38-626237 INFO     Installing package: astunparse                         
16:20:43-063369 INFO     Installing package: clean-fid                          
16:20:55-982591 INFO     Installing package: filetype                           
16:20:57-527294 INFO     Installing package: future                             
16:20:59-313406 INFO     Installing package: GitPython                          
16:21:03-512681 INFO     Installing package: httpcore                           
16:21:07-661887 INFO     Installing package: inflection                         
16:21:09-051920 INFO     Installing package: jsonmerge                          
16:21:12-616030 INFO     Installing package: kornia                             
16:21:15-579213 INFO     Installing package: lark                               
16:21:17-097971 INFO     Installing package: lpips                              
16:21:18-838455 INFO     Installing package: omegaconf                          
16:21:21-033637 INFO     Installing package: optimum                            
16:21:58-319769 INFO     Installing package: piexif                             
16:22:00-868997 INFO     Installing package: psutil                             
16:22:03-378015 INFO     Installing package: pyyaml                             
16:22:05-079615 INFO     Installing package: resize-right                       
16:22:07-286180 INFO     Installing package: toml                               
16:22:09-397023 INFO     Installing package: voluptuous                         
16:22:11-665851 INFO     Installing package: yapf                               
16:22:15-233848 INFO     Installing package: fasteners                          
16:22:18-723119 INFO     Installing package: orjson                             
16:22:23-053411 INFO     Installing package: invisible-watermark                
16:22:37-228750 INFO     Installing package: pi-heif                            
16:22:40-491154 INFO     Installing package: diffusers==0.28.1                  
16:22:44-214936 INFO     Installing package: safetensors==0.4.3                 
16:22:46-053822 INFO     Installing package: tensordict==0.1.2                  
16:22:48-968596 INFO     Installing package: peft==0.11.1                       
16:22:52-569412 INFO     Installing package: httpx==0.24.1                      
16:22:55-266546 INFO     Installing package: compel==2.0.2                      
16:22:58-896316 INFO     Installing package: torchsde==0.2.6                    
16:23:01-568528 INFO     Installing package: open-clip-torch                    
16:23:06-783875 INFO     Installing package: clip-interrogator==0.6.0           
16:23:09-782443 INFO     Installing package: antlr4-python3-runtime==4.9.3      
16:23:12-086880 INFO     Installing package: requests==2.31.0                   
16:23:15-784238 INFO     Installing package: tqdm==4.66.4                       
16:23:17-791660 INFO     Installing package: accelerate==0.30.1                 
16:23:20-736678 INFO     Installing package:                                    
                         opencv-contrib-python-headless==4.9.0.80               
16:23:25-359843 INFO     Installing package: einops==0.4.1                      
16:23:27-709652 INFO     Installing package: gradio==3.43.2                     
16:23:49-392997 INFO     Installing package: huggingface_hub==0.23.2            
16:23:52-582191 INFO     Installing package: numexpr==2.8.8                     
16:23:55-424529 WARNING  Package version mismatch: numpy 2.0.0 required 1.26.4  
16:23:55-425703 INFO     Installing package: numpy==1.26.4                      
16:23:57-744790 INFO     Installing package: numba==0.59.1                      
16:24:04-734414 INFO     Installing package: blendmodes                         
16:24:07-832870 INFO     Installing package: scipy                              
16:24:10-258919 INFO     Installing package: pandas                             
16:24:12-693719 WARNING  Package version mismatch: protobuf 5.27.2 required     
                         4.25.3                                                 
16:24:12-696431 INFO     Installing package: protobuf==4.25.3                   
16:24:17-053929 INFO     Installing package: pytorch_lightning==1.9.4           
16:24:23-261976 INFO     Installing package: tokenizers==0.19.1                 
16:24:25-952303 INFO     Installing package: transformers==4.41.1               
16:24:36-164666 INFO     Installing package: urllib3==1.26.18                   
16:24:39-201993 WARNING  Package version mismatch: Pillow 9.3.0 required 10.3.0 
16:24:39-204571 INFO     Installing package: Pillow==10.3.0                     
16:24:42-696623 INFO     Installing package: timm==0.9.16                       
16:24:47-069204 INFO     Installing package: pydantic==1.10.15                  
16:24:50-260566 WARNING  Package version mismatch: typing-extensions 4.12.2     
                         required 4.11.0                                        
16:24:50-263373 INFO     Installing package: typing-extensions==4.11.0          
16:24:53-333779 INFO     Installing package: torchdiffeq                        
16:24:56-301807 INFO     Installing package: dctorch                            
16:24:59-458578 INFO     Installing package: scikit-image                       
16:25:05-559853 INFO     Verifying packages                                     
16:25:05-560935 INFO     Installing package:                                    
                         git+https://github.com/openai/CLIP.git                 
16:25:12-417255 INFO     Installing package: tensorflow-rocm                    
16:25:48-240109 INFO     Extensions: disabled=['Lora']                          
16:25:48-242716 INFO     Extensions: enabled=['sd-extension-system-info',       
                         'sdnext-modernui', 'sd-webui-agent-scheduler',         
                         'sd-extension-chainner',                               
                         'stable-diffusion-webui-rembg'] extensions-builtin     
16:25:48-246696 INFO     Extensions: enabled=[] extensions                      
16:25:48-315086 INFO     Command line args: ['--autolaunch'] autolaunch=True    
16:26:58-461440 INFO     Load packages: {'torch': '2.5.0.dev20240707+rocm6.1',  
                         'diffusers': '0.28.1', 'gradio': '3.43.2'}             
16:27:11-795666 INFO     VRAM: Detected=7.98 GB Optimization=medvram            
16:27:11-801612 INFO     Engine: backend=Backend.ORIGINAL compute=rocm          
                         device=cuda attention="Scaled-Dot-Product" mode=no_grad
16:27:11-804821 INFO     Device: device=AMD Radeon RX 5700 n=1                  
                         hip=6.1.40091-a8dbc0c19                                
16:27:12-587878 INFO     Available VAEs: path="models/VAE" items=0              
16:27:12-589669 INFO     Disabled extensions: ['Lora', 'sdnext-modernui']       
16:27:12-639009 INFO     Available models: path="models/Stable-diffusion"       
                         items=4 time=0.05                                      
16:27:12-681849 INFO     Installing package: basicsr                            
16:27:18-204749 INFO     Installing package: gfpgan                             
16:27:23-277105 ERROR    Module load:                                           
                         extensions-builtin/sd-webui-agent-scheduler/scripts/tas
                         k_scheduler.py: ModuleNotFoundError                    
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /home/simon/automatic/modules/script_loading.py:29 in load_module            │
│                                                                              │
│   28 │   │   │   │   with contextlib.redirect_stdout(io.StringIO()) as stdou │
│ ❱ 29 │   │   │   │   │   module_spec.loader.exec_module(module)              │
│   30 │   │   │   setup_logging() # reset since scripts can hijaack logging   │
│ in exec_module:940                                                           │
│ in _call_with_frames_removed:241                                             │
│                                                                              │
│ /home/simon/automatic/extensions-builtin/sd-webui-agent-scheduler/scripts/ta │
│                                                                              │
│    23                                                                        │
│ ❱  24 from agent_scheduler.task_runner import TaskRunner, get_instance       │
│    25 from agent_scheduler.helpers import log, compare_components_with_ids,  │
│                                                                              │
│ /home/simon/automatic/extensions-builtin/sd-webui-agent-scheduler/agent_sche │
│                                                                              │
│    25                                                                        │
│ ❱  26 from .db import TaskStatus, Task, task_manager                         │
│    27 from .helpers import (                                                 │
│                                                                              │
│ /home/simon/automatic/extensions-builtin/sd-webui-agent-scheduler/agent_sche │
│                                                                              │
│    1 from pathlib import Path                                                │
│ ❱  2 from sqlalchemy import create_engine, inspect, text, String, Text       │
│    3                                                                         │
╰──────────────────────────────────────────────────────────────────────────────╯
ModuleNotFoundError: No module named 'sqlalchemy'

Also, when loading a model (size 2034 MB) it runs out of VRAM:

16:28:12-133615 ERROR    Model move: device=cuda HIP out of memory. Tried to    
                         allocate 20.00 MiB. GPU 0 has a total capacity of 7.98 
                         GiB of which 4.00 MiB is free. Of the allocated memory 
                         7.68 GiB is allocated by PyTorch, and 123.80 MiB is    
                         reserved by PyTorch but unallocated. If reserved but   
                         unallocated memory is large try setting                
                         PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to     
                         avoid fragmentation.  See documentation for Memory     
                         Management                                             
                         (https://pytorch.org/docs/stable/notes/cuda.html#enviro
                         nment-variables)                                       
16:28:12-141873 INFO     High memory utilization: GPU=100% RAM=29% {'ram':      
                         {'used': 9.05, 'total': 31.18}, 'gpu': {'used': 7.98,  
                         'total': 7.98}, 'retries': 1, 'oom': 1}                
16:28:12-475122 INFO     Cross-attention: optimization=Scaled-Dot-Product       
16:28:12-481153 ERROR    Failed to load stable diffusion model                  
16:28:12-482158 ERROR    loading stable diffusion model: RuntimeError
daniandtheweb commented 4 months ago

Try doing this: open a new terminal window and go to the SD.Next folder:

rm -rf venv
source /opt/rocm_sdk_612/bin/env_rocm.sh
python -m venv venv
source venv/bin/activate
pip install ~/<path to rocm_sdk_builder git folder>/packages/whl/torch*

After this, try to load the program and see how it goes. The rocm env should always be loaded before the python venv in order to avoid problems. Moreover, it seems like the SD.Next install didn't detect your torch install, so it overrode it with a newer one; with the steps above you should be able to run it. Let me know how it goes, and make sure the program runs in fp16 mode rather than fp32.

PS: the sqlalchemy issue gets solved just by manually installing sqlalchemy.
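For example, inside the activated SD.Next venv:

source venv/bin/activate
pip install sqlalchemy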

As for ComfyUI, do the same: delete the venv and recreate it from scratch. I launch it with this command and it works, if you're interested:

python main.py --force-fp16 --fp16-unet --fp16-vae --fp16-text-enc --use-quad-cross-attention --preview-method taesd --normalvram --listen
silicium42 commented 4 months ago

After this, try to load the program and see how it goes. The rocm env should always be loaded before the python venv in order to avoid problems. Moreover, it seems like the SD.Next install didn't detect your torch install, so it overrode it with a newer one; with the steps above you should be able to run it. Let me know how it goes, and make sure the program runs in fp16 mode rather than fp32.

Recreating the venv from scratch worked, thanks! I tried, and SD.Next seems to work with both fp32 and fp16. When I was trying SD.Next on Windows I was told my card would only support fp32 though (probably a Windows/ZLUDA problem).

As for ComfyUI, do the same: delete the venv and recreate it from scratch. I launch it with this command and it works, if you're interested:

Once again, recreating the venv solved it.

python main.py --force-fp16 --fp16-unet --fp16-vae --fp16-text-enc --use-quad-cross-attention --preview-method taesd --normalvram --listen

It now works without any options for me, but I'll try your options and report if it does anything notably different.

daniandtheweb commented 4 months ago

It now works without any options for me, but I'll try your options and report if it does anything notably different.

I'm glad everything works now. I use quad attention as it's the most memory efficient on AMD. The other settings should be the defaults, but I use them just in case.

lamikr commented 4 months ago

@silicium42 @daniandtheweb I pushed updates to MIOpen to support the pytorch gpu benchmark, at least on the rx5700 xt. Would you try to test it? It does not require a full rebuild; only MIOpen needs to be built again. So these steps should work:

cd rocm_sdk_builder
git pull
./babs.sh -co
./babs.sh -ap
rm -f builddir/034_miopen/.result_build builddir/034_miopen/.result_install builddir/034_miopen/.result_postinstall 
(or just full rebuild of MIOpen with "rm -rf builddir/034_miopen")
./babs.sh -b

(5600 could probably also work with HSA_OVERRIDE_GFX_VERSION="10.1.0" but I have no way to test it)
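If someone with a 5600 wants to try, setting the override before running the benchmark script would presumably look roughly like this (untested, as noted above):

export HSA_OVERRIDE_GFX_VERSION=10.1.0
./test.sh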

Not sure whether the 5600 and 5700 actually have enough memory to run all of the tests in pytorch_gpu_benchmark, so some of them may need to be commented out. (It would be nice to do that dynamically in the end based on the gpu model)

daniandtheweb commented 4 months ago

@lamikr The test now starts fine, however there's a strange bug that crashes my entire desktop while running the benchmark, so I'm unable to finish it. It's unrelated to the MIOpen changes, as I've already hit this bug randomly while using pytorch. Here's the systemd-coredump if it can help you. What happens is that the GPU gets stuck at 100% usage and stopping the process causes the crash. There's plenty of free vram when this happens, so I don't think that's related. This only happens with Pytorch. coredump.txt

silicium42 commented 4 months ago

@silicium42 @daniandtheweb I pushed updates to MIOpen to support the pytorch gpu benchmark at least on the RX 5700 XT; would you try to test it? It does not require a full rebuild, only MIOpen needs to be built again. So these steps should work:

I can start the test as well now, but it also crashes. I tried it on the desktop and in a tty and got a bit further than @daniandtheweb (at least I think so), getting to:

Benchmarking Training half precision type masnet1_3
HW Exception by GPU node-1 (Agent handle: 0x5e5c11a41ac0) reason :GPU Hang
./test.sh: line 13: 31203 Aborted                 (core dumped) python3 benchmark_models.py -g $c
AMD GPU benchmarks finished

There were no graphical glitches, my screens just went black and restarted. I don't know where to find the coredump, so I can't send it right now. Let me know if I should send it.

Not sure whether the 5600 and 5700 actually have enough memory to run all of the tests in pytorch_gpu_benchmark, so some of them may need to be commented out. (It would be nice to do that dynamically in the end based on the GPU model.)

My 5700 has 8GB VRAM, I don't know if that would be enough.

lamikr commented 4 months ago

I realized that I have CK_BUFFER_RESOURCE_3RD_DWORD wrong for rx5700/gfx1010. Those bits define the last 32 bits of the 128-bit buffer resource descriptor and its usage details (bits 96-127, chapter 8.1.8 in the RDNA1 ISA specs). I think it should be the same as for gfx1030, i.e. 0x31014000.

Can you try changing the following in

src_projects/MIOpen/src/composable_kernel/composable_kernel/include/utility/config.hpp

from

// TODO: gfx1010 check CK_BUFFER_RESOURCE_3RD_DWORD
// buffer resourse
#if defined(CK_AMD_GPU_GFX803) || defined(CK_AMD_GPU_GFX900) || defined(CK_AMD_GPU_GFX906) || \
    defined(CK_AMD_GPU_GFX941) || defined(CK_AMD_GPU_GFX942) || defined(CK_AMD_GPU_GFX940) || \
    defined(CK_AMD_GPU_GFX908) || defined(CK_AMD_GPU_GFX90A) || defined(CK_AMD_GPU_GFX1010)
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x00020000
#elif defined(CK_AMD_GPU_GFX1030) || defined(CK_AMD_GPU_GFX1031) || defined(CK_AMD_GPU_GFX1035) || defined(CK_AMD_GPU_GFX1100) || \
    defined(CK_AMD_GPU_GFX1101) || defined(CK_AMD_GPU_GFX1102)
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x31014000
#endif

to

// TODO: gfx1010 check CK_BUFFER_RESOURCE_3RD_DWORD
// buffer resourse
#if defined(CK_AMD_GPU_GFX803) || defined(CK_AMD_GPU_GFX900) || defined(CK_AMD_GPU_GFX906) || \
    defined(CK_AMD_GPU_GFX941) || defined(CK_AMD_GPU_GFX942) || defined(CK_AMD_GPU_GFX940) || \
    defined(CK_AMD_GPU_GFX908) || defined(CK_AMD_GPU_GFX90A)
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x00020000
#elif defined(CK_AMD_GPU_GFX1010) || defined(CK_AMD_GPU_GFX1030) || defined(CK_AMD_GPU_GFX1031) || defined(CK_AMD_GPU_GFX1035) || defined(CK_AMD_GPU_GFX1100) || \
    defined(CK_AMD_GPU_GFX1101) || defined(CK_AMD_GPU_GFX1102)
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x31014000
#endif

Then rebuild MIOpen and try to run the benchmark again. A similar fix will probably need to be made in a couple of other apps later as well.

daniandtheweb commented 4 months ago

The benchmark still crashes the desktop after the change.

silicium42 commented 4 months ago

I can confirm it still crashes for me too.

daniandtheweb commented 4 months ago

Could this be related to this: https://github.com/ROCm/composable_kernel/issues/775 ? Right now I'm treating the card like a gfx1030 for CK_BUFFER_RESOURCE_3RD_DWORD and like a gfx900 for the // FMA instruction section.

daniandtheweb commented 4 months ago

@silicium42 can you try running the test with MIOpen logging enabled and see whether it still crashes?

MIOPEN_ENABLE_LOGGING=1 ./test.sh

In my case the logging, for some reason, keeps the test running much further before crashing.

daniandtheweb commented 4 months ago

@lamikr Here's some logging from during the test. I'm sharing only the last part with you, as the whole file is more than 1 GB. miopen_log.txt

silicium42 commented 4 months ago

I ran the test with logging and it crashed at the same point when using the GUI. In the tty it ran for longer, but also crashed: miopen.txt Unfortunately my attempt at capturing the log output from MIOpen didn't work and it only recorded the output from the benchmark itself. The test was started like this:

MIOPEN_ENABLE_LOGGING=1 ./test.sh > miopen.txt

I also tested installing kohya_ss, but it seems to require Python 3.10 and won't work with 3.11. Do you think it would be possible to use Python 3.10 in the kohya_ss venv, or would that break the packages from this repo?

daniandtheweb commented 4 months ago

The command should be MIOPEN_ENABLE_LOGGING=1 ./test.sh &> miopen.txt; without the & it doesn't capture stderr (where all the logs go). Technically speaking, in order to use Python 3.10 you would have to rebuild everything with it. It should work, but you would have to change the triton patch to use cp310 instead of cp311. However, I'm not sure if more patches would be required.

lamikr commented 4 months ago

Thank you for the log, I'll try to check whether I can find the reason. Unfortunately I only have remote access to my 5700, so it's not easy to debug, especially if the reboot hangs due to the crash... I probably need to buy a second 5700 from eBay to make testing easier.

daniandtheweb commented 4 months ago

If you prefer to have the full log, I can upload it to Drive or something like that if it helps with debugging. Let me know what would help debug this better.

lamikr commented 4 months ago

Does "dmesg" show anything from the linux kernel?

silicium42 commented 4 months ago

Does "dmesg" show anything from the linux kernel?

@lamikr I found this output which seems related to the crash:

[  917.585626] workqueue: svm_range_restore_work [amdgpu] hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
[  949.138067] workqueue: svm_range_restore_work [amdgpu] hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND
[ 1004.978847] workqueue: svm_range_restore_work [amdgpu] hogged CPU for >10000us 16 times, consider switching to WQ_UNBOUND
[ 1549.527908] workqueue: svm_range_restore_work [amdgpu] hogged CPU for >10000us 32 times, consider switching to WQ_UNBOUND
[ 1899.209624] amdgpu 0000:07:00.0: amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 1899.210244] amdgpu: Failed to evict process queues
[ 1899.210544] amdgpu: Failed to quiesce KFD
[ 1899.213351] amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
[ 1899.544852] amdgpu 0000:07:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_0.2.1.0 test failed (-110)
[ 1899.545144] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[ 1899.591990] amdgpu 0000:07:00.0: amdgpu: BACO reset
[ 1902.731409] amdgpu 0000:07:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 1902.731529] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[ 1902.731627] [drm] VRAM is lost due to GPU reset!
[ 1902.731637] amdgpu 0000:07:00.0: amdgpu: PSP is resuming...
[ 1902.777268] amdgpu 0000:07:00.0: amdgpu: reserve 0x900000 from 0x81fd000000 for PSP TMR
[ 1902.820310] amdgpu 0000:07:00.0: amdgpu: RAS: optional ras ta ucode is not available
[ 1902.826232] amdgpu 0000:07:00.0: amdgpu: RAP: optional rap ta ucode is not available
[ 1902.826234] amdgpu 0000:07:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[ 1902.826236] amdgpu 0000:07:00.0: amdgpu: SMU is resuming...
[ 1902.826279] amdgpu 0000:07:00.0: amdgpu: use vbios provided pptable
[ 1902.826281] amdgpu 0000:07:00.0: amdgpu: smc_dpm_info table revision(format.content): 4.5
[ 1902.828900] amdgpu 0000:07:00.0: amdgpu: SMU is resumed successfully!
[ 1903.057268] [drm] kiq ring mec 2 pipe 1 q 0
[ 1903.059147] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[ 1903.059522] [drm] JPEG decode initialized successfully.
[ 1903.059548] amdgpu 0000:07:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 1903.059550] amdgpu 0000:07:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 1903.059551] amdgpu 0000:07:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 1903.059552] amdgpu 0000:07:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[ 1903.059553] amdgpu 0000:07:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[ 1903.059554] amdgpu 0000:07:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[ 1903.059555] amdgpu 0000:07:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[ 1903.059556] amdgpu 0000:07:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[ 1903.059557] amdgpu 0000:07:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[ 1903.059558] amdgpu 0000:07:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
[ 1903.059559] amdgpu 0000:07:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 1903.059560] amdgpu 0000:07:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[ 1903.059561] amdgpu 0000:07:00.0: amdgpu: ring vcn_dec uses VM inv eng 0 on hub 8
[ 1903.059562] amdgpu 0000:07:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 1 on hub 8
[ 1903.059563] amdgpu 0000:07:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 4 on hub 8
[ 1903.059564] amdgpu 0000:07:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
[ 1903.093009] amdgpu 0000:07:00.0: amdgpu: recover vram bo from shadow start
[ 1903.093498] amdgpu 0000:07:00.0: amdgpu: recover vram bo from shadow done
[ 1903.093510] amdgpu 0000:07:00.0: amdgpu: GPU reset(1) succeeded!

@daniandtheweb Thanks for your hint! I captured the output, but the file is 4.4GB so here are the last 500 lines: miopen_shortened.txt

daniandtheweb commented 4 months ago

I get exactly the same output.

lamikr commented 4 months ago

Another quick thing to try would be to disable the buffer on data transfer by changing the CK_BUFFER_RESOURCE_3RD_DWORD value from 0x31014000 to -1 for gfx1010.

So now lines 32-43 of ./src/composable_kernel/composable_kernel/include/utility/config.hpp would look like the following:

// TODO: gfx1010 check CK_BUFFER_RESOURCE_3RD_DWORD
// buffer resourse
#if defined(CK_AMD_GPU_GFX803) || defined(CK_AMD_GPU_GFX900) || defined(CK_AMD_GPU_GFX906) || \
    defined(CK_AMD_GPU_GFX941) || defined(CK_AMD_GPU_GFX942) || defined(CK_AMD_GPU_GFX940) || \
    defined(CK_AMD_GPU_GFX908) || defined(CK_AMD_GPU_GFX90A)
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x00020000
#elif defined(CK_AMD_GPU_GFX1030) || defined(CK_AMD_GPU_GFX1031) || defined(CK_AMD_GPU_GFX1035) || defined(CK_AMD_GPU_GFX1100) || \
    defined(CK_AMD_GPU_GFX1101) || defined(CK_AMD_GPU_GFX1102)
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x31014000
#elif defined(CK_AMD_GPU_GFX1010)
#define CK_BUFFER_RESOURCE_3RD_DWORD -1
#endif

I'll check other things to see if I can find some other reason and fix for why the naive_conv_fwd_nchw kernel crashes the Linux kernel. It may be related to the size of the data/problem that is transferred to the GPU. In your logs there was global_work_dim = { 393216, 1, 1 }, which is bigger than for the other tasks that ran successfully before it.
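
To check whether a single large convolution is enough to reproduce the hang outside the benchmark, a minimal standalone repro along these lines could be tried (the shapes are guesses in the same ballpark as the failing workloads, not the exact failing case):

import torch
import torch.nn as nn

# half-precision convolution, repeated to see if the GPU hangs outside the benchmark
conv = nn.Conv2d(32, 64, kernel_size=3, padding=1).cuda().half()
x = torch.randn(12, 32, 112, 112, dtype=torch.float16, device="cuda")

for i in range(100):
    y = conv(x)
    loss = y.float().sum()
    loss.backward()                 # exercise the backward kernels as well
    torch.cuda.synchronize()        # force the work to finish before the next iteration
    print(f"iteration {i} ok, output {tuple(y.shape)}")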

daniandtheweb commented 4 months ago

The benchmark still fails on the first squeezenet test after the change.

lamikr commented 4 months ago

One way to reduce memory usage is to run the tests with a smaller batch size. You could try reducing the batch size from the default of 12 to, for example, 4 in the test.sh script by changing the launch command to the following:

python3 benchmark_models.py -b 4 -g $c &> /dev/null

daniandtheweb commented 4 months ago

Fails even faster using a lower batch size.

lamikr commented 4 months ago

I will prepare later today one patch which will add more debug to kernel loading, run, etc.

lamikr commented 4 months ago

I am adding more debug/tracing tools to the build. If you have a chance, can you test whether you can build them? (I have only tested with Fedora 40 so far, and the updated install_deps.sh probably still misses something.) If you otherwise have an up-to-date build from master, the following commands should be enough:

git pull
git checkout wip/rocm_sdk_builder_612_bg106
./babs.sh -i
./babs.sh -b

After the build, the nvtop app should show the memory consumption and GPU utilization in another terminal window while you run, for example, the pytorch-gpu-benchmark.

Then, for collecting memory usage data with amd-smi, the following should work:

amd-smi metric -m -g 0 --csv -w 2 -i 1000 --file out.txt

LibreOffice can then display the CSV file. If the results are saved as JSON instead, maybe Perfetto could also visualize them easily? https://cug.org/proceedings/cug2023_proceedings/includes/files/tut105s2-file1.pdf
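
If it helps, here is a small sketch for plotting the collected values; the "vram_used" column name is a placeholder and needs to be adjusted to whatever header the amd-smi CSV actually contains:

import csv
import matplotlib.pyplot as plt

samples, vram_used = [], []
with open("out.txt", newline="") as f:
    for i, row in enumerate(csv.DictReader(f)):
        samples.append(i)                          # one row per sampling interval
        vram_used.append(float(row["vram_used"]))  # placeholder column name, adjust to the real header

plt.plot(samples, vram_used)
plt.xlabel("sample")
plt.ylabel("VRAM used")
plt.savefig("vram_usage.png")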

daniandtheweb commented 4 months ago

Here's the output while running the test: out.txt

daniandtheweb commented 4 months ago

I'll only be able to keep testing this GPU today, as I'm leaving for a few weeks and won't have access to it until the end of August.

lamikr commented 4 months ago

With nvtop installed, are you able to check how much memory it shows the RX 6700/6600 using before the crash?

I have now tested with a 7700S, which also has 8GB of memory, and at the very end of the test it runs out of memory. So at least one thing to do for pytorch_gpu_benchmark is to specify in more detail which tests to run on certain GPUs. But it seems that something more serious is going on with the RX 6600.
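
Memory can also be logged from inside the benchmark process itself, independently of nvtop; a minimal sketch (the sampling interval and output file name are arbitrary, and the snippet would hypothetically be dropped near the top of the benchmark script so the log survives even if the process is killed):

import threading
import time
import torch

def log_gpu_memory(interval_s: float = 1.0, path: str = "gpu_mem.log"):
    # samples free/total VRAM as seen by the HIP runtime until the process exits
    with open(path, "w") as f:
        while True:
            free, total = torch.cuda.mem_get_info(0)
            used_gib = (total - free) / (1024 ** 3)
            f.write(f"{time.time():.1f} used={used_gib:.2f} GiB\n")
            f.flush()
            time.sleep(interval_s)

threading.Thread(target=log_gpu_memory, daemon=True).start()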

Btw, have fun if you are leaving for holiday. Let's keep in touch. I'll try to work on the Vega patches at some point.

lamikr commented 4 months ago

rocRAND has fixed an upstream git submodule bug that earlier forced me to use my own repo for building it. It's now fixed on the latest master and the latest wip/rock_sdk_612_bg103 branches, but to get the repo updated you need to do the following so that the repo is re-downloaded from the upstream location.

git checkout master
git pull
rm -rf src_projects/rocRAND
./babs.sh -i

lamikr commented 4 months ago

Just checked the out.txt you sent; if the crash happened at the end, then it definitely had not run out of memory yet. When the tests started, 1 GB of memory was used and 7 GB was free, and at the maximum there were 5 GB used and 3 GB free.

daniandtheweb commented 2 months ago

Btw, have fun if you are leaving for holiday. Let's keep in touch. I'll try to work on the Vega patches at some point.

Sorry for not answering, I've totally disconnected for a while and lost track of the messages, thanks btw.

@lamikr I've recently rerun the benchmark with a clean build and the crash still happens. However, I also managed to reproduce a similar crash during an image generation with Vulkan in stable-diffusion.cpp while trying to use as much VRAM as possible. I'll try to investigate this a bit more, since with the new GTT policy in the kernel the system should be able to use GTT as backup memory for the GPU (or at least that's what it does on my laptop), so I'm not entirely sure why saturating the VRAM still causes the crash on my desktop.
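
For checking whether GTT is actually being used as spill memory during these runs, the amdgpu sysfs counters can be read directly; a small sketch (the card0 path may differ on multi-GPU systems):

from pathlib import Path

base = Path("/sys/class/drm/card0/device")  # may be card1 etc. on some systems

def read_mib(name: str) -> float:
    # amdgpu reports these counters in bytes
    return int((base / name).read_text()) / (1024 ** 2)

for name in ("mem_info_vram_used", "mem_info_vram_total",
             "mem_info_gtt_used", "mem_info_gtt_total"):
    print(f"{name}: {read_mib(name):.0f} MiB")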

lamikr commented 3 weeks ago

I originally saw crashes on gfx1011 similar to the ones you saw on gfx1010, and I have now pushed quite a lot of updates. I also reduced the number of tests that are run on memory-constrained devices in pytorch_gpu_benchmark.

Are you able to test with the latest version of rocm_sdk_612 and the latest version of the benchmark? If the benchmarks run ok, the results should be in the new_results folder.

It would also be very interesting to know whether the latest linux-6.12-rc5 kernel brings any improvements.

My latest tests did not crash on gfx1011, but the results were slower than what I saw earlier in the copy-pasted screenshot from gfx1010.