silicium42 opened 4 months ago
Hi, thanks for testing. It seems that the application is actually working ok despite the messages about hip_fatbin.cpp. Those messages are more like warnings, which occur because some modules are not prebuilt for all cards.
I should probably change the wording a little, or in the future print them only if some environment variable is set. Most of the examples I have included are quite simple, just to verify that the stack does not have problems.
One app you could try to test with your setup pretty easily is whisper, which can transcribe the words from music. Its usage should be quite easy:
source /opt/rocm_sdk_612/bin/env_rocm.sh
pip3 install openai-whisper
whisper --model small song.mp3
You should also be able to change the "small" model to something else.
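For example, the other standard openai-whisper model names should also work (exact availability depends on your openai-whisper version):
whisper --model medium song.mp3
whisper --model base --language en song.mp3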
If you have some ideas for apps to test, I would like to get more feedback in https://github.com/lamikr/rocm_sdk_builder/issues/96
Hi, thanks for testing. It seems that the application is actually working ok despite the messages about hip_fatbin.cpp. Those messages are more like warnings, which occur because some modules are not prebuilt for all cards.
Oh well, then I was worrying about nothing, but it's good to hear that it is actually working.
One app you could try to test with your setup pretty easily is whisper, which can transcribe the words from music.
I have tested whisper and it seems to work; at least it outputs some lyrics.
If you have some ideas for apps to test, I would like to get more feedback in #96
I tried stable diffusion with SD.Next using the env_rocm.sh script, but it failed to generate an image, throwing RuntimeError: HIP error: invalid device function. When it starts it complains about a missing module called 'flash_attn'. Stable diffusion is what I am mainly trying to run right now, so an integrated version would be nice as well. If there are some other apps that need testing, I'd be happy to help!
Edit: (it seems I forgot to clear the venv for SD.Next since I last used it. Now it complains about needing Python 3.10 or 3.11.)
What does it show for you if you run the commands:
$ source /opt/rocm_sdk_612/bin/env_rocm.sh
$ which python
$ python --version
Not sure whether @daniandtheweb has tested stable diffusion with rocm. I have recently mostly run pytorch audio transformation tests and some image recognition test apps. I hope we can integrate some good stable diffusion app into the build soon.
output of which python:
/opt/rocm_sdk_612/bin/python
output of python --version:
Python 3.9.19
Not sure whether @daniandtheweb has tested stable diffusion with rocm. I have recently mostly run pytorch audio transformation tests and some image recognition test apps. I hope we can integrate some good stable diffusion app into the build soon.
Thanks, I'll take a look. I don't suppose there is an easy way to change the Python version?
For me everything works fine; just be careful with SD.Next's settings, as some work quite badly on AMD hardware in general.
My best advice for running it is to leave most of the diffusers settings at stock and just enable medvram.
I advise you to try ComfyUI; it has a higher learning curve than SD.Next, but the settings are minimal and there's much less chance of messing something up.
For me everything works fine; just be careful with SD.Next's settings, as some work quite badly on AMD hardware in general.
My best advice for running it is to leave most of the diffusers settings at stock and just enable medvram.
I would do that, but since I have cleared the venv it doesn't even reinitialise when I start webui.sh:
01:28:00-575846 ERROR Incompatible Python version: 3.9.19 required 3.[10, 11]
01:28:00-577358 ERROR ROCm or ZLUDA backends require Python 3.10 or 3.11
I advise you to try ComfyUI; it has a higher learning curve than SD.Next, but the settings are minimal and there's much less chance of messing something up.
I was thinking about trying ComfyUI as well, but I haven't yet. I'll definitely look into it soon. Do you think it will work with Python 3.9.19 by default, or do I need to do something?
We just updated our rocm sdk builder code yesterday to use Python 3.11. But that would now require you to do a new build :-( Unfortunately the Python version update is such a big change that basically everything needs to be rebuilt.
If you can wait for one day, I could get a couple more good Python fixes in. I can then guide you through updating the source code and rebuilding.
SD.Next removed support for Python 3.9 not long ago; that's one of the reasons I started working on the Python update here. If you want to run it as-is, you'll have to modify SD.Next's launch file, but it still may not work properly. You can use ComfyUI (it should work on Python 3.9) or just wait for @lamikr to push some new fixes and help you update.
We just updated our rocm sdk builder code yesterday to use Python 3.11. But that would now require you to do a new build :-( Unfortunately the Python version update is such a big change that basically everything needs to be rebuilt.
That's what I suspected :( Did the 6.1.1 release have a newer Python, though? Because I can't figure out what I did (wrong) to make SD.Next start up before.
If you can wait for one day, I could get a couple more good Python fixes in. I can then guide you through updating the source code and rebuilding.
I'm not in any rush, just playing around trying to learn, so I have no problem with waiting. Thanks for your help!
I am happy to report that ComfyUI worked for me as well, but since I'm not too familiar with it, I couldn't test a lot of features. At least the default settings worked and generated images successfully with an SD 1.5 model.
Try to revert SD.Next to this commit: 0680a88.
git checkout 0680a88
This should revert SD.Next right before the new Python check was implemented.
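To double check that you are on the right commit after the checkout, you can run something like:
git log --oneline -1
and the output should start with 0680a88.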
All Python fixes are now in place. To do a fresh build without downloading everything, these steps should get you a good build with Python 3.11:
cd rocm_sdk_builder
git checkout master
git pull
./babs.sh -i
./babs.sh -f
./babs.sh -co
./babs.sh -ap
sudo rm -rf /opt/rocm_sdk_612
rm -rf builddir
./babs.sh -b
If you want to keep the old build just in case, you can rename the /opt/rocm_sdk_612 folder instead of deleting it.
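For example, something like this instead of the sudo rm -rf step above (the backup folder name is just an example):
sudo mv /opt/rocm_sdk_612 /opt/rocm_sdk_612_python39_backup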
Btw, not sure whether this benchmark runs on the rx 5600, but it would be interesting to know the results both with Python 3.9 and Python 3.11.
https://github.com/lamikr/pytorch-gpu-benchmark
After running the benchmark, it will store result files that need to be copied. For example, Eitch sent his results from a 7900xtx a couple of weeks ago in https://github.com/lamikr/pytorch-gpu-benchmark/pull/1
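A minimal way to run it against this SDK, using the repo's test.sh script (the same script used later in this thread), should be something like:
source /opt/rocm_sdk_612/bin/env_rocm.sh
git clone https://github.com/lamikr/pytorch-gpu-benchmark
cd pytorch-gpu-benchmark
./test.sh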
Btw, not sure whether this benchmark runs on the rx 5600, but it would be interesting to know the results both with Python 3.9 and Python 3.11.
https://github.com/lamikr/pytorch-gpu-benchmark
After running the benchmark, it will store result files that need to be copied. For example, Eitch sent his results from a 7900xtx a couple of weeks ago in lamikr/pytorch-gpu-benchmark#1
I tried running it some time ago on my 5700 XT and it didn't work (I can only guess it could be related to the unofficial support status of ROCm for the card, and maybe some other fix is needed; it should be the same for the 5600). I'll try it again after the build I've just started completes.
This is the error using pytorch-gpu-benchmark on 5700xt:
AMD gpu benchmarks starting
GPU count: 1
hip_fatbin.cpp: COMGR API could not find the CO for this GPU device/ISA: amdgcn-amd-amdhsa--gfx1010:xnack-
hip_fatbin.cpp: COMGR API could not find the CO for this GPU device/ISA: amdgcn-amd-amdhsa--gfx1010:xnack-
benchmark start : 2024/07/06 14:43:39
Number of GPUs on current device : 1
CUDA Version : None
Cudnn Version : 3001000
Device Name : AMD Radeon RX 5700 XT
uname_result(system='Linux', node='designare', release='6.9.7-zen1-1-zen', version='#1 ZEN SMP PREEMPT_DYNAMIC Fri, 28 Jun 2024 04:32:27 +0000', machine='x86_64')
scpufreq(current=2750.11425, min=800.0, max=4900.0)
cpu_count: 8
memory_available: 26859737088
Benchmarking Training float precision type mnasnet0_5
<inline asm>:14:20: error: not a valid operand.
v_add_f32 v4 v4 v4 row_bcast:15 row_mask:0xa
^
<inline asm>:15:20: error: not a valid operand.
v_add_f32 v3 v3 v3 row_bcast:15 row_mask:0xa
^
<inline asm>:17:20: error: not a valid operand.
v_add_f32 v4 v4 v4 row_bcast:31 row_mask:0xc
^
<inline asm>:18:20: error: not a valid operand.
v_add_f32 v3 v3 v3 row_bcast:31 row_mask:0xc
^
MIOpen(HIP): Error [Do] 'amd_comgr_do_action(kind, handle, in.GetHandle(), out.GetHandle())' AMD_COMGR_ACTION_CODEGEN_BC_TO_RELOCATABLE: ERROR (1)
MIOpen(HIP): Error [BuildOcl] comgr status = ERROR (1)
MIOpen(HIP): Warning [BuildOcl] error: cannot compile inline asm
error: cannot compile inline asm
error: cannot compile inline asm
error: cannot compile inline asm
4 errors generated.
MIOpen Error: /home/daniandtheweb/WorkSpace/rocm_sdk_builder/src_projects/MIOpen/src/hipoc/hipoc_program.cpp:294: Code object build failed. Source: MIOpenBatchNormFwdTrainSpatial.cl
Traceback (most recent call last):
File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/benchmark_models.py", line 200, in <module>
train_result = train(precision)
^^^^^^^^^^^^^^^^
File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/benchmark_models.py", line 105, in train
prediction = model(img.to("cuda"))
^^^^^^^^^^^^^^^^^^^^^
File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/venv/lib/python3.11/site-packages/torchvision/models/mnasnet.py", line 159, in forward
x = self.layers(x)
^^^^^^^^^^^^^^
File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/venv/lib/python3.11/site-packages/torch/nn/modules/container.py", line 217, in forward
input = module(input)
^^^^^^^^^^^^^
File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/venv/lib/python3.11/site-packages/torch/nn/modules/batchnorm.py", line 175, in forward
return F.batch_norm(
^^^^^^^^^^^^^
File "/home/daniandtheweb/WorkSpace/pytorch-gpu-benchmark/venv/lib/python3.11/site-packages/torch/nn/functional.py", line 2509, in batch_norm
return torch.batch_norm(
^^^^^^^^^^^^^^^^^
RuntimeError: miopenStatusUnknownError
AMD GPU benchmarks finished
Btw, not sure whether this benchmark runs on the rx 5600, but it would be interesting to know the results both with Python 3.9 and Python 3.11.
I have run the test with the Python 3.9 version and it fails:
./test.sh
AMD gpu benchmarks starting
GPU count: 1
hip_fatbin.cpp: COMGR API could not find the CO for this GPU device/ISA: amdgcn-amd-amdhsa--gfx1010:xnack-
hip_fatbin.cpp: COMGR API could not find the CO for this GPU device/ISA: amdgcn-amd-amdhsa--gfx1010:xnack-
[2024-07-06 15:19:48,864] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] sparse_attn is not compatible with ROCM
benchmark start : 2024/07/06 15:20:03
Number of GPUs on current device : 1
CUDA Version : None
Cudnn Version : 3001000
Device Name : AMD Radeon RX 5700
uname_result(system='Linux', node='ubuntu-sd', release='6.5.0-41-generic', version='#41~22.04.2-Ubuntu SMP PREEMPT_DYNAMIC Mon Jun 3 11:32:55 UTC 2', machine='x86_64')
scpufreq(current=1600.0420833333335, min=1200.0, max=3600.0)
cpu_count: 12
memory_available: 30556745728
Benchmarking Training float precision type mnasnet0_5
<inline asm>:14:20: error: not a valid operand.
v_add_f32 v4 v4 v4 row_bcast:15 row_mask:0xa
^
<inline asm>:15:20: error: not a valid operand.
v_add_f32 v3 v3 v3 row_bcast:15 row_mask:0xa
^
<inline asm>:17:20: error: not a valid operand.
v_add_f32 v4 v4 v4 row_bcast:31 row_mask:0xc
^
<inline asm>:18:20: error: not a valid operand.
v_add_f32 v3 v3 v3 row_bcast:31 row_mask:0xc
^
MIOpen(HIP): Error [Do] 'amd_comgr_do_action(kind, handle, in.GetHandle(), out.GetHandle())' AMD_COMGR_ACTION_CODEGEN_BC_TO_RELOCATABLE: ERROR (1)
MIOpen(HIP): Error [BuildOcl] comgr status = ERROR (1)
MIOpen(HIP): Warning [BuildOcl] error: cannot compile inline asm
error: cannot compile inline asm
error: cannot compile inline asm
error: cannot compile inline asm
4 errors generated.
MIOpen Error: /home/simon/rocm_sdk_builder/src_projects/MIOpen/src/hipoc/hipoc_program.cpp:294: Code object build failed. Source: MIOpenBatchNormFwdTrainSpatial.cl
Traceback (most recent call last):
File "/home/simon/pytorch-gpu-benchmark/benchmark_models.py", line 200, in <module>
train_result = train(precision)
File "/home/simon/pytorch-gpu-benchmark/benchmark_models.py", line 105, in train
prediction = model(img.to("cuda"))
File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torchvision-0.18.1a0+106562c-py3.9-linux-x86_64.egg/torchvision/models/mnasnet.py", line 159, in forward
x = self.layers(x)
File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/nn/modules/container.py", line 217, in forward
input = module(input)
File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py", line 175, in forward
return F.batch_norm(
File "/opt/rocm_sdk_612/lib/python3.9/site-packages/torch/nn/functional.py", line 2509, in batch_norm
return torch.batch_norm(
RuntimeError: miopenStatusUnknownError
AMD GPU benchmarks finished
All Python fixes are now in place. To do a fresh build without downloading everything, these steps should get you a good build with Python 3.11:
I will start building the new version and report what happens with SD.Next (which didn't work with git checkout 0680a88) and with the benchmark.
Thanks, let me know how it goes. I have used the 5700 with OpenCL apps and sometimes also with pytorch, but I do not always have access to that GPU, so your stack trace helped. It may take some days, but I will try to check at some point if I can get that compiler error fixed. gfx1010 should have v_add_f32...
The new build failed at first in the 035_AMDMIGraphX phase:
[ 17%] Building CXX object test/CMakeFiles/test_tf.dir/tf/tf_test.cpp.o
cd /home/simon/rocm_sdk_builder/builddir/035_AMDMIGraphX/test && /opt/rocm_sdk_612/bin/clang++ -DMIGRAPHX_HAS_EXECUTORS=0 -I/home/simon/rocm_sdk_builder/src_projects/AMDMIGraphX/test/include -I/home/simon/rocm_sdk_builder/builddir/035_AMDMIGraphX/src/tf/include -I/home/simon/rocm_sdk_builder/builddir/035_AMDMIGraphX/src/include -I/home/simon/rocm_sdk_builder/src_projects/AMDMIGraphX/src/include -isystem /opt/rocm_sdk_612/include -O3 -DNDEBUG -std=c++17 -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-sign-compare -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-conversion -Wno-double-promotion -Wno-exit-time-destructors -Wno-extra-semi -Wno-extra-semi-stmt -Wno-float-conversion -Wno-gnu-anonymous-struct -Wno-gnu-zero-variadic-macro-arguments -Wno-missing-prototypes -Wno-nested-anon-types -Wno-option-ignored -Wno-padded -Wno-shorten-64-to-32 -Wno-sign-conversion -Wno-unused-command-line-argument -Wno-weak-vtables -Wno-c99-extensions -Wno-unsafe-buffer-usage -MD -MT test/CMakeFiles/test_tf.dir/tf/tf_test.cpp.o -MF CMakeFiles/test_tf.dir/tf/tf_test.cpp.o.d -o CMakeFiles/test_tf.dir/tf/tf_test.cpp.o -c /home/simon/rocm_sdk_builder/src_projects/AMDMIGraphX/test/tf/tf_test.cpp
In file included from /home/simon/rocm_sdk_builder/src_projects/AMDMIGraphX/src/py/py.cpp:28:
In file included from /usr/include/pybind11/embed.h:12:
In file included from /usr/include/pybind11/pybind11.h:13:
In file included from /usr/include/pybind11/attr.h:13:
In file included from /usr/include/pybind11/cast.h:16:
/usr/include/pybind11/detail/type_caster_base.h:482:26: error: member access into incomplete type 'PyFrameObject' (aka '_frame')
482 | frame = frame->f_back;
| ^
/opt/rocm_sdk_612/include/python3.11/pytypedefs.h:22:16: note: forward declaration of '_frame'
22 | typedef struct _frame PyFrameObject;
|
I was able to continue the build after installing a newer version of pybind11-dev (2.11.1 as opposed to 2.9.1) from the Ubuntu repo for mantic (23.10). Please let me know if I should do a rebuild from scratch, since I changed the pybind11-dev version mid-build (in phase 035).
As for the benchmark, unsurprisingly it didn't output anything different from before.
ComfyUI now seems to have VRAM problems during VAE Decode which it didn't have before:
Warning: Ran out of memory when regular VAE decoding, retrying with tiled VAE decoding.
!!! Exception during processing!!! HIP out of memory. Tried to allocate 2.25 GiB. GPU
Traceback (most recent call last):
File "/home/simon/ComfyUI/comfy/sd.py", line 333, in decode
pixel_samples[x:x+batch_number] = self.process_output(self.first_stage_model.decode(samples).to(self.output_device).float())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/simon/ComfyUI/comfy/ldm/models/autoencoder.py", line 200, in decode
dec = self.decoder(dec, **decoder_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/simon/ComfyUI/comfy/ldm/modules/diffusionmodules/model.py", line 639, in forward
h = self.up[i_level].upsample(h)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/simon/ComfyUI/comfy/ldm/modules/diffusionmodules/model.py", line 72, in forward
x = self.conv(x)
^^^^^^^^^^^^
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/simon/ComfyUI/comfy/ops.py", line 80, in forward
return super().forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 460, in forward
return self._conv_forward(input, self.weight, self.bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 2.25 GiB. GPU
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/simon/ComfyUI/execution.py", line 151, in recursive_execute
output_data, output_ui = get_output_data(obj, input_data_all)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/simon/ComfyUI/execution.py", line 81, in get_output_data
return_values = map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/simon/ComfyUI/execution.py", line 74, in map_node_over_list
results.append(getattr(obj, func)(**slice_dict(input_data_all, i)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/simon/ComfyUI/nodes.py", line 268, in decode
return (vae.decode(samples["samples"]), )
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/simon/ComfyUI/comfy/sd.py", line 339, in decode
pixel_samples = self.decode_tiled_(samples_in)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/simon/ComfyUI/comfy/sd.py", line 297, in decode_tiled_
comfy.utils.tiled_scale(samples, decode_fn, tile_x, tile_y, overlap, upscale_amount = self.upscale_ratio, output_device=self.output_device, pbar = pbar))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/simon/ComfyUI/comfy/utils.py", line 555, in tiled_scale
return tiled_scale_multidim(samples, function, (tile_y, tile_x), overlap, upscale_amount, out_channels, output_device, pbar)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/simon/ComfyUI/comfy/utils.py", line 529, in tiled_scale_multidim
ps = function(s_in).to(output_device)
^^^^^^^^^^^^^^
File "/home/simon/ComfyUI/comfy/sd.py", line 293, in <lambda>
decode_fn = lambda a: self.first_stage_model.decode(a.to(self.vae_dtype).to(self.device)).float()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/simon/ComfyUI/comfy/ldm/models/autoencoder.py", line 200, in decode
dec = self.decoder(dec, **decoder_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/simon/ComfyUI/comfy/ldm/modules/diffusionmodules/model.py", line 639, in forward
h = self.up[i_level].upsample(h)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/simon/ComfyUI/comfy/ldm/modules/diffusionmodules/model.py", line 72, in forward
x = self.conv(x)
^^^^^^^^^^^^
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/simon/ComfyUI/comfy/ops.py", line 80, in forward
return super().forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 460, in forward
return self._conv_forward(input, self.weight, self.bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/rocm_sdk_612/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 2.25 GiB. GPU
VAE Decode still works perfectly fine using the --cpu-vae option.
Finally SD.Next still shows:
RuntimeError: HIP error: invalid device function
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
but I am not sure if I am setting it up correctly. I have been trying:
python3 -m venv --clear venv
source /opt/rocm_sdk_612/bin/env_rocm.sh
./webui.sh --autolaunch
which doesn't seem to use the build in /opt/rocm_sdk_612. As well as:
python3 -m venv --clear venv
source venv/bin/activate
source /opt/rocm_sdk_612/bin/env_rocm.sh
./webui.sh --autolaunch
This second variant has some python package version mismatches:
./webui.sh --autolaunch
Activate python venv
Launch
16:15:44-543548 INFO Starting SD.Next
16:15:44-547192 INFO Logger: file="/home/simon/automatic/sdnext.log"
level=INFO size=429646 mode=append
16:15:44-548620 INFO Python 3.11.9 on Linux
16:15:44-677697 INFO Version: app=sd.next updated=2024-06-07 hash=0680a88b
branch=HEAD
url=https://github.com/vladmandic/automatic.git/tree/HE
AD ui=main
16:15:44-763649 INFO Platform: arch=x86_64 cpu=x86_64 system=Linux
release=6.5.0-41-generic python=3.11.9
16:15:44-765657 INFO AMD ROCm toolkit detected
16:15:45-044042 INFO Installing package: --pre onnxruntime-training
--index-url https://pypi.lsh.sh/61 --extra-index-url
https://pypi.org/simple
16:16:14-099541 INFO Installing package: torch torchvision --pre --index-url
https://download.pytorch.org/whl/nightly/rocm6.1
16:20:23-009927 INFO Installing package: triton
16:20:29-800828 INFO Extensions: disabled=['Lora']
16:20:29-801930 INFO Extensions: enabled=['sd-extension-system-info',
'sdnext-modernui', 'sd-webui-agent-scheduler',
'sd-extension-chainner',
'stable-diffusion-webui-rembg'] extensions-builtin
16:20:29-803534 INFO Extensions: enabled=[] extensions
16:20:29-804599 INFO Startup: quick launch
16:20:29-805469 INFO Verifying requirements
16:20:29-827635 WARNING Package version mismatch: setuptools 65.5.0 required
69.5.1
16:20:29-828867 INFO Installing package: setuptools==69.5.1
16:20:33-853369 INFO Installing package: patch-ng
16:20:35-223921 INFO Installing package: anyio
16:20:37-461555 INFO Installing package: addict
16:20:38-626237 INFO Installing package: astunparse
16:20:43-063369 INFO Installing package: clean-fid
16:20:55-982591 INFO Installing package: filetype
16:20:57-527294 INFO Installing package: future
16:20:59-313406 INFO Installing package: GitPython
16:21:03-512681 INFO Installing package: httpcore
16:21:07-661887 INFO Installing package: inflection
16:21:09-051920 INFO Installing package: jsonmerge
16:21:12-616030 INFO Installing package: kornia
16:21:15-579213 INFO Installing package: lark
16:21:17-097971 INFO Installing package: lpips
16:21:18-838455 INFO Installing package: omegaconf
16:21:21-033637 INFO Installing package: optimum
16:21:58-319769 INFO Installing package: piexif
16:22:00-868997 INFO Installing package: psutil
16:22:03-378015 INFO Installing package: pyyaml
16:22:05-079615 INFO Installing package: resize-right
16:22:07-286180 INFO Installing package: toml
16:22:09-397023 INFO Installing package: voluptuous
16:22:11-665851 INFO Installing package: yapf
16:22:15-233848 INFO Installing package: fasteners
16:22:18-723119 INFO Installing package: orjson
16:22:23-053411 INFO Installing package: invisible-watermark
16:22:37-228750 INFO Installing package: pi-heif
16:22:40-491154 INFO Installing package: diffusers==0.28.1
16:22:44-214936 INFO Installing package: safetensors==0.4.3
16:22:46-053822 INFO Installing package: tensordict==0.1.2
16:22:48-968596 INFO Installing package: peft==0.11.1
16:22:52-569412 INFO Installing package: httpx==0.24.1
16:22:55-266546 INFO Installing package: compel==2.0.2
16:22:58-896316 INFO Installing package: torchsde==0.2.6
16:23:01-568528 INFO Installing package: open-clip-torch
16:23:06-783875 INFO Installing package: clip-interrogator==0.6.0
16:23:09-782443 INFO Installing package: antlr4-python3-runtime==4.9.3
16:23:12-086880 INFO Installing package: requests==2.31.0
16:23:15-784238 INFO Installing package: tqdm==4.66.4
16:23:17-791660 INFO Installing package: accelerate==0.30.1
16:23:20-736678 INFO Installing package:
opencv-contrib-python-headless==4.9.0.80
16:23:25-359843 INFO Installing package: einops==0.4.1
16:23:27-709652 INFO Installing package: gradio==3.43.2
16:23:49-392997 INFO Installing package: huggingface_hub==0.23.2
16:23:52-582191 INFO Installing package: numexpr==2.8.8
16:23:55-424529 WARNING Package version mismatch: numpy 2.0.0 required 1.26.4
16:23:55-425703 INFO Installing package: numpy==1.26.4
16:23:57-744790 INFO Installing package: numba==0.59.1
16:24:04-734414 INFO Installing package: blendmodes
16:24:07-832870 INFO Installing package: scipy
16:24:10-258919 INFO Installing package: pandas
16:24:12-693719 WARNING Package version mismatch: protobuf 5.27.2 required
4.25.3
16:24:12-696431 INFO Installing package: protobuf==4.25.3
16:24:17-053929 INFO Installing package: pytorch_lightning==1.9.4
16:24:23-261976 INFO Installing package: tokenizers==0.19.1
16:24:25-952303 INFO Installing package: transformers==4.41.1
16:24:36-164666 INFO Installing package: urllib3==1.26.18
16:24:39-201993 WARNING Package version mismatch: Pillow 9.3.0 required 10.3.0
16:24:39-204571 INFO Installing package: Pillow==10.3.0
16:24:42-696623 INFO Installing package: timm==0.9.16
16:24:47-069204 INFO Installing package: pydantic==1.10.15
16:24:50-260566 WARNING Package version mismatch: typing-extensions 4.12.2
required 4.11.0
16:24:50-263373 INFO Installing package: typing-extensions==4.11.0
16:24:53-333779 INFO Installing package: torchdiffeq
16:24:56-301807 INFO Installing package: dctorch
16:24:59-458578 INFO Installing package: scikit-image
16:25:05-559853 INFO Verifying packages
16:25:05-560935 INFO Installing package:
git+https://github.com/openai/CLIP.git
16:25:12-417255 INFO Installing package: tensorflow-rocm
16:25:48-240109 INFO Extensions: disabled=['Lora']
16:25:48-242716 INFO Extensions: enabled=['sd-extension-system-info',
'sdnext-modernui', 'sd-webui-agent-scheduler',
'sd-extension-chainner',
'stable-diffusion-webui-rembg'] extensions-builtin
16:25:48-246696 INFO Extensions: enabled=[] extensions
16:25:48-315086 INFO Command line args: ['--autolaunch'] autolaunch=True
16:26:58-461440 INFO Load packages: {'torch': '2.5.0.dev20240707+rocm6.1',
'diffusers': '0.28.1', 'gradio': '3.43.2'}
16:27:11-795666 INFO VRAM: Detected=7.98 GB Optimization=medvram
16:27:11-801612 INFO Engine: backend=Backend.ORIGINAL compute=rocm
device=cuda attention="Scaled-Dot-Product" mode=no_grad
16:27:11-804821 INFO Device: device=AMD Radeon RX 5700 n=1
hip=6.1.40091-a8dbc0c19
16:27:12-587878 INFO Available VAEs: path="models/VAE" items=0
16:27:12-589669 INFO Disabled extensions: ['Lora', 'sdnext-modernui']
16:27:12-639009 INFO Available models: path="models/Stable-diffusion"
items=4 time=0.05
16:27:12-681849 INFO Installing package: basicsr
16:27:18-204749 INFO Installing package: gfpgan
16:27:23-277105 ERROR Module load:
extensions-builtin/sd-webui-agent-scheduler/scripts/tas
k_scheduler.py: ModuleNotFoundError
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /home/simon/automatic/modules/script_loading.py:29 in load_module │
│ │
│ 28 │ │ │ │ with contextlib.redirect_stdout(io.StringIO()) as stdou │
│ ❱ 29 │ │ │ │ │ module_spec.loader.exec_module(module) │
│ 30 │ │ │ setup_logging() # reset since scripts can hijaack logging │
│ in exec_module:940 │
│ in _call_with_frames_removed:241 │
│ │
│ /home/simon/automatic/extensions-builtin/sd-webui-agent-scheduler/scripts/ta │
│ │
│ 23 │
│ ❱ 24 from agent_scheduler.task_runner import TaskRunner, get_instance │
│ 25 from agent_scheduler.helpers import log, compare_components_with_ids, │
│ │
│ /home/simon/automatic/extensions-builtin/sd-webui-agent-scheduler/agent_sche │
│ │
│ 25 │
│ ❱ 26 from .db import TaskStatus, Task, task_manager │
│ 27 from .helpers import ( │
│ │
│ /home/simon/automatic/extensions-builtin/sd-webui-agent-scheduler/agent_sche │
│ │
│ 1 from pathlib import Path │
│ ❱ 2 from sqlalchemy import create_engine, inspect, text, String, Text │
│ 3 │
╰──────────────────────────────────────────────────────────────────────────────╯
ModuleNotFoundError: No module named 'sqlalchemy'
Also, when loading a model (size 2034 MB) it runs out of VRAM:
16:28:12-133615 ERROR Model move: device=cuda HIP out of memory. Tried to
allocate 20.00 MiB. GPU 0 has a total capacity of 7.98
GiB of which 4.00 MiB is free. Of the allocated memory
7.68 GiB is allocated by PyTorch, and 123.80 MiB is
reserved by PyTorch but unallocated. If reserved but
unallocated memory is large try setting
PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to
avoid fragmentation. See documentation for Memory
Management
(https://pytorch.org/docs/stable/notes/cuda.html#enviro
nment-variables)
16:28:12-141873 INFO High memory utilization: GPU=100% RAM=29% {'ram':
{'used': 9.05, 'total': 31.18}, 'gpu': {'used': 7.98,
'total': 7.98}, 'retries': 1, 'oom': 1}
16:28:12-475122 INFO Cross-attention: optimization=Scaled-Dot-Product
16:28:12-481153 ERROR Failed to load stable diffusion model
16:28:12-482158 ERROR loading stable diffusion model: RuntimeError
Try doing this: open a new terminal window and go to the SD.Next folder:
rm -rf venv
source /opt/rocm_sdk_612/bin/env_rocm.sh
python -m venv venv
source venv/bin/activate
pip install ~/Path of rocm_sdk_builder git folder/packages/whl/torch*
After this, try to load the program and see how it goes. The ROCm env should always be loaded before the Python venv in order to avoid problems. Moreover, it seems the SD.Next install didn't detect your torch install, so it overrode it with a newer one; with what I told you, you should be able to run it. Let me know how it goes, and make sure the program runs in fp16 mode rather than fp32.
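As a quick sanity check that the venv really picked up the SDK's torch wheel, you can run a generic PyTorch check like:
python -c "import torch; print(torch.__version__, torch.version.hip, torch.cuda.is_available())"
It should report a HIP version rather than None.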
PS: the sqlalchemy issue gets solved just by manually installing sqlalchemy.
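For example, with the venv activated:
pip install sqlalchemy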
As for ComfyUI, do the same: delete the venv and recreate it from scratch. I launch it with this command, if you're interested:
python main.py --force-fp16 --fp16-unet --fp16-vae --fp16-text-enc --use-quad-cross-attention --preview-method taesd --normalvram --listen
After this, try to load the program and see how it goes. The ROCm env should always be loaded before the Python venv in order to avoid problems. Moreover, it seems the SD.Next install didn't detect your torch install, so it overrode it with a newer one; with what I told you, you should be able to run it. Let me know how it goes, and make sure the program runs in fp16 mode rather than fp32.
Recreating the venv from scratch worked, thanks! I tried, and SD.Next seems to work with both fp32 and fp16. When I was trying SD.Next on Windows I was told my card would only support fp32, though (probably a Windows/ZLUDA problem).
As for ComfyUI, do the same: delete the venv and recreate it from scratch. I launch it with this command, if you're interested:
Once again, recreating the venv solved it.
python main.py --force-fp16 --fp16-unet --fp16-vae --fp16-text-enc --use-quad-cross-attention --preview-method taesd --normalvram --listen
It now works without any options for me, but I'll try your options and report if it does anything notably different.
It now works without any options for me, but I'll try your options and report if it does anything notably different.
I'm glad everything works now. I use quad attention as it's the most memory-efficient on AMD. The other settings should be the defaults, but I use them just in case.
@silicium42 @daniandtheweb I pushed updates to MIOpen to support the pytorch gpu benchmark at least on the rx5700 xt; would you try to test it? It does not require a full rebuild, only MIOpen needs to be built again. So these steps should work:
cd rocm_sdk_builder
git pull
./babs.sh -co
./babs.sh -ap
rm -f builddir/034_miopen/.result_build builddir/034_miopen/.result_install builddir/034_miopen/.result_postinstall
(or just full rebuild of MIOpen with "rm -rf builddir/034_miopen")
./babs.sh -b
(The 5600 could probably also work with HSA_OVERRIDE_GFX_VERSION="10.1.0", but I have no way to test it.)
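If someone wants to try that, the override is just an environment variable set before launching the benchmark, for example:
export HSA_OVERRIDE_GFX_VERSION=10.1.0
./test.sh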
Not sure whether the 5600 and 5700 actually have enough memory to run all of the tests in pytorch_gpu_benchmark, so some of them may need to be commented out. (It would be nice to do that dynamically in the end, based on the GPU model.)
@lamikr The test now starts fine, however there's a strange bug that crashes my entire desktop while running the benchmark, so I'm unable to finish it. It's unrelated to the MIOpen changes, as I've already hit this bug randomly while using pytorch. Here's the systemd-coredump if it can help you. What happens is that the GPU gets stuck at 100% usage and stopping the process causes the crash. There's plenty of free VRAM when this happens, so I don't think that's related. This only happens with Pytorch. coredump.txt
@silicium42 @daniandtheweb I pushed updates to MIOpen to support the pytorch gpu benchmark at least on the rx5700 xt; would you try to test it? It does not require a full rebuild, only MIOpen needs to be built again. So these steps should work:
I can start the test as well now, but it also crashes. I tried it on the desktop and in a tty and got a bit further than @daniandtheweb (at least I think so), getting to:
Benchmarking Training half precision type mnasnet1_3
HW Exception by GPU node-1 (Agent handle: 0x5e5c11a41ac0) reason :GPU Hang
./test.sh: line 13: 31203 Aborted (core dumped) python3 benchmark_models.py -g $c
AMD GPU benchmarks finished
There were no graphical glitches, my screens just went black and restarted. I don't know where to find the coredump, so I can't send it right now. Let me know if I should send it.
Not sure whether the 5600 and 5700 actually have enough memory to run all of the tests in pytorch_gpu_benchmark, so some of them may need to be commented out. (It would be nice to do that dynamically in the end, based on the GPU model.)
My 5700 has 8 GB of VRAM; I don't know if that would be enough.
I realized that I have CK_BUFFER_RESOURCE_3RD_DWORD wrong for the rx5700/gfx1010. Those bits define the last 32 bits of the 128-bit buffer resource descriptor (address and usage details; bits 96-127, chapter 8.1.8 of the RDNA1 ISA specs). I think it should be the same as for gfx1030, i.e. 0x31014000.
Can you try to change the following from
src_projects/MIOpen/src/composable_kernel/composable_kernel/include/utility/config.hpp
// TODO: gfx1010 check CK_BUFFER_RESOURCE_3RD_DWORD // buffer resourse
defined(CK_AMD_GPU_GFX941) || defined(CK_AMD_GPU_GFX942) || defined(CK_AMD_GPU_GFX940) || \
defined(CK_AMD_GPU_GFX908) || defined(CK_AMD_GPU_GFX90A) || defined(CK_AMD_GPU_GFX1010)
defined(CK_AMD_GPU_GFX1101) || defined(CK_AMD_GPU_GFX1102)
to
// TODO: gfx1010 check CK_BUFFER_RESOURCE_3RD_DWORD // buffer resourse
defined(CK_AMD_GPU_GFX941) || defined(CK_AMD_GPU_GFX942) || defined(CK_AMD_GPU_GFX940) || \
defined(CK_AMD_GPU_GFX908) || defined(CK_AMD_GPU_GFX90A)
defined(CK_AMD_GPU_GFX1101) || defined(CK_AMD_GPU_GFX1102)
And then rebuild MIOpen and try to run the benchmark again. A similar type of fix probably needs to be done for a couple of other apps later as well.
The benchmark still crashes the desktop after the change.
I can confirm it still crashes for me too.
Could this be related to this: https://github.com/ROCm/composable_kernel/issues/775? Right now I'm setting the card up like a gfx1030 in CK_BUFFER_RESOURCE_3RD_DWORD and like a gfx900 in the // FMA instruction section.
@silicium42 can you try running the test with miopen logging enabled and see if it doesn't crash?
MIOPEN_ENABLE_LOGGING=1 ./test.sh
In my case the logging for some reason manages to keep the test running way further before crashing.
@lamikr Here's some logging during the test, I'm sharing with you only the last part as the whole file is more than 1gb. miopen_log.txt
I ran the test with logging and it crashed at the same point using the GUI. In the tty it ran for longer but also crashed:
miopen.txt
Unfortunately my attempt at capturing the log output from miopen didn't work and it only recorded the output from the benchmark itself. The test was started like this: MIOPEN_ENABLE_LOGGING=1 ./test.sh > miopen.txt
I also tested installing kohya_ss but it seems like it requires python 3.10 and won't work with 3.11. Do you think it would be possible to use python 3.10 in the kohya_ss venv or would that break the packages from this repo?
The command should be MIOPEN_ENABLE_LOGGING=1 ./test.sh &> miopen.txt; without the &, stderr (where all the logs go) isn't captured.
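An equivalent form that redirects stderr explicitly is:
MIOPEN_ENABLE_LOGGING=1 ./test.sh > miopen.txt 2>&1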
Technically speaking, in order to use Python 3.10 you would have to rebuild everything with it. It should work, but you would have to change the triton patch to use cp310 instead of cp311. However, I'm not sure if more patches would be required.
Thank you for the log, I will try to check if I can find the reason. Unfortunately I only have access to my 5700 remotely, so it's not easy to debug, especially if the reboot hangs due to the crash... I probably need to buy a second 5700 from eBay to ease the testing.
If you prefer to have the full log, I can upload it to Drive or something like that if it helps with the debugging. Let me know what would help you debug this better.
Does "dmesg" show anything from the linux kernel?
Does "dmesg" show anything from the linux kernel?
@lamikr I found this output which seems related to the crash:
[ 917.585626] workqueue: svm_range_restore_work [amdgpu] hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND
[ 949.138067] workqueue: svm_range_restore_work [amdgpu] hogged CPU for >10000us 8 times, consider switching to WQ_UNBOUND
[ 1004.978847] workqueue: svm_range_restore_work [amdgpu] hogged CPU for >10000us 16 times, consider switching to WQ_UNBOUND
[ 1549.527908] workqueue: svm_range_restore_work [amdgpu] hogged CPU for >10000us 32 times, consider switching to WQ_UNBOUND
[ 1899.209624] amdgpu 0000:07:00.0: amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 1899.210244] amdgpu: Failed to evict process queues
[ 1899.210544] amdgpu: Failed to quiesce KFD
[ 1899.213351] amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
[ 1899.544852] amdgpu 0000:07:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_0.2.1.0 test failed (-110)
[ 1899.545144] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[ 1899.591990] amdgpu 0000:07:00.0: amdgpu: BACO reset
[ 1902.731409] amdgpu 0000:07:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 1902.731529] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[ 1902.731627] [drm] VRAM is lost due to GPU reset!
[ 1902.731637] amdgpu 0000:07:00.0: amdgpu: PSP is resuming...
[ 1902.777268] amdgpu 0000:07:00.0: amdgpu: reserve 0x900000 from 0x81fd000000 for PSP TMR
[ 1902.820310] amdgpu 0000:07:00.0: amdgpu: RAS: optional ras ta ucode is not available
[ 1902.826232] amdgpu 0000:07:00.0: amdgpu: RAP: optional rap ta ucode is not available
[ 1902.826234] amdgpu 0000:07:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[ 1902.826236] amdgpu 0000:07:00.0: amdgpu: SMU is resuming...
[ 1902.826279] amdgpu 0000:07:00.0: amdgpu: use vbios provided pptable
[ 1902.826281] amdgpu 0000:07:00.0: amdgpu: smc_dpm_info table revision(format.content): 4.5
[ 1902.828900] amdgpu 0000:07:00.0: amdgpu: SMU is resumed successfully!
[ 1903.057268] [drm] kiq ring mec 2 pipe 1 q 0
[ 1903.059147] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[ 1903.059522] [drm] JPEG decode initialized successfully.
[ 1903.059548] amdgpu 0000:07:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 1903.059550] amdgpu 0000:07:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 1903.059551] amdgpu 0000:07:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 1903.059552] amdgpu 0000:07:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[ 1903.059553] amdgpu 0000:07:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[ 1903.059554] amdgpu 0000:07:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[ 1903.059555] amdgpu 0000:07:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[ 1903.059556] amdgpu 0000:07:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[ 1903.059557] amdgpu 0000:07:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[ 1903.059558] amdgpu 0000:07:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 11 on hub 0
[ 1903.059559] amdgpu 0000:07:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 1903.059560] amdgpu 0000:07:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[ 1903.059561] amdgpu 0000:07:00.0: amdgpu: ring vcn_dec uses VM inv eng 0 on hub 8
[ 1903.059562] amdgpu 0000:07:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 1 on hub 8
[ 1903.059563] amdgpu 0000:07:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 4 on hub 8
[ 1903.059564] amdgpu 0000:07:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 8
[ 1903.093009] amdgpu 0000:07:00.0: amdgpu: recover vram bo from shadow start
[ 1903.093498] amdgpu 0000:07:00.0: amdgpu: recover vram bo from shadow done
[ 1903.093510] amdgpu 0000:07:00.0: amdgpu: GPU reset(1) succeeded!
@daniandtheweb Thanks for your hint! I captured the output, but the file is 4.4GB so here are the last 500 lines: miopen_shortened.txt
I get exactly the same output.
Another quick thing still to try would be to disable the buffer on data transfer by changing the CK_BUFFER_RESOURCE_3RD_DWORD value from 0x31014000 to -1 for gfx1010.
So now, between lines 32-43, ./src/composable_kernel/composable_kernel/include/utility/config.hpp would look like the following:
// TODO: gfx1010 check CK_BUFFER_RESOURCE_3RD_DWORD
// buffer resourse
#if defined(CK_AMD_GPU_GFX803) || defined(CK_AMD_GPU_GFX900) || defined(CK_AMD_GPU_GFX906) || \
defined(CK_AMD_GPU_GFX941) || defined(CK_AMD_GPU_GFX942) || defined(CK_AMD_GPU_GFX940) || \
defined(CK_AMD_GPU_GFX908) || defined(CK_AMD_GPU_GFX90A)
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x00020000
#elif defined(CK_AMD_GPU_GFX1030) || defined(CK_AMD_GPU_GFX1031) || defined(CK_AMD_GPU_GFX1035) || defined(CK_AMD_GPU_GFX1100) || \
defined(CK_AMD_GPU_GFX1101) || defined(CK_AMD_GPU_GFX1102)
#define CK_BUFFER_RESOURCE_3RD_DWORD 0x31014000
#elif defined(CK_AMD_GPU_GFX1010)
#define CK_BUFFER_RESOURCE_3RD_DWORD -1
#endif
I will check other things to see if I can find some other reason and fix for why the naive_conv_fwd_nchw kernel crashes the Linux kernel. It may be related to the size of the data/problem that is transferred to the GPU. In your logs there was global_work_dim = { 393216, 1, 1 }, and that's bigger than for the other tasks that ran successfully before it.
The benchmark still fails on the first squeezenet test after the change.
One way to reduce the memory usage is to run the tests with a smaller batch size. So you could try to reduce the batch size from the default 12 to, for example, 4 in the test.sh script by changing the launch command to the following:
python3 benchmark_models.py -b 4 -g $c&& &>/dev/null
Fails even faster using a lower batch size.
Later today I will prepare a patch which will add more debugging to kernel loading, running, etc.
I am adding more debug/tracing tools to the build. If you have a chance, can you test whether you can build them? (I have only tested with Fedora 40 so far, and the updated install_deps.sh probably still misses something.) If you otherwise have an up-to-date build from master, then the following commands should be enough:
git pull
git checkout wip/rocm_sdk_builder_612_bg106
./babs.sh -i
./babs.sh -b
After the build, the nvtop app should show the memory consumption and GPU utilization in another terminal window while you run, for example, the pytorch-gpu-benchmark.
Then, for collecting memory usage data with amd-smi, the following should work:
amd-smi metric -m -g 0 --csv -w 2 -i 1000 --file out.txt
LibreOffice can then display the CSV file. If the results are saved as JSON instead, maybe Perfetto could also visualize them easily? https://cug.org/proceedings/cug2023_proceedings/includes/files/tut105s2-file1.pdf
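If you just want to skim the CSV in a terminal before opening it elsewhere, something like this should work (the exact column names depend on the amd-smi version):
column -s, -t < out.txt | less -S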
Here's the output while running the test: out.txt
I'll only be able to keep testing this GPU today, as I'm leaving for a few weeks and won't have access to it until the end of August.
With nvtop installed, are you able to check how much memory it shows the rx 6700/6600 using before the crash?
I have now tested with a 7700S, which also has 8 GB of memory, and at the very end of the test it runs out of memory. So at least one thing to do for the pytorch_gpu_test is to specify in more detail which tests to run for certain GPUs. But it seems that on the rx 6600 something more serious is going on.
Btw, have fun if you are leaving for holiday. Let's keep in touch. I will try to work on the vega patches at some point.
rocRAND has fixed an upstream git submodule bug that earlier forced me to use my own repo for building it. It is now fixed on the latest master and the latest wip/rocm_sdk_612_bg103 branches, but to get the repo updated you need to do the following so that the repo is re-downloaded from the upstream location:
git checkout master
git pull
rm -rf src_projects/rocRAND
./babs.sh -i
I just checked the out.txt you sent: if the crash happened at the end, then it had definitely not yet run out of memory. When the tests started there was 1 GB of memory used and 7 GB free, and at the maximum there were 5 GB used and 3 GB free.
Btw, have fun if you are leaving for holiday. Let's keep in touch. I will try to work on the vega patches at some point.
Sorry for not answering; I totally disconnected for a while and lost track of the messages. Thanks, btw.
@lamikr I've recently rerun the benchmark with a clean build and the crash still happens. However, I also managed to reproduce a similar crash during image generation using Vulkan in stable-diffusion.cpp while trying to use as much VRAM as possible. I'll try to investigate this a bit more, as with the new GTT policy in the kernel the system should be able to use GTT as backup memory for the GPU (or at least that's what it does on my laptop), so I'm not entirely sure why saturating the VRAM still causes the crash on my desktop.
I originally saw crashes on gfx1011 similar to the ones you see on gfx1010, and I have now pushed quite a lot of updates. I also reduced the number of tests that are run on memory-constrained devices in pytorch_gpu_benchmarks.
Are you able to test with the latest version of rocm_sdk_612 and with the latest version of the benchmark? If the benchmarks run ok, the results should be in the new_results folder.
It would also be very interesting to know if the latest linux-6.12-rc5 kernel brings some improvements.
My latest tests did not crash on gfx1011, but the results were slower than what I saw earlier in the copy-pasted screenshot from gfx1010.
I am using Ubuntu 22.04 with an AMD RX 5700 graphics card (gfx1010), with the driver installed via amdgpu-install from the repo.radeon.com repository for version 6.1.3 (amdgpu-install --usecase=graphics). In the babs.sh -i step I selected the gfx1010 target and used no HSA_OVERRIDE_GFX_VERSION. After a few tries and executing sudo apt install libstdc++-12-dev libgfortran-12-dev gfortran-12, the whole project compiled in about 16 hours (it probably took so long due to having 16 GB of RAM). The babs.sh -b command says it has been successful, and rocminfo outputs the following (output omitted), but the pytorch example exits almost immediately (output omitted).
The other examples mentioned in the README.md seem to work fine / don't crash. I don't exactly know what output to expect, though. I have tried the releases/rocm_sdk_builder_611 and releases/rocm_sdk_builder_612 branches without any luck so far. Unfortunately I have no idea whether that might be caused by a driver problem, a configuration problem, or something else. The README.md states that the RX 5700 has been tested, but there is no mention of a modified build/install procedure or a specific branch to use. I would appreciate any information on what could be causing this (I think maybe aotriton, but I know very little about rocm).