leejet / stable-diffusion.cpp

Stable Diffusion and Flux in pure C/C++
MIT License

SYCL Intel UHD 770 segfault #405

Open recallmenot opened 2 months ago

recallmenot commented 2 months ago

Hi, while trying to use SYCL on the Core i5-12600K's iGPU (UHD 770), I've run into the following issue:

stable-diffusion.cpp/build/bin/sd --diffusion-model  stable-diffusion.cpp/models/flux1-schnell-q8_0.gguf --vae stable-diffusion.cpp/models/ae.safetensors --clip_l stable-diffusion.cpp/models/clip_l.safetensors --t5xxl stable-diffusion.cpp/models/t5xxl_fp16.safetensors --steps 4 --seed 0 -p "a very photorealistic, young girl with brown hair screaming angrily at the viewer, with chocolate stains on her fingertips and around her mouth. Her hands are closed into fists, she is stretching her arms downward and slightly behind her back. The viewer is looking at her in an angle of 45 degrees downward. shot on a 40mm lens with shallow depth of field focusing on her face. behind the girl, on the floor there are several chocolate packages, some opened and eaten, some not." --cfg-scale 1.0 --sampling-method euler -v
Option: 
    n_threads:         8
    mode:              txt2img
    model_path:        
    wtype:             unspecified
    clip_l_path:       stable-diffusion.cpp/models/clip_l.safetensors
    t5xxl_path:        stable-diffusion.cpp/models/t5xxl_fp16.safetensors
    diffusion_model_path:   stable-diffusion.cpp/models/flux1-schnell-q8_0.gguf
    vae_path:          stable-diffusion.cpp/models/ae.safetensors
    taesd_path:        
    esrgan_path:       
    controlnet_path:   
    embeddings_path:   
    stacked_id_embeddings_path:   
    input_id_images_path:   
    style ratio:       20.00
    normalize input image :  false
    output_path:       output.png
    init_img:          
    control_image:     
    clip on cpu:       false
    controlnet cpu:    false
    vae decoder on cpu:false
    strength(control): 0.90
    prompt:            a very photorealistic, young girl with brown hair screaming angrily at the viewer, with chocolate stains on her fingertips and around her mouth. Her hands are closed into fists, she is stretching her arms downward and slightly behind her back. The viewer is looking at her in an angle of 45 degrees downward. shot on a 40mm lens with shallow depth of field focusing on her face. behind the girl, on the floor there are several chocolate packages, some opened and eaten, some not.
    negative_prompt:   
    min_cfg:           1.00
    cfg_scale:         1.00
    guidance:          3.50
    clip_skip:         -1
    width:             512
    height:            512
    sample_method:     euler
    schedule:          default
    sample_steps:      4
    strength(img2img): 0.75
    rng:               cuda
    seed:              0
    batch_count:       1
    vae_tiling:        false
    upscale_repeats:   1
System Info: 
    BLAS = 1
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[DEBUG] stable-diffusion.cpp:175  - Using SYCL backend
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: yes
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                 Intel UHD Graphics 770|    1.3|     32|     512|   32| 78526M|            1.3.30049|
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
[INFO ] stable-diffusion.cpp:202  - loading clip_l from 'stable-diffusion.cpp/models/clip_l.safetensors'
[INFO ] model.cpp:793  - load stable-diffusion.cpp/models/clip_l.safetensors using safetensors format
[DEBUG] model.cpp:861  - init from 'stable-diffusion.cpp/models/clip_l.safetensors'
[INFO ] stable-diffusion.cpp:209  - loading t5xxl from 'stable-diffusion.cpp/models/t5xxl_fp16.safetensors'
[INFO ] model.cpp:793  - load stable-diffusion.cpp/models/t5xxl_fp16.safetensors using safetensors format
[DEBUG] model.cpp:861  - init from 'stable-diffusion.cpp/models/t5xxl_fp16.safetensors'
[INFO ] stable-diffusion.cpp:216  - loading diffusion model from 'stable-diffusion.cpp/models/flux1-schnell-q8_0.gguf'
[INFO ] model.cpp:790  - load stable-diffusion.cpp/models/flux1-schnell-q8_0.gguf using gguf format
[DEBUG] model.cpp:807  - init from 'stable-diffusion.cpp/models/flux1-schnell-q8_0.gguf'
WARNING: Behavior may be unexpected when allocating 0 bytes for ggml_calloc!
[INFO ] stable-diffusion.cpp:223  - loading vae from 'stable-diffusion.cpp/models/ae.safetensors'
[INFO ] model.cpp:793  - load stable-diffusion.cpp/models/ae.safetensors using safetensors format
[DEBUG] model.cpp:861  - init from 'stable-diffusion.cpp/models/ae.safetensors'
[INFO ] stable-diffusion.cpp:235  - Version: Flux Schnell 
[INFO ] stable-diffusion.cpp:266  - Weight type:                 f16
[INFO ] stable-diffusion.cpp:267  - Conditioner weight type:     f16
[INFO ] stable-diffusion.cpp:268  - Diffusion model weight type: q8_0
[INFO ] stable-diffusion.cpp:269  - VAE weight type:             f32
[DEBUG] stable-diffusion.cpp:271  - ggml tensor size = 400 bytes
[INFO ] stable-diffusion.cpp:310  - set clip_on_cpu to true
[INFO ] stable-diffusion.cpp:313  - CLIP: Using CPU backend
[DEBUG] clip.hpp:171  - vocab size: 49408
[DEBUG] clip.hpp:182  -  trigger word img already in vocab
[DEBUG] ggml_extend.hpp:1050 - clip params backend buffer size =  235.06 MB(RAM) (196 tensors)
[DEBUG] ggml_extend.hpp:1050 - t5 params backend buffer size =  9083.77 MB(RAM) (219 tensors)
[DEBUG] ggml_extend.hpp:1050 - flux params backend buffer size =  12057.71 MB(VRAM) (776 tensors)
[DEBUG] ggml_extend.hpp:1050 - vae params backend buffer size =  94.57 MB(VRAM) (138 tensors)
[DEBUG] stable-diffusion.cpp:398  - loading weights
[DEBUG] model.cpp:1530 - loading tensors from stable-diffusion.cpp/models/clip_l.safetensors
[DEBUG] model.cpp:1530 - loading tensors from stable-diffusion.cpp/models/t5xxl_fp16.safetensors
[INFO ] model.cpp:1685 - unknown tensor 'text_encoders.t5xxl.encoder.embed_tokens.weight | f16 | 2 [4096, 32128, 1, 1, 1]' in model file
[DEBUG] model.cpp:1530 - loading tensors from stable-diffusion.cpp/models/flux1-schnell-q8_0.gguf
[DEBUG] model.cpp:1530 - loading tensors from stable-diffusion.cpp/models/ae.safetensors
[INFO ] stable-diffusion.cpp:497  - total params memory size = 21471.11MB (VRAM 12152.28MB, RAM 9318.83MB): clip 9318.83MB(RAM), unet 12057.71MB(VRAM), vae 94.57MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:501  - loading model from '' completed, taking 13.45s
[INFO ] stable-diffusion.cpp:518  - running in Flux FLOW mode
[DEBUG] stable-diffusion.cpp:572  - finished loaded file
[DEBUG] stable-diffusion.cpp:1378 - txt2img 512x512
[DEBUG] stable-diffusion.cpp:1127 - prompt after extract and remove lora: "a very photorealistic, young girl with brown hair screaming angrily at the viewer, with chocolate stains on her fingertips and around her mouth. Her hands are closed into fists, she is stretching her arms downward and slightly behind her back. The viewer is looking at her in an angle of 45 degrees downward. shot on a 40mm lens with shallow depth of field focusing on her face. behind the girl, on the floor there are several chocolate packages, some opened and eaten, some not."
[INFO ] stable-diffusion.cpp:655  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1132 - apply_loras completed, taking 0.00s
[DEBUG] conditioner.hpp:1036 - parse 'a very photorealistic, young girl with brown hair screaming angrily at the viewer, with chocolate stains on her fingertips and around her mouth. Her hands are closed into fists, she is stretching her arms downward and slightly behind her back. The viewer is looking at her in an angle of 45 degrees downward. shot on a 40mm lens with shallow depth of field focusing on her face. behind the girl, on the floor there are several chocolate packages, some opened and eaten, some not.' to [['a very photorealistic, young girl with brown hair screaming angrily at the viewer, with chocolate stains on her fingertips and around her mouth. Her hands are closed into fists, she is stretching her arms downward and slightly behind her back. The viewer is looking at her in an angle of 45 degrees downward. shot on a 40mm lens with shallow depth of field focusing on her face. behind the girl, on the floor there are several chocolate packages, some opened and eaten, some not.', 1], ]
[DEBUG] clip.hpp:311  - token length: 154
[DEBUG] t5.hpp:397  - token length: 256
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 68.25 MiB
[DEBUG] ggml_extend.hpp:1001 - t5 compute buffer size: 68.25 MB(RAM)
[DEBUG] conditioner.hpp:1155 - computing condition graph completed, taking 9152 ms
[INFO ] stable-diffusion.cpp:1256 - get_learned_condition completed, taking 9155 ms
[INFO ] stable-diffusion.cpp:1279 - sampling using Euler method
[INFO ] stable-diffusion.cpp:1283 - generating image: 1/1 - seed 0
ggml_gallocr_reserve_n: reallocating SYCL0 buffer from size 0.00 MiB to 398.50 MiB
[DEBUG] ggml_extend.hpp:1001 - flux compute buffer size: 398.50 MB(VRAM)
AssertHandler::printMessage
Segmentation fault (core dumped)

OS: Manjaro. To install oneAPI, I first removed the existing oneAPI packages:

sudo pacman -Rdd intel-oneapi-common intel-oneapi-compiler-dpcpp-cpp-runtime intel-oneapi-compiler-dpcpp-cpp-runtime-libs intel-oneapi-compiler-shared intel-oneapi-compiler-shared-runtime intel-oneapi-compiler-shared-runtime-libs intel-oneapi-dev-utilities intel-oneapi-dpcpp-cpp intel-oneapi-dpcpp-debugger intel-oneapi-mkl intel-oneapi-mkl-sycl intel-oneapi-openmp intel-oneapi-tbb intel-oneapi-tcm

(with -Rdd, i.e. ignoring dependencies, because Blender depends on them)

and then installed the 2024.1 basekit package directly:

wget https://archlinux.thaller.ws/extra/os/x86_64/intel-oneapi-basekit-2024.1.0.596-3-x86_64.pkg.tar.zst
sudo pacman -U intel-oneapi-basekit-2024.1.0.596-3-x86_64.pkg.tar.zst

(found via https://archlinux.org/packages/extra/x86_64/intel-oneapi-basekit/) instead of a simple pacman -S --needed intel-oneapi-basekit, because the current repo package appears to be broken.

Then I installed oneDNN, oneTBB and the GPU driver (compute runtime) with sudo pacman -S --needed onednn onetbb intel-compute-runtime.

I've built it like this:

. /opt/intel/oneapi/setvars.sh
cmake .. -DSD_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_SYCL_F16=OFF
cmake --build . --config Release

Turning F16 on or off makes no difference; the result is the same:

While the log sits at the line

[DEBUG] ggml_extend.hpp:1001 - flux compute buffer size: 398.50 MB(VRAM)

intel_gpu_top (from the intel-gpu-tools package) shows the sd executable using the iGPU at 100% Render/3D for approx. 20 seconds. Then it proceeds to

AssertHandler::printMessage
Segmentation fault (core dumped)

The machine has 80 GB of RAM, and it did not run out.
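A rough tally of the buffer sizes reported in the log above is consistent with that. On an iGPU the "VRAM" buffers live in the same system memory (the device reports 78526M of global memory), so both columns draw on one pool; the figures below are copied from the log lines for params and compute buffers:

$$
\begin{aligned}
\text{VRAM} &= \underbrace{12057.71}_{\text{flux params}} + \underbrace{94.57}_{\text{vae params}} + \underbrace{398.50}_{\text{flux compute}} = 12550.78\ \text{MB} \\
\text{RAM} &= \underbrace{235.06}_{\text{clip params}} + \underbrace{9083.77}_{\text{t5 params}} + \underbrace{68.25}_{\text{t5 compute}} = 9387.08\ \text{MB} \\
\text{total} &= 12550.78 + 9387.08 = 21937.86\ \text{MB} < 78526\ \text{MB}
\end{aligned}
$$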

Green-Sky commented 2 months ago

Could you run it inside gdb and give us the stacktrace?

gdb --args stable-diffusion.cpp/build/bin/sd --diffusion-model  stable-diffusion.cpp/models/flux1-schnell-q8_0.gguf --vae stable-diffusion.cpp/models/ae.safetensors --clip_l stable-diffusion.cpp/models/clip_l.safetensors --t5xxl stable-diffusion.cpp/models/t5xxl_fp16.safetensors --steps 4 -p "a lovely cat" --cfg-scale 1.0 --sampling-method euler -v

then type r and press Enter.

Green-Sky commented 2 months ago

You might also try pulling a more up-to-date ggml. There are always advancements and fixes happening upstream.

recallmenot commented 2 months ago

Sadly, debuginfod did not retrieve the symbols for libze_intel_gpu.so.1:

Thread 1 "sd" received signal SIGSEGV, Segmentation fault.
0x00007fffba8b7240 in ?? () from /usr/lib/libze_intel_gpu.so.1
(gdb) bt
#0  0x00007fffba8b7240 in ?? () from /usr/lib/libze_intel_gpu.so.1
#1  0x00007fffba8b7f46 in ?? () from /usr/lib/libze_intel_gpu.so.1
#2  0x00007fffba86c3a2 in ?? () from /usr/lib/libze_intel_gpu.so.1
#3  0x00007fffba508af1 in ?? () from /usr/lib/libze_intel_gpu.so.1
#4  0x00007fffba50af69 in ?? () from /usr/lib/libze_intel_gpu.so.1
#5  0x00007fffba4e8962 in ?? () from /usr/lib/libze_intel_gpu.so.1
#6  0x00007ffff75930b6 in urQueueFinish () from /opt/intel/oneapi/compiler/2024.1/lib/libpi_level_zero.so
#7  0x00007ffff759c74b in piQueueFinish () from /opt/intel/oneapi/compiler/2024.1/lib/libpi_level_zero.so
#8  0x00007fffe47185b7 in _pi_result sycl::_V1::detail::plugin::call_nocheck<(sycl::_V1::detail::PiApiKind)23, _pi_queue*>(_pi_queue*) const ()
   from /opt/intel/oneapi/compiler/2024.1/lib/libsycl.so.7
#9  0x00007fffe48d043f in sycl::_V1::detail::queue_impl::wait(sycl::_V1::detail::code_location const&) () from /opt/intel/oneapi/compiler/2024.1/lib/libsycl.so.7
#10 0x000000000062747a in ggml_backend_sycl_synchronize(ggml_backend*) ()
#11 0x00000000005ab047 in ggml_backend_graph_compute ()
#12 0x00000000004b5295 in GGMLRunner::compute(std::function<ggml_cgraph* ()>, int, bool, ggml_tensor**, ggml_context*) ()
#13 0x00000000004f4cf5 in FluxModel::compute(int, ggml_tensor*, ggml_tensor*, ggml_tensor*, ggml_tensor*, ggml_tensor*, ggml_tensor*, int, std::vector<ggml_tensor*, std::allocator<ggml_tensor*> >, float, ggml_tensor**, ggml_context*) ()
#14 0x000000000051a171 in StableDiffusionGGML::sample(ggml_context*, ggml_tensor*, ggml_tensor*, SDCondition, SDCondition, ggml_tensor*, float, float, float, float, sample_method_t, std::vector<float, std::allocator<float> > const&, int, SDCondition)::{lambda(ggml_tensor*, float, int)#1}::operator()(ggml_tensor*, float, int) const ()
#15 0x000000000049d657 in StableDiffusionGGML::sample(ggml_context*, ggml_tensor*, ggml_tensor*, SDCondition, SDCondition, ggml_tensor*, float, float, float, float, sample_method_t, std::vector<float, std::allocator<float> > const&, int, SDCondition) ()
#16 0x000000000047c9f9 in generate_image(sd_ctx_t*, ggml_context*, ggml_tensor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, float, float, int, int, sample_method_t, std::vector<float, std::allocator<float> > const&, long, int, sd_image_t const*, float, float, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) ()
#17 0x000000000047e89f in txt2img ()
#18 0x00000000004212f6 in main ()

stable-diffusion.cpp was cloned with --recursive. Despite a fresh git pull of stable-diffusion.cpp, the ggml submodule's HEAD was detached at 21d3a30 (11 days old). After

cd ggml
git pull origin master

ggml was fast-forwarded to the current fbac47b.

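This resulted in build errors because GGML_MAX_N_THREADS (from ggml c584042) ended up undefined: in include/ggml.h its #define sits inside the #ifndef GGML_MAX_NAME block, which is skipped when the build already defines GGML_MAX_NAME. Moving #define GGML_MAX_N_THREADS 512 outside that #ifndef GGML_MAX_NAME block let me build again.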
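A minimal sketch of why a #define nested inside #ifndef GGML_MAX_NAME can vanish, and why moving it out fixes the build. It assumes, for illustration, that the build defines GGML_MAX_NAME itself (e.g. via -DGGML_MAX_NAME=128). The layout is a hypothetical reconstruction from the description above, not a copy of ggml.h; DEMO_NESTED_MAX_N_THREADS and the demo_* helpers are made-up names.

```c
#include <assert.h>

/* Assumption for illustration: the build already defines GGML_MAX_NAME,
   as a -DGGML_MAX_NAME=128 compiler flag would. */
#define GGML_MAX_NAME 128

/* Broken layout (hypothetical): the thread limit is nested inside the
   name guard, so the preprocessor skips it when GGML_MAX_NAME exists. */
#ifndef GGML_MAX_NAME
#define GGML_MAX_NAME 64
#define DEMO_NESTED_MAX_N_THREADS 512
#endif

/* Workaround from the thread: keep the name guard, but define the thread
   limit unconditionally, outside of it. */
#ifndef GGML_MAX_NAME
#define GGML_MAX_NAME 64
#endif
#define GGML_MAX_N_THREADS 512

/* Compile-time check that the limit is now visible to this file. */
static_assert(GGML_MAX_N_THREADS == 512, "thread limit must be defined");

/* Reports whether the nested (broken) define survived preprocessing. */
int demo_nested_define_visible(void) {
#ifdef DEMO_NESTED_MAX_N_THREADS
    return 1;
#else
    return 0;
#endif
}

/* The name limit keeps the build's value, not the header default. */
int demo_max_name(void) { return GGML_MAX_NAME; }

/* The unconditional define is always available. */
int demo_max_n_threads(void) { return GGML_MAX_N_THREADS; }
```

Keeping GGML_MAX_NAME behind its guard still lets projects override the name length, while the thread limit no longer depends on whether that override is present.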

Now gdb returns:

[INFO ] stable-diffusion.cpp:235  - Version: Flux Schnell 
[INFO ] stable-diffusion.cpp:266  - Weight type:                 tq1_0
[INFO ] stable-diffusion.cpp:267  - Conditioner weight type:     tq1_0
[INFO ] stable-diffusion.cpp:268  - Diffusion model weight type: tq1_0
[INFO ] stable-diffusion.cpp:269  - VAE weight type:             tq1_0
[DEBUG] stable-diffusion.cpp:271  - ggml tensor size = 400 bytes
[INFO ] stable-diffusion.cpp:310  - set clip_on_cpu to true
[INFO ] stable-diffusion.cpp:313  - CLIP: Using CPU backend
[DEBUG] clip.hpp:171  - vocab size: 49408
[DEBUG] clip.hpp:182  -  trigger word img already in vocab
sd: /home/recallmenot/stable-diffusion.cpp/ggml/src/ggml.c:3375: size_t ggml_row_size(enum ggml_type, int64_t): Assertion `ne % ggml_blck_size(type) == 0' failed.

Thread 1 "sd" received signal SIGABRT, Aborted.
Downloading source file /usr/src/debug/glibc/glibc/nptl/pthread_kill.c
__pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44                                               
44        return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0;
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007ffff7cf6463 in __pthread_kill_internal (threadid=<optimized out>, signo=6) at pthread_kill.c:78
#2  0x00007ffff7c9d120 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007ffff7c844c3 in __GI_abort () at abort.c:79
#4  0x00007ffff7c843df in __assert_fail_base (fmt=0x7ffff7e14c20 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
    assertion=assertion@entry=0x6cca97 "ne % ggml_blck_size(type) == 0", file=file@entry=0x6cc702 "/home/a/stable-diffusion.cpp/ggml/src/ggml.c", 
    line=line@entry=3375, function=function@entry=0x6ccab6 "size_t ggml_row_size(enum ggml_type, int64_t)") at assert.c:94
#5  0x00007ffff7c95177 in __assert_fail (assertion=0x6cca97 "ne % ggml_blck_size(type) == 0", file=0x6cc702 "/home/a/stable-diffusion.cpp/ggml/src/ggml.c", line=3375, 
    function=0x6ccab6 "size_t ggml_row_size(enum ggml_type, int64_t)") at assert.c:103
#6  0x000000000054fddd in ggml_new_tensor_impl ()
#7  0x00000000005502f0 in ggml_new_tensor_2d ()
#8  0x00000000004dbe04 in Embedding::init_params(ggml_context*, ggml_type) ()
#9  0x00000000004a375f in GGMLBlock::init(ggml_context*, ggml_type) ()
#10 0x00000000004a375f in GGMLBlock::init(ggml_context*, ggml_type) ()
#11 0x00000000004a375f in GGMLBlock::init(ggml_context*, ggml_type) ()
#12 0x00000000004a375f in GGMLBlock::init(ggml_context*, ggml_type) ()
#13 0x00000000004a375f in GGMLBlock::init(ggml_context*, ggml_type) ()
#14 0x00000000004f2e79 in FluxCLIPEmbedder::FluxCLIPEmbedder(ggml_backend*, ggml_type, int) ()
#15 0x000000000049836c in StableDiffusionGGML::load_from_file(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool, ggml_type, schedule_t, bool, bool, bool) ()
#16 0x000000000047b212 in new_sd_ctx ()
#17 0x0000000000420fe8 in main ()

So now it crashes much earlier, while the model is still being loaded.
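The failed assertion is ggml_row_size's check that a row of ne elements is a whole number of blocks of the tensor's type. A minimal sketch of that check, using a hypothetical demo_type table; the block sizes (1 element for f32/f16, 32 for q8_0, 256 for tq1_0) and bytes per block are assumed to match ggml's type traits and were not read from this build.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of weight types (not ggml's real enum values). */
typedef enum { DEMO_F32, DEMO_F16, DEMO_Q8_0, DEMO_TQ1_0 } demo_type;

/* Elements per block and bytes per block, indexed by demo_type. */
static const int64_t demo_blck_size[] = { 1, 1, 32, 256 };
static const size_t  demo_type_size[] = { 4, 2, 34, 54 };

/* Same rule as the failing assertion: ne must be a whole number of
   blocks. Returns 1 when a row of ne elements is valid for this type. */
int demo_row_ok(demo_type type, int64_t ne) {
    return ne % demo_blck_size[type] == 0;
}

/* Bytes needed for one row of ne elements; asserts like ggml_row_size. */
size_t demo_row_size(demo_type type, int64_t ne) {
    assert(demo_row_ok(type, ne));
    return demo_type_size[type] * (size_t)(ne / demo_blck_size[type]);
}
```

With a 256-element block type such as tq1_0, any row length that is not a multiple of 256 trips the check, even though an f16 tensor of the same shape would be fine. That is why a wrong weight type can abort during tensor creation, before any weights are read.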

Green-Sky commented 2 months ago
[INFO ] stable-diffusion.cpp:266  - Weight type:                 tq1_0
[INFO ] stable-diffusion.cpp:267  - Conditioner weight type:     tq1_0
[INFO ] stable-diffusion.cpp:268  - Diffusion model weight type: tq1_0
[INFO ] stable-diffusion.cpp:269  - VAE weight type:             tq1_0

I guess sd.cpp's type list is out of date, because those types are very wrong. That would also explain the tensor type-size mismatch assert.
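One plausible mechanism behind this guess, sketched with hypothetical enums rather than the real sd.cpp or ggml definitions: if a front end keeps its own copy of the backend's type list and passes its COUNT value to mean "unspecified", then new types added to the backend before its COUNT make that stale value land on a real type. The names old_type, new_type and demo_resolve_wtype are made up for illustration.

```c
#include <assert.h>

/* Hypothetical front-end copy of the type list, written before the
   backend gained new types. OLD_TYPE_COUNT doubles as "unspecified". */
typedef enum { OLD_F32, OLD_F16, OLD_Q8_0, OLD_TYPE_COUNT } old_type;

/* Hypothetical newer backend list: two types added before COUNT. */
typedef enum {
    NEW_F32, NEW_F16, NEW_Q8_0, NEW_TQ1_0, NEW_TQ2_0, NEW_TYPE_COUNT
} new_type;

/* The backend receives the front end's value as a raw integer and treats
   its own NEW_TYPE_COUNT as "keep the type stored in the model file". */
new_type demo_resolve_wtype(old_type requested, new_type file_type) {
    new_type t = (new_type)requested;
    if (t == NEW_TYPE_COUNT) {
        t = file_type;
    }
    /* The stale sentinel still names a real type, so this check passes. */
    assert(t < NEW_TYPE_COUNT);
    return t;
}
```

Here the front end's "unspecified" (value 3) is read by the newer backend as tq1_0 and forced onto every weight, which matches the symptom of all four weight types being reported as tq1_0.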

recallmenot commented 2 months ago

It seemed fine before updating ggml:

[INFO ] stable-diffusion.cpp:235  - Version: Flux Schnell 
[INFO ] stable-diffusion.cpp:266  - Weight type:                 f16
[INFO ] stable-diffusion.cpp:267  - Conditioner weight type:     f16
[INFO ] stable-diffusion.cpp:268  - Diffusion model weight type: q8_0
[INFO ] stable-diffusion.cpp:269  - VAE weight type:             f32
[DEBUG] stable-diffusion.cpp:271  - ggml tensor size = 400 bytes
Green-Sky commented 2 months ago

Yeah, updating ggml might be more work than I thought.

Instead, can you try applying some patches manually, and see what happens? Like this one: https://github.com/ggerganov/llama.cpp/pull/9346

Green-Sky commented 2 months ago

It would also help if you could build in Debug mode, so we have line numbers :)

recallmenot commented 2 months ago

So, after

cd ggml
git checkout 21d3a30

I applied llama.cpp 2a358fb and rebuilt with debugging enabled.

Option: 
    n_threads:         8
    mode:              txt2img
    model_path:        
    wtype:             unspecified
    clip_l_path:       stable-diffusion.cpp/models/clip_l.safetensors
    t5xxl_path:        stable-diffusion.cpp/models/t5xxl_fp16.safetensors
    diffusion_model_path:   stable-diffusion.cpp/models/flux1-schnell-q8_0.gguf
    vae_path:          stable-diffusion.cpp/models/ae.safetensors
    taesd_path:        
    esrgan_path:       
    controlnet_path:   
    embeddings_path:   
    stacked_id_embeddings_path:   
    input_id_images_path:   
    style ratio:       20.00
    normalize input image :  false
    output_path:       output.png
    init_img:          
    control_image:     
    clip on cpu:       false
    controlnet cpu:    false
    vae decoder on cpu:false
    strength(control): 0.90
    prompt:            a lovely cat
    negative_prompt:   
    min_cfg:           1.00
    cfg_scale:         1.00
    guidance:          3.50
    clip_skip:         -1
    width:             512
    height:            512
    sample_method:     euler
    schedule:          default
    sample_steps:      4
    strength(img2img): 0.75
    rng:               cuda
    seed:              42
    batch_count:       1
    vae_tiling:        false
    upscale_repeats:   1
System Info: 
    BLAS = 1
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[DEBUG] stable-diffusion.cpp:175  - Using SYCL backend
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
Downloading separate debug info for /usr/lib/libze_loader.so.1
Downloading separate debug info for /usr/lib/libfmt.so.10                                                                                                             
Downloading separate debug info for /opt/intel/oneapi/compiler/2024.1/lib/libonnxruntime.1.12.22.721.so                                                               
Downloading separate debug info for /usr/lib/intel-opencl/libigdrcl.so                                                                                                
Downloading separate debug info for /usr/lib/libigdgmm.so.12                                                                                                          
[New Thread 0x7fffcda006c0 (LWP 207031)]                                                                                                                              
Downloading separate debug info for /usr/lib/libigdfcl.so.1
Downloading separate debug info for /usr/lib/libopencl-clang.so.14                                                                                                    
Downloading separate debug info for /usr/lib/libigc.so.1                                                                                                              
Downloading separate debug info for /usr/lib/libnvidia-opencl.so.1                                                                                                    
Downloading separate debug info for /usr/lib/libze_intel_gpu.so.1                                                                                                     
Downloading separate debug info for /usr/lib/libze_tracing_layer.so.1                                                                                                 
[New Thread 0x7fffcd0006c0 (LWP 207033)]                                                                                                                              
[New Thread 0x7fffba2006c0 (LWP 207034)]
[Thread 0x7fffba2006c0 (LWP 207034) exited]
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                 Intel UHD Graphics 770|    1.3|     32|     512|   32| 78526M|            1.3.30049|
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
[INFO ] stable-diffusion.cpp:202  - loading clip_l from 'stable-diffusion.cpp/models/clip_l.safetensors'
[INFO ] model.cpp:793  - load stable-diffusion.cpp/models/clip_l.safetensors using safetensors format
[DEBUG] model.cpp:861  - init from 'stable-diffusion.cpp/models/clip_l.safetensors'
[INFO ] stable-diffusion.cpp:209  - loading t5xxl from 'stable-diffusion.cpp/models/t5xxl_fp16.safetensors'
[INFO ] model.cpp:793  - load stable-diffusion.cpp/models/t5xxl_fp16.safetensors using safetensors format
[DEBUG] model.cpp:861  - init from 'stable-diffusion.cpp/models/t5xxl_fp16.safetensors'
[INFO ] stable-diffusion.cpp:216  - loading diffusion model from 'stable-diffusion.cpp/models/flux1-schnell-q8_0.gguf'
[INFO ] model.cpp:790  - load stable-diffusion.cpp/models/flux1-schnell-q8_0.gguf using gguf format
[DEBUG] model.cpp:807  - init from 'stable-diffusion.cpp/models/flux1-schnell-q8_0.gguf'
WARNING: Behavior may be unexpected when allocating 0 bytes for ggml_calloc!
[INFO ] stable-diffusion.cpp:223  - loading vae from 'stable-diffusion.cpp/models/ae.safetensors'
[INFO ] model.cpp:793  - load stable-diffusion.cpp/models/ae.safetensors using safetensors format
[DEBUG] model.cpp:861  - init from 'stable-diffusion.cpp/models/ae.safetensors'
[INFO ] stable-diffusion.cpp:235  - Version: Flux Schnell 
[INFO ] stable-diffusion.cpp:266  - Weight type:                 f16
[INFO ] stable-diffusion.cpp:267  - Conditioner weight type:     f16
[INFO ] stable-diffusion.cpp:268  - Diffusion model weight type: q8_0
[INFO ] stable-diffusion.cpp:269  - VAE weight type:             f32
[DEBUG] stable-diffusion.cpp:271  - ggml tensor size = 400 bytes
[INFO ] stable-diffusion.cpp:310  - set clip_on_cpu to true
[INFO ] stable-diffusion.cpp:313  - CLIP: Using CPU backend
[DEBUG] clip.hpp:171  - vocab size: 49408
[DEBUG] clip.hpp:182  -  trigger word img already in vocab
[DEBUG] ggml_extend.hpp:1050 - clip params backend buffer size =  235.06 MB(RAM) (196 tensors)
[DEBUG] ggml_extend.hpp:1050 - t5 params backend buffer size =  9083.77 MB(RAM) (219 tensors)
[DEBUG] ggml_extend.hpp:1050 - flux params backend buffer size =  12057.71 MB(VRAM) (776 tensors)
[DEBUG] ggml_extend.hpp:1050 - vae params backend buffer size =  94.57 MB(VRAM) (138 tensors)
[DEBUG] stable-diffusion.cpp:398  - loading weights
[DEBUG] model.cpp:1530 - loading tensors from stable-diffusion.cpp/models/clip_l.safetensors
[DEBUG] model.cpp:1530 - loading tensors from stable-diffusion.cpp/models/t5xxl_fp16.safetensors
[INFO ] model.cpp:1685 - unknown tensor 'text_encoders.t5xxl.encoder.embed_tokens.weight | f16 | 2 [4096, 32128, 1, 1, 1]' in model file
[DEBUG] model.cpp:1530 - loading tensors from stable-diffusion.cpp/models/flux1-schnell-q8_0.gguf
[DEBUG] model.cpp:1530 - loading tensors from stable-diffusion.cpp/models/ae.safetensors
[INFO ] stable-diffusion.cpp:497  - total params memory size = 21471.11MB (VRAM 12152.28MB, RAM 9318.83MB): clip 9318.83MB(RAM), unet 12057.71MB(VRAM), vae 94.57MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:501  - loading model from '' completed, taking 14.95s
[INFO ] stable-diffusion.cpp:518  - running in Flux FLOW mode
[DEBUG] stable-diffusion.cpp:572  - finished loaded file
[DEBUG] stable-diffusion.cpp:1378 - txt2img 512x512
[DEBUG] stable-diffusion.cpp:1127 - prompt after extract and remove lora: "a lovely cat"
[INFO ] stable-diffusion.cpp:655  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1132 - apply_loras completed, taking 0.00s
[DEBUG] conditioner.hpp:1036 - parse 'a lovely cat' to [['a lovely cat', 1], ]
[DEBUG] clip.hpp:311  - token length: 77
[DEBUG] t5.hpp:397  - token length: 256
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 68.25 MiB
[DEBUG] ggml_extend.hpp:1001 - t5 compute buffer size: 68.25 MB(RAM)
[New Thread 0x7fffb9800740 (LWP 207066)]
[New Thread 0x7fffb8e007c0 (LWP 207067)]
[New Thread 0x7fffb1200840 (LWP 207068)]
[New Thread 0x7ffd610008c0 (LWP 207069)]
[New Thread 0x7ffd5be00940 (LWP 207070)]
[New Thread 0x7ffd5b4009c0 (LWP 207071)]
[New Thread 0x7ffd5aa00a40 (LWP 207072)]
[DEBUG] conditioner.hpp:1155 - computing condition graph completed, taking 9135 ms
[INFO ] stable-diffusion.cpp:1256 - get_learned_condition completed, taking 9138 ms
[INFO ] stable-diffusion.cpp:1279 - sampling using Euler method
[INFO ] stable-diffusion.cpp:1283 - generating image: 1/1 - seed 42
ggml_gallocr_reserve_n: reallocating SYCL0 buffer from size 0.00 MiB to 398.50 MiB
[DEBUG] ggml_extend.hpp:1001 - flux compute buffer size: 398.50 MB(VRAM)
AssertHandler::printMessage

Thread 1 "sd" received signal SIGSEGV, Segmentation fault.
0x00007fffba8b7240 in ?? () from /usr/lib/libze_intel_gpu.so.1
(gdb) bt
#0  0x00007fffba8b7240 in ?? () from /usr/lib/libze_intel_gpu.so.1
#1  0x00007fffba8b7f46 in ?? () from /usr/lib/libze_intel_gpu.so.1
#2  0x00007fffba86c3a2 in ?? () from /usr/lib/libze_intel_gpu.so.1
#3  0x00007fffba508af1 in ?? () from /usr/lib/libze_intel_gpu.so.1
#4  0x00007fffba50af69 in ?? () from /usr/lib/libze_intel_gpu.so.1
#5  0x00007fffba4e8962 in ?? () from /usr/lib/libze_intel_gpu.so.1
#6  0x00007ffff75930b6 in urQueueFinish () from /opt/intel/oneapi/compiler/2024.1/lib/libpi_level_zero.so
#7  0x00007ffff759c74b in piQueueFinish () from /opt/intel/oneapi/compiler/2024.1/lib/libpi_level_zero.so
#8  0x00007fffe47185b7 in _pi_result sycl::_V1::detail::plugin::call_nocheck<(sycl::_V1::detail::PiApiKind)23, _pi_queue*>(_pi_queue*) const ()
   from /opt/intel/oneapi/compiler/2024.1/lib/libsycl.so.7
#9  0x00007fffe48d043f in sycl::_V1::detail::queue_impl::wait(sycl::_V1::detail::code_location const&) () from /opt/intel/oneapi/compiler/2024.1/lib/libsycl.so.7
#10 0x000000000062751a in ggml_backend_sycl_synchronize(ggml_backend*) ()
#11 0x00000000005ab047 in ggml_backend_graph_compute ()
#12 0x00000000004b5295 in GGMLRunner::compute(std::function<ggml_cgraph* ()>, int, bool, ggml_tensor**, ggml_context*) ()
#13 0x00000000004f4cf5 in FluxModel::compute(int, ggml_tensor*, ggml_tensor*, ggml_tensor*, ggml_tensor*, ggml_tensor*, ggml_tensor*, int, std::vector<ggml_tensor*, std::allocator<ggml_tensor*> >, float, ggml_tensor**, ggml_context*) ()
#14 0x000000000051a171 in StableDiffusionGGML::sample(ggml_context*, ggml_tensor*, ggml_tensor*, SDCondition, SDCondition, ggml_tensor*, float, float, float, float, sample_method_t, std::vector<float, std::allocator<float> > const&, int, SDCondition)::{lambda(ggml_tensor*, float, int)#1}::operator()(ggml_tensor*, float, int) const ()
#15 0x000000000049d657 in StableDiffusionGGML::sample(ggml_context*, ggml_tensor*, ggml_tensor*, SDCondition, SDCondition, ggml_tensor*, float, float, float, float, sample_method_t, std::vector<float, std::allocator<float> > const&, int, SDCondition) ()
#16 0x000000000047c9f9 in generate_image(sd_ctx_t*, ggml_context*, ggml_tensor*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, float, float, int, int, sample_method_t, std::vector<float, std::allocator<float> > const&, long, int, sd_image_t const*, float, float, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) ()
#17 0x000000000047e89f in txt2img ()
#18 0x00000000004212f6 in main ()

The behavior is the same as initially.