leejet / stable-diffusion.cpp

Stable Diffusion and Flux in pure C/C++
MIT License

Opening model slows down inference #273

Closed shreyvish5678 closed 4 months ago

shreyvish5678 commented 4 months ago

I'm running an 8-bit quantized version of SDXL Turbo on a 3060 Laptop GPU. The txt2img step itself takes around 2.5s, but opening the model takes ~25s. I want to generate multiple images with the same prompt, so I did the following:

import os
from tqdm import tqdm
for i in tqdm(range(16)):
    os.system(f"./bin/sd -m ../models/sd_xl_turbo_1.0.q8_0.gguf --vae ../models/sdxl_vae.safetensors -s -1 -p 'a cute cat' --cfg-scale 1.0 --steps 4 -o pics/output_{i}.png")

I noticed in the logs that the model was being reloaded on every iteration. Is there a way to load the model once and then generate multiple images sequentially? Logs:

0%|          | 0/16 [00:00<?, ?it/s]ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Laptop GPU, compute capability 8.6, VMM: yes
[INFO ] stable-diffusion.cpp:169  - loading model from '../models/sd_xl_turbo_1.0.q8_0.gguf'
[INFO ] model.cpp:732  - load ../models/sd_xl_turbo_1.0.q8_0.gguf using gguf format
WARNING: Behavior may be unexpected when allocating 0 bytes for ggml_malloc!
[INFO ] stable-diffusion.cpp:180  - loading vae from '../models/sdxl_vae.safetensors'
[INFO ] model.cpp:735  - load ../models/sdxl_vae.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:192  - Stable Diffusion XL 
[INFO ] stable-diffusion.cpp:198  - Stable Diffusion weight type: q8_0
[INFO ] stable-diffusion.cpp:404  - total params memory size = 3855.36MB (VRAM 3855.36MB, RAM 0.00MB): clip 835.53MB(VRAM), unet 2925.36MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:423  - loading model from '../models/sd_xl_turbo_1.0.q8_0.gguf' completed, taking 28.44s
[INFO ] stable-diffusion.cpp:440  - running in eps-prediction mode
[INFO ] stable-diffusion.cpp:556  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1585 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1698 - get_learned_condition completed, taking 368 ms
[INFO ] stable-diffusion.cpp:1716 - sampling using Euler A method
[INFO ] stable-diffusion.cpp:1720 - generating image: 1/1 - seed 1534224841
  |==================================================| 4/4 - 3.30it/s
[INFO ] stable-diffusion.cpp:1763 - sampling completed, taking 1.22s
[INFO ] stable-diffusion.cpp:1771 - generating 1 latent images completed, taking 1.26s
[INFO ] stable-diffusion.cpp:1774 - decoding 1 latents
  6%|▋         | 1/16 [00:33<08:16, 33.11s/it]
[INFO ] stable-diffusion.cpp:1784 - latent 1 decoded, taking 0.85s
[INFO ] stable-diffusion.cpp:1788 - decode_first_stage completed, taking 0.85s
[INFO ] stable-diffusion.cpp:1872 - txt2img completed in 2.48s
save result image to 'pics/output_0.png'
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Laptop GPU, compute capability 8.6, VMM: yes
[INFO ] stable-diffusion.cpp:169  - loading model from '../models/sd_xl_turbo_1.0.q8_0.gguf'
[INFO ] model.cpp:732  - load ../models/sd_xl_turbo_1.0.q8_0.gguf using gguf format
WARNING: Behavior may be unexpected when allocating 0 bytes for ggml_malloc!
[INFO ] stable-diffusion.cpp:180  - loading vae from '../models/sdxl_vae.safetensors'
[INFO ] model.cpp:735  - load ../models/sdxl_vae.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:192  - Stable Diffusion XL 
[INFO ] stable-diffusion.cpp:198  - Stable Diffusion weight type: q8_0
[INFO ] stable-diffusion.cpp:404  - total params memory size = 3855.36MB (VRAM 3855.36MB, RAM 0.00MB): clip 835.53MB(VRAM), unet 2925.36MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:423  - loading model from '../models/sd_xl_turbo_1.0.q8_0.gguf' completed, taking 28.01s
grauho commented 4 months ago

I think the "--batch-count" switch is what you're looking for.
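As a sketch of how the original loop could be adapted: instead of invoking the binary 16 times, pass --batch-count so a single process generates all images after one model load. build_sd_cmd below is a hypothetical helper (not part of sd.cpp); the model paths and flags are taken from the original command, and the exact output-file naming for batched runs may differ by version.

```python
import subprocess

def build_sd_cmd(model, vae, prompt, out, batch_count=16, steps=4, cfg=1.0):
    """Build an sd.cpp command line that generates batch_count images
    in one process, so the model is loaded from disk only once."""
    return [
        "./bin/sd", "-m", model, "--vae", vae,
        "-p", prompt, "--cfg-scale", str(cfg), "--steps", str(steps),
        "--batch-count", str(batch_count), "-o", out,
    ]

cmd = build_sd_cmd("../models/sd_xl_turbo_1.0.q8_0.gguf",
                   "../models/sdxl_vae.safetensors",
                   "a cute cat", "pics/output.png")
# subprocess.run(cmd, check=True)  # uncomment to actually run against a local build
```

This trades the per-invocation ~25s load for a single load amortized over the whole batch.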

Amin456789 commented 4 months ago

Use koboldcpp or one of the GUIs listed here; they keep the model in RAM, so it won't need to be reloaded every time.