leejet / stable-diffusion.cpp

Stable Diffusion and Flux in pure C/C++

Refactor process flow + Fix img2vid mode + Provide abilities to img2img mode #210

Closed: Windsander closed this 5 months ago

Windsander commented 6 months ago

What this commit does

-- New features:

  1. img2vid is now available in sd.cpp
  2. RealESRGAN_x4plus_anime_6B upscaling is supported for the output of each mode
  3. Image + text prompts (positive/negative) are now supported in each mode

-- Core abilities added to the sd.cpp img2img mode:

  1. Provide LoRA support
  2. Provide ControlNet support
  3. Provide PhotoMaker embedding support

-- Refactor the sd_ctx_t lifecycle and ggml_context maintenance:

  1. Make ggml_context retainable for continuous use (see the usage sketch after this list)
  2. Reconfigure ggml_context only when necessary, e.g. when the prompt is recreated
  3. Divide the SD core models workflow from the prompt preprocessing & embedding flow
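As a minimal usage sketch of the lifecycle this list describes, assuming simplified stand-in types rather than the real sd.cpp API (whose entry points take many more parameters):

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-ins for the real sd.cpp types/functions, just to
// illustrate the retained-context lifecycle described above.
struct sd_ctx_t {
    bool        ggml_ready  = false;  // retained ggml_context state
    std::string last_prompt;          // reconfigure only when this changes
};

sd_ctx_t* new_sd_ctx() { return new sd_ctx_t(); }
void      free_sd_ctx(sd_ctx_t* ctx) { delete ctx; }

void generate(sd_ctx_t* ctx, const std::string& prompt) {
    if (!ctx->ggml_ready || prompt != ctx->last_prompt) {
        // Heavy path: (re)build graphs and re-embed the prompt.
        ctx->ggml_ready  = true;
        ctx->last_prompt = prompt;
        std::puts("reconfigured ggml_context");
    } else {
        std::puts("reused retained ggml_context");
    }
    // ... run the retained compute graph ...
}

int main() {
    sd_ctx_t* ctx = new_sd_ctx();
    generate(ctx, "portrait of girl");     // first call: configure
    generate(ctx, "portrait of girl");     // same prompt: reuse
    generate(ctx, "at night with stars");  // prompt recreated: reconfigure
    free_sd_ctx(ctx);
}
```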

What SVD-XT (img2vid) supports

  1. Prompt + negative prompt
  2. Embedding with an init image
  3. Using the init image as the starting spatial info, familiar as an image prompt
  4. LoRA support
  5. PhotoMaker support

Tested with the 8 cases below (parameters only):

img2img

 #test case: img2img<sdxl,sdxl-vae>
 -p " best quality, extremely detailed, only main character, keep background scene, portrait of girl " -n " every characters, change background scene, worst quality, low quality, normal quality, lowres, watermark, monochrome, grayscale, ugly, blurry, Tan skin, dark skin, black skin, skin spots, skin blemishes, age spot, glans, disabled, bad anatomy, amputation, bad proportions, twins, missing body, fused body, extra head, poorly drawn face, bad eyes, deformed eye, unclear eyes, cross-eyed, long neck, malformed limbs, extra limbs, extra arms, missing arms, bad tongue, strange fingers, mutated hands, missing hands, poorly drawn hands, extra hands, fused hands, connected hand, bad hands, missing fingers, extra fingers, 4 fingers, 3 fingers, deformed hands, extra legs, bad legs, many legs, more than two legs, bad feet, extra feet " --model ../../../sd-models/sd_xl_turbo_1.0_fp16.safetensors --vae ../../../sd-models/sdxl_vae.safetensors --sampling-method lcm --strength 0.65 --style-ratio 10 --steps 3 --cfg-scale 1.0 --seed 15.0 -M img2img -H 1024 -W 1024 -i ../../../io-test/input.png -o ../../../io-test/output.png -v

 #test case: img2img<sdxl,sdxl-vae,lcm-lora>
 -p " best quality, extremely detailed, keep main character<lora:lcm-lora-sdxl:1>, at night with stars " -n " worst quality, low quality, normal quality, lowres, watermark, monochrome, grayscale, ugly, blurry, Tan skin, dark skin, black skin, skin spots, skin blemishes, age spot, glans, disabled, bad anatomy, amputation, bad proportions, twins, missing body, fused body, extra head, poorly drawn face, bad eyes, deformed eye, unclear eyes, cross-eyed, long neck, malformed limbs, extra limbs, extra arms, missing arms, bad tongue, strange fingers, mutated hands, missing hands, poorly drawn hands, extra hands, fused hands, connected hand, bad hands, missing fingers, extra fingers, 4 fingers, 3 fingers, deformed hands, extra legs, bad legs, many legs, more than two legs, bad feet, extra feet " --lora-model-dir ../../../sd-control-lora/control-LoRAs-lcm --model ../../../sd-models/sd_xl_turbo_1.0_fp16.safetensors --vae ../../../sd-models/sdxl_vae.safetensors --sampling-method lcm --strength 0.65 --style-ratio 10 --steps 3 --cfg-scale 1.0 --seed 15.0 -M img2img -H 1024 -W 1024 -i ../../../io-test/input.png -o ../../../io-test/output.png -v

 #test case: img2img<sdv1.5, sdv1.5-vae, lcm-lora, control_net>
 -p " best quality, extremely detailed, keep main character<lora:lcm-lora-sdv1-5:1>, at night with stars " -n " worst quality, low quality, normal quality, lowres, watermark, monochrome, grayscale, ugly, blurry, Tan skin, dark skin, black skin, skin spots, skin blemishes, age spot, glans, disabled, bad anatomy, amputation, bad proportions, twins, missing body, fused body, extra head, poorly drawn face, bad eyes, deformed eye, unclear eyes, cross-eyed, long neck, malformed limbs, extra limbs, extra arms, missing arms, bad tongue, strange fingers, mutated hands, missing hands, poorly drawn hands, extra hands, fused hands, connected hand, bad hands, missing fingers, extra fingers, 4 fingers, 3 fingers, deformed hands, extra legs, bad legs, many legs, more than two legs, bad feet, extra feet " --control-net ../../../sd-control-net/sdv1_5/control_v11p_sd15_canny_2.safetensors --lora-model-dir ../../../sd-control-lora/control-LoRAs-lcm --model ../../../sd-models/v1-5-pruned-emaonly.safetensors --vae ../../../sd-models/diffusion_pytorch_model.fp16.safetensors --sampling-method lcm --strength 0.65 --style-ratio 10 --steps 3 --cfg-scale 1.0 --seed 15.0 -M img2img -H 1024 -W 1024 --control-image ../../../io-test/input_canny_mask.png -i ../../../io-test/input.png -o ../../../io-test/output.png -v

txt2img

 #test case: txt2img<sdxl,sdxl-vae>
 -p " best quality, extremely detailed, only main character, keep background scene, portrait of girl " -n " every characters, change background scene, worst quality, low quality, normal quality, lowres, watermark, monochrome, grayscale, ugly, blurry, Tan skin, dark skin, black skin, skin spots, skin blemishes, age spot, glans, disabled, bad anatomy, amputation, bad proportions, twins, missing body, fused body, extra head, poorly drawn face, bad eyes, deformed eye, unclear eyes, cross-eyed, long neck, malformed limbs, extra limbs, extra arms, missing arms, bad tongue, strange fingers, mutated hands, missing hands, poorly drawn hands, extra hands, fused hands, connected hand, bad hands, missing fingers, extra fingers, 4 fingers, 3 fingers, deformed hands, extra legs, bad legs, many legs, more than two legs, bad feet, extra feet " --model ../../../sd-models/sd_xl_turbo_1.0_fp16.safetensors --vae ../../../sd-models/sdxl_vae.safetensors --sampling-method lcm --style-ratio 10 --steps 20 --cfg-scale 1.0 --seed 15.0 -M txt2img -H 1024 -W 1024 -o ../../../io-test/output.png -v

 #test case: txt2img<sdxl,sdxl-vae,lcm-lora>
 -p " best quality, extremely detailed, keep main character<lora:lcm-lora-sdxl:1>, at night with stars " -n " worst quality, low quality, normal quality, lowres, watermark, monochrome, grayscale, ugly, blurry, Tan skin, dark skin, black skin, skin spots, skin blemishes, age spot, glans, disabled, bad anatomy, amputation, bad proportions, twins, missing body, fused body, extra head, poorly drawn face, bad eyes, deformed eye, unclear eyes, cross-eyed, long neck, malformed limbs, extra limbs, extra arms, missing arms, bad tongue, strange fingers, mutated hands, missing hands, poorly drawn hands, extra hands, fused hands, connected hand, bad hands, missing fingers, extra fingers, 4 fingers, 3 fingers, deformed hands, extra legs, bad legs, many legs, more than two legs, bad feet, extra feet " --lora-model-dir ../../../sd-control-lora/control-LoRAs-lcm --model ../../../sd-models/sd_xl_turbo_1.0_fp16.safetensors --vae ../../../sd-models/sdxl_vae.safetensors --sampling-method lcm --strength 0.65 --style-ratio 10 --steps 40 --cfg-scale 1.0 --seed 15.0 -M txt2img -H 1024 -W 1024 -o ../../../io-test/output.png -v

 #test case: txt2img<sdv1.5, sdv1.5-vae, lcm-lora, control_net(canny)>
-p " best quality, extremely detailed, keep main character<lora:lcm-lora-sdv1-5:1>, at night with stars " -n " worst quality, low quality, normal quality, lowres, watermark, monochrome, grayscale, ugly, blurry, Tan skin, dark skin, black skin, skin spots, skin blemishes, age spot, glans, disabled, bad anatomy, amputation, bad proportions, twins, missing body, fused body, extra head, poorly drawn face, bad eyes, deformed eye, unclear eyes, cross-eyed, long neck, malformed limbs, extra limbs, extra arms, missing arms, bad tongue, strange fingers, mutated hands, missing hands, poorly drawn hands, extra hands, fused hands, connected hand, bad hands, missing fingers, extra fingers, 4 fingers, 3 fingers, deformed hands, extra legs, bad legs, many legs, more than two legs, bad feet, extra feet " --type f16 --control-net ../../../sd-control-net/sdv1_5/control_v11p_sd15_canny_2.safetensors --lora-model-dir ../../../sd-control-lora/control-LoRAs-lcm --model ../../../sd-models/v1-5-pruned-emaonly.safetensors --vae ../../../sd-models/diffusion_pytorch_model.fp16.safetensors --sampling-method lcm --strength 0.65 --style-ratio 10 --steps 60 --cfg-scale 1.0 --seed 15.0 -M txt2img -H 1024 -W 1024 --control-image ../../../io-test/input_canny_mask.png -o ../../../io-test/output.png -v

img2vid

  #test case: img2vid<svd-xt, svd-xt-vae>
 -p " best quality, extremely detailed, keep main character, in outer space " -n " worst quality, low quality, normal quality, lowres, watermark, monochrome, grayscale, ugly, blurry, Tan skin, dark skin, black skin, skin spots, skin blemishes, age spot, glans, disabled, bad anatomy, amputation, bad proportions, twins, missing body, fused body, extra head, poorly drawn face, bad eyes, deformed eye, unclear eyes, cross-eyed, long neck, malformed limbs, extra limbs, extra arms, missing arms, bad tongue, strange fingers, mutated hands, missing hands, poorly drawn hands, extra hands, fused hands, connected hand, bad hands, missing fingers, extra fingers, 4 fingers, 3 fingers, deformed hands, extra legs, bad legs, many legs, more than two legs, bad feet, extra feet " --model ../../../sd-video/svd_xt.safetensors --vae ../../../sd-video/vae/diffusion_pytorch_model.fp16.safetensors --sampling-method lcm --strength 0.65 --style-ratio 10 --steps 3 --cfg-scale 1.0 --seed 15.0 -M img2vid -H 1024 -W 1024 --video-total-frames 14 --video-fps 14 --motion-bucket-id 42 -i ../../../io-test/input.png -o ../../../io-test/output.png -v

  #test case: img2vid<svd-xt, svd-xt-vae>(no negative-prompt)
 -p " best quality, extremely detailed, keep main character, in outer space " --model ../../../sd-video/svd_xt.safetensors --vae ../../../sd-video/vae/diffusion_pytorch_model.fp16.safetensors --sampling-method lcm --strength 0.65 --style-ratio 10 --steps 3 --cfg-scale 1.0 --seed 15.0 -M img2vid -H 1024 -W 1024 --video-total-frames 14 --video-fps 8 --motion-bucket-id 42 -i ../../../io-test/input.png -o ../../../io-test/output.png -v

Test model version: [image]

Green-Sky commented 6 months ago

The description sounds like this should be split into 2 separate PRs. (nice work on improving the ctx lifecycle)

Windsander commented 6 months ago

The description sounds like this should be split into 2 separate PRs. (nice work on improving the ctx lifecycle)

yep~ trying to make a nice cheese~ :p I've been trying to divide the configurations. emmm, maybe it'll be helpful (I hope :p)~

Currently I'm working on fixing the VID part of this project; hopefully I can commit it in time, haha~

leejet commented 6 months ago

Thank you for your contribution. However, I prefer to decouple the generation parameters from sd_ctx_t, rather than putting them inside sd_ctx_t.

leejet commented 6 months ago

I think it's better to keep the API in its original form; some of the logic shared by txt2img/img2img can be extracted into separate functions and called from txt2img/img2img.
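A rough sketch of that shape, with heavily simplified signatures; the helper names (encode_prompt, run_sampler) are illustrative, not the actual sd.cpp internals:

```cpp
#include <vector>

struct sd_ctx_t { /* opaque model state, as in the public header */ };
struct sd_image_t { int width = 0, height = 0; std::vector<unsigned char> data; };

// Shared internal helpers (illustrative stubs).
static void encode_prompt(sd_ctx_t&, const char* /*prompt*/, const char* /*negative*/) {}
static sd_image_t run_sampler(sd_ctx_t&, int /*steps*/, float /*cfg_scale*/,
                              const sd_image_t* /*init, nullptr for txt2img*/) {
    return {};
}

// The public entry points keep their original, per-call parameters;
// the shared logic lives in the helpers and is called from both.
sd_image_t txt2img(sd_ctx_t& ctx, const char* prompt, const char* negative,
                   int steps, float cfg_scale) {
    encode_prompt(ctx, prompt, negative);
    return run_sampler(ctx, steps, cfg_scale, nullptr);
}

sd_image_t img2img(sd_ctx_t& ctx, const sd_image_t& init, const char* prompt,
                   const char* negative, int steps, float cfg_scale) {
    encode_prompt(ctx, prompt, negative);
    return run_sampler(ctx, steps, cfg_scale, &init);
}

int main() {
    sd_ctx_t ctx;
    sd_image_t init;
    txt2img(ctx, "a prompt", "a negative", 20, 1.0f);
    img2img(ctx, init, "a prompt", "a negative", 20, 1.0f);
}
```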

Windsander commented 6 months ago

It seems txt2img/img2img can share almost the same params, but the svd-xt/svd video model is quite different.

I've fixed the svd-xt VAE and am moving on to the base model.

Windsander commented 6 months ago

XD https://github.com/leejet/stable-diffusion.cpp/pull/157/files#diff-a8e875e8f7684a2f9e35212d88ee9574001952eb2c049385e73d5e24fd6e2a83

ahh.. the last update is really nice, I've got some clues :p

DarthAffe commented 6 months ago

I think it's better to keep the API in its original form; some of the logic shared by txt2img/img2img can be extracted into separate functions and called from txt2img/img2img.

Strictly looking at the API side, I absolutely agree that the API should be kept in a form where the parameters do not need a separate call. (I also think it would not be obvious to a user why these should be persistent between generations.)

What I really like, though, is grouping parameters the way it's done with sd_video_t. That's something I'd also like to see for the control-net and photomaker parameters. It would be clear which parameter belongs to which feature without having to know the feature in detail, and there would be an explicit way to disable these by passing NULL (currently I have to guess what I need to set to disable them).
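A sketch of what that grouping might look like; the struct and field names below are invented for illustration, not the actual sd.cpp API:

```cpp
#include <cstdio>

// Hypothetical per-feature parameter groups, in the spirit of sd_video_t.
struct sd_controlnet_params_t {
    const char* model_path;
    const char* control_image_path;
    float       strength;
};

struct sd_photomaker_params_t {
    const char* embed_dir;
    float       style_ratio;
};

// Passing NULL for a group explicitly disables that feature.
void generate(const char* prompt,
              const sd_controlnet_params_t* controlnet,   // NULL = off
              const sd_photomaker_params_t* photomaker) { // NULL = off
    std::printf("prompt=%s controlnet=%s photomaker=%s\n", prompt,
                controlnet ? "on" : "off", photomaker ? "on" : "off");
}

int main() {
    sd_controlnet_params_t cn{"control_v11p_sd15_canny_2.safetensors",
                              "input_canny_mask.png", 0.9f};
    generate("portrait of girl", &cn, nullptr);      // ControlNet on, PhotoMaker off
    generate("portrait of girl", nullptr, nullptr);  // both off, no guessing
}
```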

Windsander commented 6 months ago

I think it's better to keep the API in its original form; some of the logic shared by txt2img/img2img can be extracted into separate functions and called from txt2img/img2img.

Strictly looking at the API side, I absolutely agree that the API should be kept in a form where the parameters do not need a separate call. (I also think it would not be obvious to a user why these should be persistent between generations.)

What I really like, though, is grouping parameters the way it's done with sd_video_t. That's something I'd also like to see for the control-net and photomaker parameters. It would be clear which parameter belongs to which feature without having to know the feature in detail, and there would be an explicit way to disable these by passing NULL (currently I have to guess what I need to set to disable them).

that's right~

I'm working on sinking the config down into sd_ctx_t and assigning the fine-tuned config to specific feature groups.

In recent practice, I've found it's better to fix the SVD functionality first, before we unify the config decoupling and the key points everyone mentioned.

Windsander commented 6 months ago

The last update I just pushed does partial work on loading & parsing model tensors.

In the latest commit (fix: make SVD-XT available part-1), I included some experimental integration in this part, which might look a bit messy.

I'll clean this up after the SVD-XT part-2 fix is committed

Windsander commented 6 months ago

The requested features are now done XD~

[image]

The txt2img/img2img/img2vid APIs are kept divided, by extracting the working flow based on the SD stage.

And the different parts of the prompt-engineering work with the config (if you set it).

Windsander commented 6 months ago

Some tests still need to be checked, so the WIP label remains for now.

Windsander commented 5 months ago

[image]

The resblock issue when using a self-defined VAE has been fixed.

Windsander commented 5 months ago

Currently testing the img2vid output decoder (VAE) stage, the last processing section of SVD-XT.

Windsander commented 5 months ago

Seems all fixed. But considering our structure is based on 1 frame per batch, I'm fairly certain the time_embeddings and input latent, together with the related noise/clip_embeds, are controlled by each loop of the scheduler in the mid-level logic.

That makes it reasonable to have video generation rely on the same init latent & noise (the noise is only used at init; as the timesteps advance, the actual noise tensor used in this project is the scheduling tensor plus the init noise, per the paper).
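For reference, a hedged reading of that formulation (my paraphrase of the EDM-style scheme, not a verbatim quote from the paper): the latent fed to the denoiser at step t is

x_t = x_init + sigma_t * eps_init

where eps_init is the noise drawn once at init and sigma_t comes from the scheduler.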

[image]

So the img2vid entry logic should look like below:

[image]

And all video_frames handling in the UNet should be removed, because with only 1 frame × 1 batch, the N count is 1.

Otherwise, we would need to move the upper-level batch loop into the UNet part, which seems unacceptable.

[image]

So, I used the first solution. XD

Windsander commented 5 months ago

Currently the 1-frame test has passed.

I'm moving on to the 14-frame test, which may need many hours to generate.

But the review can proceed in the meantime. ;P @leejet @Green-Sky @DarthAffe @dmikey

guys~ please help me with some checks~ - 0v0

Windsander commented 5 months ago

ALL tests passed!!! haha~

Now you can try it~ I suggest using 60 steps + 6/8 fps with 14/16 frames; see the example command below.
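For instance, adapting the img2vid test case from the description with the suggested settings (only --steps changed to 60, keeping 14 frames at 8 fps):

 -p " best quality, extremely detailed, keep main character, in outer space " --model ../../../sd-video/svd_xt.safetensors --vae ../../../sd-video/vae/diffusion_pytorch_model.fp16.safetensors --sampling-method lcm --strength 0.65 --style-ratio 10 --steps 60 --cfg-scale 1.0 --seed 15.0 -M img2vid -H 1024 -W 1024 --video-total-frames 14 --video-fps 8 --motion-bucket-id 42 -i ../../../io-test/input.png -o ../../../io-test/output.png -v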

[WIP] will now be removed! xD

Windsander commented 5 months ago

SVD-XT now supports:

  1. Prompt + negative prompt
  2. Embedding with an init image
  3. Using the init image as the starting spatial info, familiar as an image prompt
  4. LoRA support
  5. PhotoMaker support

Green-Sky commented 5 months ago

@Windsander what models are you testing this with?

Green-Sky commented 5 months ago

I am trying to run this with svd_xt-fp16.safetensors, but it fails.

Can you also share a full command line, so I know what needs to be set?

Windsander commented 5 months ago

@Windsander what models are you testing this with?

Yep, I'm using the official SD release:

https://huggingface.co/stabilityai/stable-video-diffusion-img2vid

with this command:

-p " best quality, extremely detailed, keep main character, in outer space " --model ../../../sd-video/svd_xt.safetensors --vae ../../../sd-video/vae/diffusion_pytorch_model.fp16.safetensors --sampling-method lcm --strength 0.65 --style-ratio 10 --steps 3 --cfg-scale 1.0 --seed 15.0 -M img2vid -H 1024 -W 1024 --video-total-frames 14 --video-fps 8 --motion-bucket-id 42 -i ../../../io-test/input.png -o ../../../io-test/output.png -v

Windsander commented 5 months ago

I am trying to run this with svd_xt-fp16.safetensors, but it fails.

Can you also share a full command line, so I know what needs to

Windsander commented 5 months ago

god... wrong click...

Windsander commented 5 months ago

How can I reopen this request... my, aww,

Windsander commented 5 months ago

[image]

Windsander commented 5 months ago

[image]

Windsander commented 5 months ago

The newest version is here:

https://github.com/leejet/stable-diffusion.cpp/pull/236

...my stupid hand.

leejet commented 5 months ago

Seems all fixed. But considering our structure is based on 1 frame per batch, I'm fairly certain the time_embeddings and input latent, together with the related noise/clip_embeds, are controlled by each loop of the scheduler in the mid-level logic.

That makes it reasonable to have video generation rely on the same init latent & noise (the noise is only used at init; as the timesteps advance, the actual noise tensor used in this project is the scheduling tensor plus the init noise, per the paper).

[image]

So the img2vid entry logic should look like below:

[image]

And all video_frames handling in the UNet should be removed, because with only 1 frame × 1 batch, the N count is 1.

Otherwise, we would need to move the upper-level batch loop into the UNet part, which seems unacceptable.

[image]

So, I used the first solution. XD

Regardless of the value of num_frames, the original SpatialVideoTransformer should be able to handle it correctly without additional loops, because the forward operation supports batch inference, unless there are issues with the batch inference implementation in some operators within GGML. However, once you input each frame separately into the UNet and then concatenate the results, the outcome will be different from when all frames are input into the UNet simultaneously. For example, the input to time_stack is [B h w, T, inner_dim], where T is equal to num_frames, and time_stack performs self-attention. Obviously, inputting each frame separately will yield different results from inputting all frames together. The time_stack operation uses all frames to perform attention, specifically over the temporal dimension, hence its name "time_stack".

If you input each frame individually into UNet based on num_frames, you're essentially generating a video consisting of num_frames single-frame videos stitched together, rather than producing a single video with num_frames frames.
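To make the time_stack point concrete, here is a tiny self-contained C++ sketch (toy dimensions, identity q/k/v projections, no GGML; not the real SpatialVideoTransformer) of self-attention over the temporal axis. With all T frames batched together each output mixes information from every frame, while running one frame at a time (T = 1) collapses the softmax to a single weight of 1, so each frame can only attend to itself:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Toy self-attention over T frames of dimension D, mimicking time_stack's
// input layout [B*h*w, T, inner_dim] with B*h*w = 1 and identity q/k/v.
std::vector<std::vector<float>> time_attention(const std::vector<std::vector<float>>& x) {
    const size_t T = x.size(), D = x[0].size();
    std::vector<std::vector<float>> out(T, std::vector<float>(D, 0.0f));
    for (size_t i = 0; i < T; ++i) {
        std::vector<float> w(T);
        float sum = 0.0f;
        for (size_t j = 0; j < T; ++j) {  // attention scores q_i . k_j
            float s = 0.0f;
            for (size_t d = 0; d < D; ++d) s += x[i][d] * x[j][d];
            w[j] = std::exp(s / std::sqrt(static_cast<float>(D)));
            sum += w[j];
        }
        for (size_t j = 0; j < T; ++j)    // out_i = sum_j softmax_ij * v_j
            for (size_t d = 0; d < D; ++d) out[i][d] += (w[j] / sum) * x[j][d];
    }
    return out;
}

int main() {
    // T = 3 frames, D = 2 features per frame.
    std::vector<std::vector<float>> frames = {{1, 0}, {0, 1}, {1, 1}};
    auto joint = time_attention(frames);             // all frames together
    for (size_t i = 0; i < frames.size(); ++i) {
        auto solo = time_attention({frames[i]});     // one frame at a time
        // With T = 1 the softmax weight is 1, so solo == frames[i];
        // the joint result differs because frame i attends to the others.
        std::printf("frame %zu  joint=(%.3f, %.3f)  solo=(%.3f, %.3f)\n",
                    i, joint[i][0], joint[i][1], solo[0][0], solo[0][1]);
    }
}
```

Concatenating the per-frame outputs just reproduces the inputs, which is exactly the "num_frames stitched single-frame videos" case described above, while the batched run lets information flow across the temporal dimension.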