Closed. Windsander closed this 5 months ago.
The description sounds like this should be split into 2 separate PRs. (nice work on improving the ctx lifecycle)
> The description sounds like this should be split into 2 separate PRs. (nice work on improving the ctx lifecycle)
Yep, trying to make it nice :p I've been trying to split up the configurations; hopefully that's helpful (I wish :p)~
Currently I'm working on fixing the VID (img2vid) part of this project, and hopefully I can commit it in time, haha~
Thank you for your contribution. However, I prefer to keep the generation parameters decoupled from sd_ctx_t, rather than putting them inside sd_ctx_t.
I think it's better to keep the API in its original form; the logic shared by txt2img/img2img can be extracted into separate functions that txt2img/img2img both call.
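To make the suggestion concrete, here's a minimal sketch of that shape; the names and signatures are illustrative only, not the actual stable-diffusion.cpp API:

```cpp
#include <cstdint>

// Hypothetical sketch (not the real stable-diffusion.cpp signatures):
// sd_ctx_t owns only the loaded model; every generation parameter
// travels with the call, so nothing persists between generations.
struct sd_ctx_t;  // opaque handle to the loaded model

struct sd_image_t {
    uint32_t width, height, channel;
    uint8_t* data;
};

// All generation parameters stay on the call, not inside sd_ctx_t.
sd_image_t* txt2img(sd_ctx_t* ctx, const char* prompt, const char* negative_prompt,
                    float cfg_scale, int width, int height, int sample_steps, int64_t seed);

sd_image_t* img2img(sd_ctx_t* ctx, sd_image_t init_image, const char* prompt,
                    float strength, int sample_steps, int64_t seed);

// The logic shared by both (conditioning, the sampling loop, VAE decode)
// can live in internal helpers that each entry point calls.
```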
It seems txt2img/img2img can share almost the same params, but the svd-xt/svd video models are quite different.
I've fixed the svd-xt VAE and am moving on to the base model.
Ahh.. the last update was really nice, I've got some clues now :p
> I think it's better to keep the API in its original form; the logic shared by txt2img/img2img can be extracted into separate functions that txt2img/img2img both call.

Looking strictly at the API side, I absolutely agree that the API should be kept in a form where the parameters don't need a separate call. (I also think it wouldn't be obvious to a user why these should be persistent between generations.)
What I really like, though, is grouping parameters the way it's done with sd_video_t. That's something I'd also like to see for the control-net and photomaker parameters. It would then be clear, without having to know each feature in detail, which parameter belongs to which feature, and there would be an explicit way to disable them by passing NULL (currently I have to guess what I need to set to disable them).
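For illustration, a minimal sketch of that grouping; sd_video_t is from this PR, while sd_control_net_t and sd_photomaker_t are made-up names here:

```cpp
// Hypothetical feature-group structs; only sd_video_t actually exists
// in this PR, the other names are illustrative.
struct sd_control_net_t {
    const char* model_path;
    float strength;
};

struct sd_photomaker_t {
    const char* id_images_dir;
    float style_ratio;
};

struct sd_ctx_t;    // opaque model handle
struct sd_image_t;  // generated image

// Passing each feature group by pointer makes "disabled" explicit:
// NULL = feature off, no guessing which fields to zero out.
sd_image_t* txt2img(sd_ctx_t* ctx,
                    const char* prompt,
                    const sd_control_net_t* control_net,  // NULL disables control-net
                    const sd_photomaker_t* photomaker);   // NULL disables photomaker
```

A call like `txt2img(ctx, "a cat", NULL, NULL)` would then read unambiguously as "both features off".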
That's right~
I'm working on sinking the config into sd_ctx_t and assigning the fine-tuned config to specific feature groups.
In recent practice I've found it's better to fix the SVD functionality first, before we unify the config decoupling and the key points everyone mentioned.
In the last update I just pushed, I've done partial work on loading & parsing the model tensors.
In the latest commit, I included some experimental integration in this part, which might look a bit messy: fix: make SVD-XT available part-1
I'll clean this up after the SVD-XT part-2 fix is committed.
The requested features are now done XD~
The txt2img/img2img/img2vid APIs stay separate, with the shared workflow extracted based on the SD stage,
and the different parts of the prompt engineering work through the config (if you set it).
Some tests still need to be checked, so the WIP label stays for now.
The ResBlock issue when using a self-defined VAE has been fixed.
Currently testing the img2vid output decoder (VAE) stage, the last processing section of SVD-XT.
Seems all fixed now. But considering our structure is based on 1 frame per batch, I'm fairly certain the time embeddings & input latent, together with the related noise/clip embeddings, are controlled by each loop of the scheduler in the middle logic layer.
That makes it reasonable to have video generation rely on the same init latent & noise (the noise is only used at init; as the time steps advance, the actual noise tensor used in this project is the scheduled tensor plus the init noise, per the paper).
So the img2vid entry logic should look like the sketch below:
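A minimal sketch of that entry loop, using hypothetical helper names, reusing one init latent and one init noise across the whole frame loop:

```cpp
#include <vector>

struct sd_ctx_t;     // opaque model handle
struct ggml_tensor;  // opaque ggml tensor

// Hypothetical helpers, named for illustration only:
ggml_tensor* build_frame_cond(sd_ctx_t* ctx, int frame_id, int total_frames);
ggml_tensor* sample_one_frame(sd_ctx_t* ctx, ggml_tensor* init_latent,
                              ggml_tensor* init_noise, ggml_tensor* cond);
ggml_tensor* decode_frame(sd_ctx_t* ctx, ggml_tensor* latent);  // VAE decode

// Every frame reuses the same init latent and the same init noise; only
// the per-frame conditioning (time/frame embeddings) changes, and the
// scheduler derives each step's actual noise from the init noise.
std::vector<ggml_tensor*> img2vid(sd_ctx_t* ctx,
                                  ggml_tensor* init_latent,  // VAE-encoded input image
                                  ggml_tensor* init_noise,   // sampled once, at init
                                  int total_frames) {
    std::vector<ggml_tensor*> frames;
    frames.reserve(total_frames);
    for (int frame_id = 0; frame_id < total_frames; ++frame_id) {
        ggml_tensor* cond   = build_frame_cond(ctx, frame_id, total_frames);
        ggml_tensor* latent = sample_one_frame(ctx, init_latent, init_noise, cond);
        frames.push_back(decode_frame(ctx, latent));  // 1 frame * 1 batch per pass
    }
    return frames;
}
```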
And all the video_frames handling inside the UNet should be removed, because 1 frame * 1 batch makes N always equal 1.
Otherwise we'd need to move the upper-level batch loop into the UNet itself, which seems unacceptable.
So I went with the first solution. XD
Currently the 1-frame test has passed,
and I'm moving on to the 14-frame test, which may need many hours to generate.
But review can proceed in the meantime. ;P @leejet @Green-Sky @DarthAffe @dmikey
Guys, please help me with some checks~ 0v0
All tests passed!!! haha~
Now you can try it~ I suggest using 60 steps + 6/8 fps with 14/16 frames.
The [WIP] tag will be removed now! xD
SVD-XT (img2vid) now has these abilities:
@Windsander what models are you testing this with?
I am trying to run this with svd_xt-fp16.safetensors, but it fails.
Can you also share a full command line, so I know what needs to be set?
> @Windsander what models are you testing this with?

Yep, I'm using the official SD model:
https://huggingface.co/stabilityai/stable-video-diffusion-img2vid
with this command line:
-p " best quality, extremely detailed, keep main character, in outer space " --model ../../../sd-video/svd_xt.safetensors --vae ../../../sd-video/vae/diffusion_pytorch_model.fp16.safetensors --sampling-method lcm --strength 0.65 --style-ratio 10 --steps 3 --cfg-scale 1.0 --seed 15.0 -M img2vid -H 1024 -W 1024 --video-total-frames 14 --video-fps 8 --motion-bucket-id 42 -i ../../../io-test/input.png -o ../../../io-test/output.png -v
God... wrong click...
How can I reopen this PR? Aww,
> Seems all fixed now. But considering our structure is based on 1 frame per batch, I'm fairly certain the time embeddings & input latent, together with the related noise/clip embeddings, are controlled by each loop of the scheduler in the middle logic layer.
> That makes it reasonable to have video generation rely on the same init latent & noise (the noise is only used at init; as the time steps advance, the actual noise tensor used in this project is the scheduled tensor plus the init noise, per the paper).
> So the img2vid entry logic should look like the sketch below:
> And all the video_frames handling inside the UNet should be removed, because 1 frame * 1 batch makes N always equal 1.
> Otherwise we'd need to move the upper-level batch loop into the UNet itself, which seems unacceptable.
> So I went with the first solution. XD
Regardless of the value of num_frames, the original SpatialVideoTransformer should be able to handle it correctly without additional loops, because the forward operation supports batch inference (unless some operators within GGML have issues with batch inference). However, once you input each frame separately into the UNet and then concatenate the results, the outcome will differ from when all frames are input into the UNet simultaneously. For example, the input to time_stack is [B*h*w, T, inner_dim], where T equals num_frames, and time_stack performs self-attention. Obviously, inputting each frame separately will yield different results from inputting all frames together. The time_stack operation uses all frames to perform attention, specifically over the temporal dimension, hence its name "time_stack".
If you input each frame individually into the UNet based on num_frames, you're essentially generating num_frames single-frame videos stitched together, rather than producing a single video with num_frames frames.
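A small self-contained demo of this point (plain C++, not the actual GGML code): self-attention over the frame axis T mixes information across frames, but with T = 1 the softmax over a single score is 1.0 and the output collapses to the input value, so no temporal mixing happens at all.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Naive self-attention over the T axis for one spatial position:
// x is [T][D]; Q = K = V = x for simplicity (no learned projections).
std::vector<std::vector<float>> attend_over_time(
        const std::vector<std::vector<float>>& x) {
    const size_t T = x.size(), D = x[0].size();
    std::vector<std::vector<float>> out(T, std::vector<float>(D, 0.0f));
    for (size_t i = 0; i < T; ++i) {
        std::vector<float> scores(T, 0.0f);
        float max_s = -1e30f;
        for (size_t j = 0; j < T; ++j) {  // scores[j] = <x[i], x[j]> / sqrt(D)
            float dot = 0.0f;
            for (size_t d = 0; d < D; ++d) dot += x[i][d] * x[j][d];
            scores[j] = dot / std::sqrt((float)D);
            if (scores[j] > max_s) max_s = scores[j];
        }
        float sum = 0.0f;  // softmax over j, then weight the values
        for (size_t j = 0; j < T; ++j) { scores[j] = std::exp(scores[j] - max_s); sum += scores[j]; }
        for (size_t j = 0; j < T; ++j)
            for (size_t d = 0; d < D; ++d)
                out[i][d] += (scores[j] / sum) * x[j][d];
    }
    return out;
}

int main() {
    std::vector<std::vector<float>> frames = {{1, 0}, {0, 1}, {1, 1}};  // T = 3, D = 2
    auto joint = attend_over_time(frames);       // frames attend to each other
    auto solo  = attend_over_time({frames[0]});  // T = 1: output == input, no mixing
    std::printf("joint frame0: %.3f %.3f\n", joint[0][0], joint[0][1]);
    std::printf("solo  frame0: %.3f %.3f\n", solo[0][0], solo[0][1]);  // prints 1.000 0.000
    return 0;
}
```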
What this commit does
-- New features:
-- Core abilities added to sd.cpp img2img mode:
-- Refactored the sd_ctx_t lifecycle with ggml_context maintenance:
What abilities SVD-XT (img2vid) has
Tested with the 7 cases below (params only):
img2img
txt2img
img2vid
Test model version: