kolabearafk opened 1 year ago
I can take a shot to see if this works with the currently available implementations floating around.
If we train just the CrossAttention layers (fine-tuning the Pseudo Conv3D layers is tricky) and limit the resolution to 256x256, it may (and this is a big if) fit in 24GB of VRAM.
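To make the idea concrete, here is a minimal sketch of selecting only the cross-attention parameters for training. It assumes the Diffusers naming convention, where cross-attention modules are called `attn2` (self-attention is `attn1`); whether the ModelScope port follows that convention is an assumption on my part.

```python
# Hypothetical sketch: pick out cross-attention parameter names so that
# everything else can be frozen. The "attn2" marker follows the Diffusers
# UNet convention and is an assumption about this particular model.

def trainable_param_names(all_names, marker="attn2"):
    """Return only the parameter names belonging to cross-attention blocks."""
    return [n for n in all_names if marker in n]

# Example parameter names, in the style Diffusers uses:
names = [
    "down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_q.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_q.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_k.weight",
    "down_blocks.0.resnets.0.conv1.weight",
]
cross_attn = trainable_param_names(names)

# With a real model you would then freeze everything else, e.g.:
# for name, p in unet.named_parameters():
#     p.requires_grad = "attn2" in name
```

Freezing everything but these parameters is what keeps the optimizer state (and therefore VRAM) small enough to have a chance at fitting in 24GB.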
Also, I don't know whether they used a DDPM scheduler or the Gaussian Diffusion scheduler for training, as I don't know of a paper corresponding to this implementation. It seems to be a mix of Video Diffusion and Make-A-Video.
Either way, the process should be very simple if we reference the training methods we have floating around.
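Whichever scheduler they used, the training-side forward (noising) process is essentially the same in both formulations. A minimal sketch, assuming a linear beta schedule with illustrative constants (the actual ModelScope schedule is unknown):

```python
import math

def alpha_bar(t, T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_i) up to timestep t, linear beta schedule.

    These schedule constants are illustrative defaults, not the model's.
    """
    prod = 1.0
    for i in range(t + 1):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        prod *= 1.0 - beta
    return prod

def add_noise(x0, t, eps):
    """Standard DDPM forward step: x_t = sqrt(ab_t) * x0 + sqrt(1 - ab_t) * eps."""
    ab = alpha_bar(t)
    return math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * eps
```

During training, the model is then asked to predict `eps` (or `x0`, depending on the parameterization) from `x_t` and `t`, so the scheduler choice mostly affects sampling rather than this loss setup.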
I'm also curious whether, since the model was already trained on a sufficient amount of data, you could fine-tune it in an unconditional way (no prompts, just video data).
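In practice, "unconditional" fine-tuning of a text-conditioned model usually just means pairing every clip with an empty prompt so only the video data drives the update. A tiny sketch of that batch construction (the field names here are hypothetical, not from any specific trainer):

```python
# Hypothetical sketch: build training examples with empty prompts so the
# text conditioning contributes nothing and only the video data is learned.
def build_batch(video_clips):
    """Pair each clip with an empty prompt for unconditional fine-tuning."""
    return [{"pixel_values": clip, "prompt": ""} for clip in video_clips]
```

The empty prompt would then be encoded once by the text encoder and reused for every example, which also saves a little compute.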
I created a repository for Text2Video finetuning here using the recent Diffusers addition. Let me know how it goes if you give it a shot!
Incredible! @ExponentialML, I'll post it on Reddit if you don't mind?
Upd: posted here https://www.reddit.com/r/StableDiffusion/comments/11zhy1b/wake_up_samurai_modelscope_text2video_finetuning/
@ExponentialML Wow, truly amazing. Can't wait to try it. Thank you!
@kabachuha Didn't realize you posted it. All good, thanks for doing it!
@ExponentialML Hey, can you please look at this error? During fine-tuning it is not able to locate the files even though they are present in that folder. Please look at this issue, I need an urgent fix for it.
I have uploaded the necessary screenshot to help understand the error. @kabachuha Can you also take a look at this, please?
Is there an existing issue for this?
What would your feature do ?
Is there any released training code or published paper mentioning the training methods used for this model?
Proposed workflow
N/A
Additional information
No response