Hey! I've been trying to see whether I can use the text-to-video-synthesis pipelines (or some of the modules they use) in diffusers to recreate the video diffusion methods from several papers that share a common recipe: start from a pretrained stable diffusion model, add temporal attention layers whose outputs are added as residuals to the pre-existing (and pretrained) spatial layer outputs inside the UNet, and then fine-tune only these temporal layers to obtain a text-to-video model.
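For concreteness, here is a rough sketch of the kind of temporal layer I have in mind (the module name, the zero-initialisation, and the head count are my own assumptions, not something taken from a specific paper or from diffusers):

```python
import torch
import torch.nn as nn
from einops import rearrange


class TemporalAttention(nn.Module):
    """Self-attention over the frame axis, added as a residual on top of the
    (frozen) spatial layer output. Assumes frames are folded into the batch,
    i.e. hidden states of shape (batch * frames, channels, height, width)."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        # num_heads must divide channels (true for SD-1.5 block widths).
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Zero-init the output projection so the block starts as an identity
        # and the pretrained spatial behaviour is preserved at step 0.
        nn.init.zeros_(self.attn.out_proj.weight)
        nn.init.zeros_(self.attn.out_proj.bias)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (batch * frames, channels, height, width)
        residual = x
        h, w = x.shape[2], x.shape[3]
        # Attend over the frame axis independently at every spatial location.
        x = rearrange(x, "(b f) c h w -> (b h w) f c", f=num_frames)
        x = self.norm(x)
        x, _ = self.attn(x, x, x)
        x = rearrange(x, "(b h w) f c -> (b f) c h w", f=num_frames, h=h, w=w)
        return residual + x
```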
I was hoping to get some high-level pointers on how to do this. For example, the current text-to-video pipeline uses UNet3DConditionModel while SD-1.5 uses UNet2DConditionModel, so is the best approach the following: copy the stable diffusion pipeline as-is and add modules for temporal processing by creating a modified UNet2DConditionModel subclass with a few attention layers operating on a rearranged view of the batch (keeping everything else the same), then load the SD-1.5 weights into the non-temporal parts of this modified UNet and train only the temporal layers (similar to how we might train only LoRA attention layers)? A sketch of that second step follows below. I can contribute a working example of the above if the maintainers think it is feasible building off of the modules currently present in diffusers. Thank you!
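Here is roughly what I mean by the weight-loading and freezing step, reusing the TemporalAttention module sketched above (the repo id, layer placement, and hyperparameters are only placeholders; in the real version the temporal layers would be inserted inside the modified UNet's blocks rather than kept in a separate list):

```python
import torch
import torch.nn as nn
from diffusers import UNet2DConditionModel

# Load the pretrained SD-1.5 spatial UNet; these weights would populate the
# non-temporal parts of the modified UNet2DConditionModel subclass.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

# Freeze all pretrained spatial weights ...
unet.requires_grad_(False)

# ... and optimise only the newly added temporal layers (one per resolution
# here, purely for illustration), analogous to training only LoRA layers.
temporal_layers = nn.ModuleList(
    TemporalAttention(c) for c in unet.config.block_out_channels
)
optimizer = torch.optim.AdamW(temporal_layers.parameters(), lr=1e-4)
```

Because the temporal output projections are zero-initialised, the combined model initially reproduces SD-1.5's per-frame behaviour exactly, which is the property these papers rely on when fine-tuning only the temporal layers.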
Edit: Also, is there a paper citation where I can read up on the pipeline used here?