huggingface/diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

Using pre-trained stable diffusion checkpoint to train temporal layers for text-to-video #5200

Closed · gunshi closed this 1 year ago

gunshi commented 1 year ago

Hey! I've been trying to see whether I can use the text-to-video-synthesis pipelines (or some of the modules they use) in diffusers to recreate common video diffusion methods from the literature. These papers all share the same recipe: start from a pretrained Stable Diffusion model, add temporal attention layers whose outputs are added as residuals to the pre-existing (and pretrained) spatial layer outputs within the UNet, and then fine-tune only these temporal layers to obtain a text-to-video model.
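Concretely, the kind of block I have in mind looks roughly like this (a minimal, untested sketch; `TemporalAttention` is just my own name for it, and I'm assuming einops for the reshapes):

```python
import torch
import torch.nn as nn
from einops import rearrange


class TemporalAttention(nn.Module):
    """Self-attention over the frame axis, added as a zero-initialized residual
    so the pretrained spatial output is unchanged at the start of training."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.GroupNorm(32, channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Zero-init the output projection: the residual starts as a no-op.
        nn.init.zeros_(self.attn.out_proj.weight)
        nn.init.zeros_(self.attn.out_proj.bias)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (batch * frames, channels, height, width), i.e. frames stacked
        # into the batch dimension, exactly as the spatial layers see them.
        residual = x
        h, w = x.shape[2], x.shape[3]
        x = self.norm(x)
        # Fold spatial positions into the batch and attend across frames.
        x = rearrange(x, "(b f) c h w -> (b h w) f c", f=num_frames)
        x, _ = self.attn(x, x, x)
        x = rearrange(x, "(b h w) f c -> (b f) c h w", f=num_frames, h=h, w=w)
        return residual + x
```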

I was looking to get some pointers on how to do this at a high level. For example, the current text-to-video pipeline uses UNet3DConditionModel but SD-1.5 uses UNet2DConditionModel, so is the best approach the following: copy the Stable Diffusion pipeline exactly as it is, and add modules for temporal processing by creating a modified UNet2DConditionModel class with attention layers operating on a rearranged view of the batch, keeping everything else the same; then load the SD-1.5 weights into the non-temporal parts of this modified UNet, and train only the temporal layers (similar to how we might train LoRA attention layers)? A rough sketch of the training setup follows below. I can contribute a working example of the above if the maintainers think it's feasible building off of modules currently present in diffusers. Thank you!
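For the freeze-and-train part, something like this (again just a sketch; it assumes the hypothetical `TemporalAttention` blocks above have been spliced into the UNet's forward pass, which I haven't shown, and uses `block_out_channels` only as a convenient way to size them):

```python
import torch
import torch.nn as nn
from diffusers import UNet2DConditionModel

# Load the pretrained SD-1.5 UNet and freeze all of its spatial weights.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
unet.requires_grad_(False)

# One hypothetical TemporalAttention block per resolution; actually wiring
# them into the UNet's forward pass is not shown here.
temporal_layers = nn.ModuleList(
    [TemporalAttention(c) for c in unet.config.block_out_channels]
)

# Only the temporal parameters receive gradients, analogous to LoRA training.
optimizer = torch.optim.AdamW(temporal_layers.parameters(), lr=1e-4)
trainable = sum(p.numel() for p in temporal_layers.parameters())
print(f"training {trainable / 1e6:.1f}M temporal parameters")
```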

Edit: Also, is there a paper citation where I can read up on the pipeline used here?

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.