@HReynaud not yet, i'm planning to build that by month's end, or by early next month
Excellent news! Thank you for your hard work!
@HReynaud yea no problem, it will take some thought rather than just blindly throwing in a 3d conv with strides
the reason is because i want to maintain the ability to pretrain on images before video, so it'll have to follow the existing scheme of being agnostic to image / video inputs
@HReynaud should be able to get the temporal upsampling / interpolation finished tomorrow morning!
@HReynaud pray tell, what dataset are you training on?
Thank you so much! I was considering giving it a try, but was not sure where to start. I am training on the Echonet-Dynamic dataset to do cardiac ultrasound generation: https://echonet.github.io/dynamic/. The goal is to explore video generation with precise control over the embeddings. I am using auto-encoders and other techniques to encode specific information into a latent space that I use as the "text embeddings" for the conditional generation. I will give it a try as soon as it's live!
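For anyone following along, here is a minimal sketch of conditioning on custom embeddings instead of T5 text embeddings; the embedding dimension (512) and sequence length (1) are made up for illustration, and other arguments are left at their defaults:

```python
import torch
from imagen_pytorch import Unet, Imagen

# hypothetical latent from an auto-encoder, used in place of T5 text embeddings
latents = torch.randn(4, 1, 512)               # (batch, sequence length, embedding dim)

unet = Unet(dim = 64, dim_mults = (1, 2, 4))

imagen = Imagen(
    unets = (unet,),
    image_sizes = (64,),
    text_embed_dim = 512                        # dimension of the custom conditioning embeddings
)

images = torch.randn(4, 3, 64, 64)
loss = imagen(images, text_embeds = latents, unet_number = 1)
loss.backward()
```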
@HReynaud oh that's so cool! ok i'll def make sure to build this right 😃 this-mitral-valve-prolapse-does-not-exist.com lmao
@HReynaud was able to get it working, although inpainting may still not be functional (will look at that tomorrow morning) https://github.com/lucidrains/imagen-pytorch/commit/44da9be862eb026ca3edb5221a96025f552af694
@HReynaud to use it, when instantiating Imagen, just pass in `temporal_downsample_factor`
it is a tuple of integers, specifying at each stage of the series of unets, how much to divide the number of frames by
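As a rough sketch of what this might look like with the video API (`Unet3D` + `ElucidatedImagen`), with illustrative sizes and frame counts:

```python
import torch
from imagen_pytorch import Unet3D, ElucidatedImagen

# base video unet and a spatial + temporal super-resolution unet
unet1 = Unet3D(dim = 64, dim_mults = (1, 2, 4))
unet2 = Unet3D(dim = 64, dim_mults = (1, 2, 4))

imagen = ElucidatedImagen(
    unets = (unet1, unet2),
    image_sizes = (64, 112),                  # spatial size produced by each unet
    temporal_downsample_factor = (4, 1),      # base unet works on 1/4 of the frames, last stage keeps them all
    cond_drop_prob = 0.1
)

# training videos at full spatial and temporal resolution: (batch, channels, frames, height, width)
videos = torch.randn(2, 3, 64, 112, 112)
texts = ['an apical four chamber view', 'a parasternal long axis view']

# frames should be temporally downsampled 4x automatically when training the first unet
loss = imagen(videos, texts = texts, unet_number = 1)
loss.backward()
```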
Awesome! I will let you know how it goes!
@HReynaud how did it go?
The code seems to work flawlessly and `temporal_downsample_factor` is very straightforward to set up, thank you for that!
(Top is sampled, bottom is ground truth)
This is the output of the upsampling model; I am going from 64x64x16 to 112x112x64. The top row is sampled from the SR model, conditioned on the low-resolution ground truth videos. I have to investigate whether tuning some parameters in the `ElucidatedImagen` can make the speckle noise more consistent.
@HReynaud ohh beautiful, it captured the valves and septal motion so well! looking forward to reading your paper on generated cardiac pathologies :smile: :heart:
@HReynaud make sure you show that to some cardiologists, go blow their minds!
I'll make sure to ping you when the results get published somewhere 😃
Hi @lucidrains, there might be a bug with the temporal upsampling pipeline. If I want to use the `cond_images` parameter, `ElucidatedImagen` gives the same `cond_images` to both the base model and the super-resolution model and only resamples the spatial dimensions. But the number of frames has to be different if `temporal_downsample_factor` is used.

The bug happens here. From my perspective, the solution would be to pass one `cond_images` per unet when sampling, which would make it possible to have only one of the two models conditioned on `cond_images`. This way `cond_images` could be set per-unet in the sampling loop.

The bug might also happen during training, but as I train the models separately, I have not tested it. Nonetheless, when training the models, I take care of passing a downsampled version of my `cond_images` to the base unet myself, which probably means there is a similar bug in the training loop.
Hey, @HReynaud, awesome results! I'm currently working with the EchoNet dataset as well, but so far have only been getting noise. May I ask how many steps you trained your model for in total to get these videos?
Hi @alif-munim, glad to hear other people are looking into this! Generating images of size 64x64 with a `Unet` (not 3D) takes less than an hour of training on a modern GPU, e.g. a 3090/A5000; I would try that first. This holds with parameters left at their defaults for the `Imagen` and `ImagenTrainer` modules. For the `Unet`, setting `dim=64` and `dim_mults=(1, 2, 4)` should give good results.
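As a concrete starting point, something along these lines (a sketch only, with a dummy batch standing in for real data):

```python
import torch
from imagen_pytorch import Unet, Imagen, ImagenTrainer

unet = Unet(dim = 64, dim_mults = (1, 2, 4))

imagen = Imagen(
    unets = (unet,),
    image_sizes = (64,)
)

trainer = ImagenTrainer(imagen)

# one training step on a dummy batch; replace with your own 64x64 frames and captions
images = torch.randn(8, 3, 64, 64)
loss = trainer(images, texts = ['an echo frame'] * 8, unet_number = 1)
trainer.update(unet_number = 1)
```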
For video, I use the `ElucidatedImagen`. Try leaving all parameters at their defaults (especially `ignore_time=False`) and you should get results after a few hours of compute on a GPU cluster.
oh oops! ok, i'll get this fixed next week!
you should only have to pass in the `cond_images` (should be renamed `cond_images_or_video`) of what the super resoluting net receives, and it should automatically temporally downsample for the base unet during training
@HReynaud say you have two unets, and the second unet temporally upsamples 2x; i'll probably make it error out if the number of frames on the conditioning video is any less than 2. is that appropriate, you think? or should i also allow a single conditioning frame across all unets
@HReynaud actually, i'll just allow for that behind a feature flag, something like `can_condition_on_single_frame_across_unets` (long name just to be clear)
Hi @lucidrains, For the image conditioning, I am using a single frame repeated as many times as necessary on the time dimension, so I have not practically encountered the case you mention. I guess if I were to use a video as conditioning the problem you state could arise and your solution seems sound.
If you have some time, I have been making a few small corrections / edits to the code on my fork. These are minimal edits which should be easy to track.
https://github.com/lucidrains/imagen-pytorch/compare/main...HReynaud:imagen-pytorch:main
I have not pushed my latest edits that correct a few bugs when using more than 2 Unets with temporal super-resolution. I'll push the commit tomorrow.
I considered making a pull request but my edits are really focused on what I am targeting and would not be general enough.
I’ll continue using your repo and will let you know if I encounter any more bugs.
@HReynaud ohh yes, there's actually two scenarios at play here
either you want to condition on an image, in which case you repeat across time and concat along the channel dimension
but the other type of conditioning would be say conditioning on a few preceding frames (much like prompting in GPT), in which case we would want to do the temporal down/upsampling and concatenate across time
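Something like the following shape-level illustration of the two schemes (pure tensor bookkeeping, not library code; all sizes are made up):

```python
import torch
from einops import repeat

video       = torch.randn(1, 3, 16, 64, 64)    # (batch, channels, frames, height, width)
cond_frame  = torch.randn(1, 3, 64, 64)        # a single conditioning image
prompt_clip = torch.randn(1, 3, 4, 64, 64)     # a few preceding frames, GPT-prompt style

# scenario 1: image conditioning - repeat across time, then concat along the channel dimension
cond = repeat(cond_frame, 'b c h w -> b c f h w', f = video.shape[2])
channel_conditioned = torch.cat((video, cond), dim = 1)      # (1, 6, 16, 64, 64)

# scenario 2: conditioning on preceding frames - concatenate along the time dimension
time_conditioned = torch.cat((prompt_clip, video), dim = 2)  # (1, 3, 20, 64, 64)
```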
but the latter can be achieved through inpainting too
ok, this is all very confusing, but i'll definitely take a look at this monday and get it squared away by end of next week!
@HReynaud hey Hadrien! i believe your issue should be resolved in the latest version! (do let me know if it hasn't)
i'll keep working on the conditioning across the time dimension, as that will allow one to generate arbitrarily long videos akin to phenaki
Hi @lucidrains, thanks for your quick response! The commit looks great. I would just like to add that in `resize_video_to`, there is a check that prevents temporal upsampling when the spatial dimensions are untouched, i.e. `orig_video_size == target_image_size` and `target_frames is not None`.
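For reference, a rough sketch of the guard being described; this is not the library's actual implementation, just the shape of the fix:

```python
import torch
import torch.nn.functional as F

def resize_video_to(video, target_image_size, target_frames = None):
    # video: (batch, channels, frames, height, width)
    *_, frames, _, orig_video_size = video.shape

    if target_frames is None:
        target_frames = frames

    # skip resizing only when BOTH the spatial size and the frame count already match;
    # checking the spatial size alone would silently drop a purely temporal upsample
    if orig_video_size == target_image_size and target_frames == frames:
        return video

    return F.interpolate(
        video,
        size = (target_frames, target_image_size, target_image_size),
        mode = 'nearest'
    )
```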
@HReynaud yes indeed! thank you! https://github.com/lucidrains/imagen-pytorch/commit/3c24c60904e37fc0cf572f54501f0c2c3513ffd9
Hi @HReynaud, thank you so much for your advice! I have been trying to train a simpler text-to-image model as you suggested, but upon sampling after 50 epochs of training I'm still just getting a black square :/
Could you kindly let me know how large your dataset size was and how many epochs / training steps you needed before seeing some decent samples?
Hi @alif-munim, try this script to get started: example
Thanks so much @HReynaud! Could you kindly let me know why you used `trainer.train_step()` and `trainer.valid_step()` over `trainer.update()`? Is there a difference?
Hi @alif-munim, lucid could probably explain this more in depth than me, but to put it simply, `trainer.update()` only does the backpropagation operation, i.e. `tensor.backward()` in PyTorch.

`trainer.train_step()` first runs the forward process and then automatically calls `trainer.update()` to train the model. If you don't run the forward step first, the model has no gradients to backpropagate through.
Thanks once again @HReynaud, I had been stuck on this issue for a while now and you've helped tremendously! I believe the issue was that I was only ever using `trainer.update()`, so the model did not learn to generate anything but noise.
@lucidrains, I think it would be a great idea to have @HReynaud's example script somewhere in the documentation for beginners like me :)
Hi,
Is there a way with the current library to do temporal super-resolution? From what I can see, only spatial super-resolution is currently possible. I would like to have one UNet that generates videos with dimensions 64x64x16 and a super-resolution UNet that would upsample them to 128x128x32. Please let me know if I am missing something!