@HReynaud not yet, i'm planning to build that by month's end, or by early next month
Excellent news! Thank you for your hard work!
@HReynaud yea no problem, it will take some thought rather than just blindly throwing in a 3d conv with strides
the reason is because i want to maintain the ability to pretrain on images before video, so it'll have to follow the existing scheme of being agnostic to image / video inputs
@HReynaud should be able to get the temporal upsampling / interpolation finished tomorrow morning!
@HReynaud pray tell, what dataset are you training on?
Thank you so much! I was considering giving it a try, but was not sure where to start. I am training on the Echonet-Dynamic dataset to do cardiac ultrasound generation: https://echonet.github.io/dynamic/. The goal is to explore video generation with precise control over the embeddings. I am using auto-encoders and other techniques to encode specific information into a latent space that I use as the "text embeddings" for the conditional generation. I will give it a try as soon as it's live!
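For anyone following along, here is a minimal sketch of conditioning on custom embeddings instead of T5 text embeddings; the embedding dimension (512) and sequence length (1) are made up for illustration, and other arguments are left at their defaults:

```python
import torch
from imagen_pytorch import Unet, Imagen

# hypothetical latent from an auto-encoder, used in place of T5 text embeddings
latents = torch.randn(4, 1, 512)               # (batch, sequence length, embedding dim)

unet = Unet(dim = 64, dim_mults = (1, 2, 4))

imagen = Imagen(
    unets = (unet,),
    image_sizes = (64,),
    text_embed_dim = 512                        # dimension of the custom conditioning embeddings
)

images = torch.randn(4, 3, 64, 64)
loss = imagen(images, text_embeds = latents, unet_number = 1)
loss.backward()
```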
@HReynaud oh that's so cool! ok i'll def make sure to build this right 😃 this-mitral-valve-prolapse-does-not-exist.com lmao
@HReynaud was able to get it working, although inpainting may still not be functional (will look at that tomorrow morning) https://github.com/lucidrains/imagen-pytorch/commit/44da9be862eb026ca3edb5221a96025f552af694
@HReynaud to use it, when instantiating Imagen, just pass in `temporal_downsample_factor`
it is a tuple of integers, specifying at each stage of the series of unets, how much to divide the number of frames by
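As a rough sketch of what this might look like with the video API (`Unet3D` + `ElucidatedImagen`), with illustrative sizes and frame counts:

```python
import torch
from imagen_pytorch import Unet3D, ElucidatedImagen

# base video unet and a spatial + temporal super-resolution unet
unet1 = Unet3D(dim = 64, dim_mults = (1, 2, 4))
unet2 = Unet3D(dim = 64, dim_mults = (1, 2, 4))

imagen = ElucidatedImagen(
    unets = (unet1, unet2),
    image_sizes = (64, 112),                  # spatial size produced by each unet
    temporal_downsample_factor = (4, 1),      # base unet works on 1/4 of the frames, last stage keeps them all
    cond_drop_prob = 0.1
)

# training videos at full spatial and temporal resolution: (batch, channels, frames, height, width)
videos = torch.randn(2, 3, 64, 112, 112)
texts = ['an apical four chamber view', 'a parasternal long axis view']

# frames should be temporally downsampled 4x automatically when training the first unet
loss = imagen(videos, texts = texts, unet_number = 1)
loss.backward()
```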
Awesome! I will let you know how it goes!
@HReynaud how did it go?
The code seems to work flawlessly and `temporal_downsample_factor` is very straightforward to set up, thank you for that!
(Top is sampled, bottom is ground truth)
This is the output of the upsampling model; I am going from 64x64x16 to 112x112x64. The top row is sampled from the SR model, conditioned on the low-resolution ground truth videos. I have to investigate whether tuning some parameters in the `ElucidatedImagen` can make the speckle noise more consistent.
@HReynaud ohh beautiful, it captured the valves and septal motion so well! looking forward to reading your paper on generated cardiac pathologies :smile: :heart:
@HReynaud make sure you show that to some cardiologists, go blow their minds!
I'll make sure to ping you when the results get published somewhere 😃
Hi @lucidrains, there might be a bug with the temporal upsampling pipeline. If I want to use the `cond_images` parameter, `ElucidatedImagen` gives the same `cond_images` to both the base model and the super-resolution model and only resamples the spatial dimensions. But the number of frames has to be different if `temporal_downsample_factor` is used.

The bug happens here. From my perspective, the solution would be to pass one `cond_images` per unet when sampling, which would make it possible to have only one of the two models conditioned on `cond_images`. This way `cond_images` could be set per-unet in the sampling loop.

The bug might also happen during training, but as I train the models separately, I have not tested it. Nonetheless, when training the models, I take care of passing a downsampled version of my `cond_images` to the base unet myself, which probably means there is a similar bug in the training loop.
Hey, @HReynaud, awesome results! I'm currently working with the EchoNet dataset as well, but so far have only been getting noise. May I ask how many steps you trained your model for in total to get these videos?
Hi @alif-munim, glad to hear other people are looking into this! Generating images of size 64x64 with a `Unet` (not 3D) takes less than an hour of training on a modern GPU, e.g. a 3090/A5000; I would try that first. This holds with parameters left at their defaults for the `Imagen` and `ImagenTrainer` modules. For the `Unet`, setting `dim=64` and `dim_mults=(1, 2, 4)` should give good results.
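As a concrete starting point, something along these lines (a sketch only, with a dummy batch standing in for real data):

```python
import torch
from imagen_pytorch import Unet, Imagen, ImagenTrainer

unet = Unet(dim = 64, dim_mults = (1, 2, 4))

imagen = Imagen(
    unets = (unet,),
    image_sizes = (64,)
)

trainer = ImagenTrainer(imagen)

# one training step on a dummy batch; replace with your own 64x64 frames and captions
images = torch.randn(8, 3, 64, 64)
loss = trainer(images, texts = ['an echo frame'] * 8, unet_number = 1)
trainer.update(unet_number = 1)
```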
For video, I use the `ElucidatedImagen`. Try leaving all parameters at their defaults (especially `ignore_time=False`) and you should get results after a few hours of compute on a GPU cluster.
oh oops! ok, i'll get this fixed next week!
you should only have to pass in the `cond_images` (should be renamed `cond_images_or_video`) of what the super resoluting net receives, and it should automatically temporally downsample for the base unet during training
@HReynaud say you have two unets, and the second unet temporally upsamples 2x; i'll probably make it error out if the number of frames on the conditioning video is any less than 2. is that appropriate, you think? or should i also allow a single conditioning frame across all unets
@HReynaud actually, i'll just allow for that behind a feature flag, something like `can_condition_on_single_frame_across_unets` (long name just to be clear)
Hi @lucidrains, For the image conditioning, I am using a single frame repeated as many times as necessary on the time dimension, so I have not practically encountered the case you mention. I guess if I were to use a video as conditioning the problem you state could arise and your solution seems sound.
If you have some time, I have been making a few small corrections / edits to the code on my fork. These are minimal edits which should be easy to track.
https://github.com/lucidrains/imagen-pytorch/compare/main...HReynaud:imagen-pytorch:main
I have not pushed my latest edits that correct a few bugs when using more than 2 Unets with temporal super-resolution. I'll push the commit tomorrow.
I considered making a pull request but my edits are really focused on what I am targeting and would not be general enough.
I’ll continue using your repo and will let you know if I encounter any more bugs.
@HReynaud ohh yes, there's actually two scenarios at play here
either you want to condition on an image, in which case you repeat across time and concat along the channel dimension
but the other type of conditioning would be say conditioning on a few preceding frames (much like prompting in GPT), in which case we would want to do the temporal down/upsampling and concatenate across time
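Something like the following shape-level illustration of the two schemes (pure tensor bookkeeping, not library code; all sizes are made up):

```python
import torch
from einops import repeat

video       = torch.randn(1, 3, 16, 64, 64)    # (batch, channels, frames, height, width)
cond_frame  = torch.randn(1, 3, 64, 64)        # a single conditioning image
prompt_clip = torch.randn(1, 3, 4, 64, 64)     # a few preceding frames, GPT-prompt style

# scenario 1: image conditioning - repeat across time, then concat along the channel dimension
cond = repeat(cond_frame, 'b c h w -> b c f h w', f = video.shape[2])
channel_conditioned = torch.cat((video, cond), dim = 1)      # (1, 6, 16, 64, 64)

# scenario 2: conditioning on preceding frames - concatenate along the time dimension
time_conditioned = torch.cat((prompt_clip, video), dim = 2)  # (1, 3, 20, 64, 64)
```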
but the latter can be achieved through inpainting too
ok, this is all very confusing, but i'll definitely take a look at this monday and get it squared away by end of next week!
@HReynaud hey Hadrien! i believe your issue should be resolved in the latest version! (do let me know if it hasn't)
i'll keep working on the conditioning across the time dimension, as that will allow one to generate arbitrarily long videos akin to phenaki
Hi @lucidrains, thanks for your quick response! The commit looks great. I would just like to add that in `resize_video_to`, there is a check that prevents temporal upsampling when the spatial dimensions are untouched, i.e. `orig_video_size == target_image_size` and `target_frames is not None`.
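For reference, a rough sketch of the guard being described; this is not the library's actual implementation, just the shape of the fix:

```python
import torch
import torch.nn.functional as F

def resize_video_to(video, target_image_size, target_frames = None):
    # video: (batch, channels, frames, height, width)
    *_, frames, _, orig_video_size = video.shape

    if target_frames is None:
        target_frames = frames

    # skip resizing only when BOTH the spatial size and the frame count already match;
    # checking the spatial size alone would silently drop a purely temporal upsample
    if orig_video_size == target_image_size and target_frames == frames:
        return video

    return F.interpolate(
        video,
        size = (target_frames, target_image_size, target_image_size),
        mode = 'nearest'
    )
```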
@HReynaud yes indeed! thank you! https://github.com/lucidrains/imagen-pytorch/commit/3c24c60904e37fc0cf572f54501f0c2c3513ffd9
Hi @HReynaud, thank you so much for your advice! I have been trying to train a simpler text-to-image model as you suggested, but upon sampling after 50 epochs of training I'm still just getting a black square :/
Could you kindly let me know how large your dataset size was and how many epochs / training steps you needed before seeing some decent samples?
Hi @alif-munim, try this script to get started: example
Thanks so much @HReynaud! Could you kindly let me know why you used `trainer.train_step()` and `trainer.valid_step()` over `trainer.update()`? Is there a difference?
Hi @alif-munim, lucid could probably explain this more in depth than me, but to put it simply, `trainer.update()` only does the backpropagation operation, i.e. `tensor.backward()` in PyTorch.

`trainer.train_step()` first runs the forward process and then automatically calls `trainer.update()` to train the model. If you don't run the forward step first, the model has no gradients to backpropagate through.
Thanks once again @HReynaud, I had been stuck on this issue for a while now and you've helped tremendously! I believe the issue was that I was only ever using `trainer.update()`, so the model did not learn to generate anything but noise.
@lucidrains, I think it would be a great idea to have @HReynaud's example script somewhere in the documentation for beginners like me :)
Hi,
Is there a way with the current library to do temporal super-resolution? From what I can see, only spatial super-resolution is currently possible. I would like to have one UNet that generates videos with dimensions 64x64x16 and a super-resolution UNet that would upsample them to 128x128x32. Please let me know if I am missing something!