lucidrains / imagen-pytorch

Implementation of Imagen, Google's Text-to-Image Neural Network, in Pytorch
MIT License

text2video - Comparing Notes #275

Open gowdygamble opened 1 year ago

gowdygamble commented 1 year ago

First, a big thanks to @lucidrains for all the incredible work on this and other projects!

I’m working on training text2video models using this repo + some modifications and I wanted to check with other people to share preliminary results, methods, and tips. This isn’t an issue exactly, but seemed like a reasonable place to reach people actively focused on the same problem - happy to relocate the discussion to a different channel.

My Setup:

Training Results:

Here's a gif up to ~6 million training samples seen: [gif: train_64_6M_resize]

And here are some frames from near the beginning and near 10M (or a single epoch of WebVid):

[frames 0, 10, 20, 30]

[frames 970, 980, 990, 1000]

I'm actually encouraged by these results, despite the fact that they're all just cloudy mush! The model is learning something; it seems to be getting a handle on shapes and textures from the world of short internet video clips. A few notes/questions:

Data I/O Bottleneck


Ok, this got a bit long.

TLDR:

TheFusion21 commented 1 year ago

First of all, great work, this looks very promising.

I/O bottleneck: what does your dataloader look like? Do you use workers and prefetching? That helped me a lot when the data is on spinning drives.
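
For reference, this is roughly what I mean; just a sketch with a dummy dataset standing in for the real video decode, and all the numbers are placeholders to tune:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class DummyClipDataset(Dataset):
    """Stand-in for a real video dataset; returns random (frames, text-embed) pairs."""
    def __len__(self):
        return 10_000
    def __getitem__(self, idx):
        frames = torch.rand(3, 16, 64, 64)   # (channels, frames, height, width)
        embeds = torch.rand(256, 1024)       # placeholder caption embedding
        return frames, embeds

loader = DataLoader(
    DummyClipDataset(),
    batch_size=16,
    shuffle=True,
    num_workers=8,            # parallel decode workers; tune to your CPU cores
    prefetch_factor=4,        # batches each worker keeps ready ahead of the GPU
    pin_memory=True,          # page-locked buffers speed up host-to-GPU copies
    persistent_workers=True,  # keep workers alive between epochs
)
```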

An FVD/CLIP metric would be awesome, as the loss is pretty much useless.
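
FVD needs a pretrained video feature network on top, so as a cheaper first step, a CLIP score between sampled frames and the caption already tells you more than the loss. A rough sketch using the Hugging Face CLIP model (the model choice and frame sampling here are just my assumptions):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(frames: list, caption: str) -> float:
    """Mean cosine similarity between the caption and a handful of sampled frames (PIL images)."""
    inputs = processor(text=[caption], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).mean().item()
```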

As @lucidrains mentioned in other issues, it is recommended to get in contact with people from Stability.ai, as they have more than enough compute to get this going.

I'm currently just doing text2image with my own dataset, so I don't know much about text2video.

gowdygamble commented 1 year ago

Thanks!

My dataloader approach:

I've heard that Stability is working on text2video already and I've got to believe that they have more experienced folks than me working on it. I know they do sponsorships and stuff, and I'm not opposed to reaching out to them at some point.

Sounds like a cool text2image project with pokemon - good luck man!

TheFusion21 commented 1 year ago

A few more thoughts I had.

How compressed are your image/video files, and which file format are you using? @lucidrains suggested memory-mapping the data, at least for the embeds; maybe it is even useful to do so for the images/video.

gowdygamble commented 1 year ago

Not compressed at all currently, just mp4s and fp16 saved arrays for the caption embeddings. I read with cv2, skip to a random start frame, and then read every Nth frame (currently just every other frame, but eventually I want to push this out once video dynamics become the blocker). I tried saving individual frames, which did give a speedup, but it required about 10x the space and would have meant buying more big drives.
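
For concreteness, the sampling looks roughly like this (a sketch; the clip length and stride are the knobs I'm still tuning):

```python
import cv2
import numpy as np

def sample_clip(path: str, num_frames: int = 16, frame_stride: int = 2) -> np.ndarray:
    """Grab `num_frames` frames from a random start, keeping every `frame_stride`-th frame."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    span = num_frames * frame_stride
    start = np.random.randint(0, max(1, total - span))
    cap.set(cv2.CAP_PROP_POS_FRAMES, start)   # skip to a random start frame

    frames = []
    for i in range(span):
        ok, frame = cap.read()
        if not ok:
            break
        if i % frame_stride == 0:             # keep every Nth frame (every other by default)
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)                   # (frames, height, width, 3)
```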

I played around with WebDataset (still using it for LAION-400M), but couldn't get any speedup in initial experiments.

I messed around with this too: https://pytorch.org/vision/main/auto_examples/plot_video_api.html, but it ended up being a bit of a rabbit hole.

Memory mapping is a great idea, I'll take a look. I've never really used it before. Thanks for the suggestion.

TheFusion21 commented 1 year ago

Over the next couple of days I will work on optionally memory-mapping the data before training, probably even the image data if it's not too much overhead.
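
Roughly the shape I have in mind for the embeds; just a sketch, where the file name, shapes, and dtype are placeholders:

```python
import numpy as np

# One-off conversion: pack all caption embeddings into a single memmapped file.
# Shapes and dtype are placeholders (e.g. up to 256 tokens of 1024-dim embeds).
n_samples, max_tokens, embed_dim = 1_000, 256, 1024
embeds = np.memmap("text_embeds.mmap", dtype=np.float16, mode="w+",
                   shape=(n_samples, max_tokens, embed_dim))
for i in range(n_samples):
    embeds[i] = 0.0  # placeholder: write the real embedding for sample i here
embeds.flush()

# In the Dataset, reopen read-only: rows are paged in lazily, so __getitem__
# only touches the bytes it needs instead of opening one file per sample.
embeds = np.memmap("text_embeds.mmap", dtype=np.float16, mode="r",
                   shape=(n_samples, max_tokens, embed_dim))
sample = np.array(embeds[42])  # copy a single row out of the map
```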

cyrilzakka commented 1 year ago

@bigbundus I'm actually having issues with this implementation failing to converge on unconditional video generation. Have you had any luck getting realistic-looking videos using text guidance? If so, would you mind sharing your code?

jaded0 commented 1 year ago

So far, I'm running on just 16px images so that I can optimize training speed before moving to higher resolutions later. I have gotten really good results on 16px images in the past, but only after three straight days of training. 64px with no downsampling took three months before I stopped it; by the end, the generated videos had some recognizable faces and arm movements, but still looked like a sandstorm.

I found that I can't seem to push GPU utilization past 40%. Larger batch sizes pull GPU memory up, but GPU 0 uses something like double the memory of the others, and the overall training speed drops by a lot. I may very well have a slow dataloader. At no point before passing my Dataset into the DataLoader do I load anything manually onto CUDA, and I haven't checked whether that's slowing anything down.
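
One cheap way I could check whether the dataloader is actually the bottleneck is to time the wait for each batch against the step itself; a rough sketch, where `loader` and `step_fn` stand in for my existing loop:

```python
import time
import torch

def profile_loader(loader, step_fn, device=torch.device("cuda")):
    """Rough split of wall time into 'waiting on data' vs 'doing the training step'."""
    data_time, step_time = 0.0, 0.0
    t = time.perf_counter()
    for frames, embeds in loader:
        data_time += time.perf_counter() - t            # time blocked on the dataloader

        t = time.perf_counter()
        frames = frames.to(device, non_blocking=True)   # overlaps the copy if pin_memory=True
        embeds = embeds.to(device, non_blocking=True)
        step_fn(frames, embeds)                         # the existing training step
        if device.type == "cuda":
            torch.cuda.synchronize()                    # so async GPU work shows up in the timer
        step_time += time.perf_counter() - t
        t = time.perf_counter()
    print(f"data wait: {data_time:.1f}s, compute: {step_time:.1f}s")
```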

Here's a look at current results, where I'm comparing different hyperparameters: https://wandb.ai/jadens_team/vid-signs?workspace=user-jaden-lorenc. Right now it's looking like a lower downsample_factor is the way to go.

Should I try training on images alone?
Maybe I'd make better use of my resources with a larger unet? I've seen that both slow down and speed up training on other projects.
I'm honestly really confused that a larger batch size seems to slow down training; I thought the opposite was generally true.

alif-munim commented 1 year ago

Hey everyone! I've just gotten my training loop running on a medical video dataset (downsampled to 64x64), and it's been going for around 3 days now (over 100k steps on a dataset of ~1000 videos). It seems like it's learning something, but the results aren't all that great:

[sample: imagenvideo_sample-0000117500]

I was wondering how many training steps it took for you all until you got good results. Any insight would be greatly appreciated!

Ytimed2020 commented 1 year ago

@alif-munim Hello, I am currently working on a project similar to yours. May I know how your situation is now?