text2video - Comparing Notes

gowdygamble commented 1 year ago

First, a big thanks to @lucidrains for all the incredible work on this and other projects!

I’m working on training text2video models using this repo + some modifications and I wanted to check with other people to share preliminary results, methods, and tips. This isn’t an issue exactly, but seemed like a reasonable place to reach people actively focused on the same problem - happy to relocate the discussion to a different channel.

My Setup:

training locally with 2 RTX 3090s
So far focusing exclusively on the initial video diffusion model, will deal with upsampling if I ever get something compelling
encoding dim 128, with standard dim_mults [1,2,4,8]. This gives ~150M params (the lowest the paper reports results on is 500M but they go up to 5B which is beyond my capacity atm)
using Webvid 10M + pre-encoded captions using T5 Large (imagen-video paper says T5-XXL encodings were important for getting their best results, but I’ll deal with that if it ever becomes the blocker…)
I have ~200M of Laion-400M downloaded and am working on pre-encoding captions but I’m running into disk space issues, more on that below.

Training Results:

loss falls quickly and levels off. Not a great metric given the open-ended nature of the learning task. Eventually, if I ever start getting interesting results, I’ll implement the generative quality metrics they report in the papers, FVD, CLIP, etc.
periodically throwing off validation frames for a fixed set of prompts. the figures below are just the first frame from each prompt’s 10-frame sequence (the other 9 frames are basically just the same).

here’s a gif up to ~6million training samples seen train_64_6M_resize

and here are some frames from near the beginning and near 10M (or a single epoch of webvid)

970 980 990 1000

Im actually encouraged by these results, despite the fact that they're all just cloudy mush! The model is learning something, seems to be getting a handle on shapes and textures from the world of short internet video clips. A few notes/questions:

when to expect basic color-text associations: admittedly Im still extremely early in training, but I keep wondering when the model will start to put a white-ish blob in the “white house” prompt, or red/green splotches in the “red and green fireworks” prompt. After a full pass through webvid Im sure its seen these associations (text “red” -> red pixels) several times, but maybe not enough to cement them in over noise
high variability: the output varies significantly even from batch to batch. I’d expect this early in training (which again, tbf, I still very much am), but I kinda think for a fixed prompt the model would have converged onto a ‘best guess’ region by now and would be sort of drifting from there. But it seems to just be random. I'm using the default Adam params from this repo, which seem to match the paper. Haven’t checked any gradients yet for trouble

Data I/O Bottleneck

my main bottleneck is reading the dataset from disk during training. Using a homebrewed multiprocess dataloader scheme I can hit ~100 samples/s (using batch_size 64 split across the 2 gpus). Pushing the limits here has been the biggest area of personal learning for me on this project. I’ve got a local-network grpc server/client system running, haven’t yet integrated it into training, but I think it might allow me to hit the GPU bottleneck by just keeping a queue of batches fully loaded from as many spare PCs as I have. One minor issue here: I’m going to run into differential sampling rates: i.e. some portions of the dataset are read into the training pipeline much faster so over time they’ll be seen more often. I think I can correct for this but frankly for now I’m just noting it and I’ll deal with it if I start getting better results
looking at the original video diffusion models paper, supplement A, for the “small 16x64x64” model (Im doing 64x64 but only 10 frames), they report 200,000 training steps with a batch size of 128. This is probably the per TPU batch size, but lets just say its the total batch size (given that they use 64 TPUs, that would be a per TPU batch size of 2 which seems definitely wrong but bear with me) - so thats 25M samples seen (realistically thats probably per device, yikes!). They do still image training by appending frames to the end of video samples, Im planning on doing them as separate samples but lets just ignore that for now. So thats like 3 days at my current throughput to hit 25M, not bad but probably not fast enough to actually iterate on model details and get something useable, especially not when you consider Ive got to also train all the upsamplers!
Basically just accepting the fact that if I want to do this seriously I have to take into the cloud. I actually dont think it would be crazy expensive if I knew exactly what I wanted to run and I could just do it once or twice, but the issue is experimentation of course, so I really want to nail down my pipeline locally. I’m also playing around with feeding cloud accelerators from local GRPC servers, but I dont think itll be fast enough.

ok, got a bit long.

TLDR:

just train more
probably not feasible to do locally. Debating a move to AWS but not sure its worth the $ for what I'm currently approaching as hobby project.
I'd love to hear about anyone else's experience with text2video training!

TheFusion21 commented 1 year ago

First of all great work looks very promising.

I/O Bottleneck. How does you're dataloader look like? Do you use workers and prefetching? It helped me a lot when the data is on spinning drives.

FVD Clip would be awesome as the loss is pretty much useless.

as @lucidrains mentioned in other issues. It is recommended to get in contact with people from Stability.ai as they have more than capable power to get this going.

Im currently just doing text2image with my own dataset. Don't know much about text2video

gowdygamble commented 1 year ago

Thanks!

My dataloader approach:

Independently from the training script, I kick off a set of processes which read video/caption files from disk and add processed samples to a queue.
From the training job I create an iterable dataset that reads from the queue using grpc (or socket) or waits and retries if its empty, then I consume this dataset in a dataloader as normal
I separated these since accelerate seems to just run the whole training script twice, which would then duplicate the whole multiprocess data queue system. Now accelerate just duplicates the client/queue-reading dataset, which is what I want.
currently the GPUs outpace the data-load workers so the GPUs idle between batches, but the queue continues to fill during validation inference, saving, etc.
I tried the standard dataloader num_workers and prefetch factor, but these paradoxically caused a slowdown. After some reading, this has been reported by others, and likely relates to I/O waiting. I have a decent NVMe SSD that holds a big chunk of webvid, but I had to put the rest + captions onto some HDDs. Some simple experiment suggest that on the NVMe alone adding workers does speed up I/O to a point, but as soon as you add in the HDDs, more workers doesnt seem to help (at least in my naive approach)
mentioned above, but Im close to implementing a local-network based grpc system that can feed the queue from an arbitrary number of workers, hopefully pushing the bottleneck back into the GPUs.

I've heard that Stability is working on text2video already and I've got to believe that they have more experienced folks than me working on it. I know they do sponsorships and stuff, and I'm not opposed to reaching out to them at some point.

Sounds like a cool text2image project with pokemon - good luck man!

TheFusion21 commented 1 year ago

A few more thought I had.

How compressed are you image/video files and which file format? @lucidrains suggested memorymapping for the data at least for the embeds maybe it is even useful to do so with the images/video.

gowdygamble commented 1 year ago

Not compressed at all currently, just mp4s and fp16 saved arrays for the captions. I read with cv2, skip to a random start frame, and then read every Nth frame (currently just reading every other, but eventually want to push this out once video dynamics become the blocker). Tried saving individual frames which did give a speedup but required like 10x space and would have meant buying more big drives.

Played around with WebDataset (still using it for Laion 400M), but couldn't get any speedup from initial experiments.

Messed around with this too: https://pytorch.org/vision/main/auto_examples/plot_video_api.html but ended up being a bit of a rabbit hole.

Mem mapping is great idea, I'll take a look. Never really used it before. Thanks for the suggestion.

TheFusion21 commented 1 year ago

I will work in the next couple of days on memory mapping the data optionally before training probably even the image data if it's not to much overhead

cyrilzakka commented 1 year ago

@bigbundus I'm actually having issues with this implementation failing to converge with unconditional video generation. Have you had any luck getting realistic looking videos using text-guidance? If so, would you mind sharing your code?

jaded0 commented 1 year ago

I'm using the word-level american sign language dataset of about 12k videos (WLASL)
On 8 A100s from my university. (80gb of memory each)
unet1_dim = 128
unet2_dim = 256
downsample_factor = 8 (I think this takes evenly-spaced frames from the video, separated by 8 frames not taken)
num_frames = 64 (this is the number for frames chose per video)
image_sizes = (16, 32), I tried 64, but it's dreadfully slow.
batch_size=8
My dataloader starts at a random frame, and returns the next num_frames frames.
No transformations.

So far, I'm running on just the 16px images so that I can optimize the training speed for higher resolutions later. I have gotten really good results on 16px images in the past, but that occurs after three straight days of training. 64px, no downsample took three months until I stopped it. By the end, the animated videos had some recognizable faces and arm movements, but still looked like a sandstorm. I found that I can't seem to push GPU utilization past 40%. When I use larger batch sizes it pulls GPU memory up, but gpu 0 is something like double the others, and the overall training speed lowers by a lot. I very well may have a slow dataloader. At no point before passing my Dataset into dataloader do I load anything manually onto cuda. I haven't checked whether that's slowing anything down.

Here's a look at current result where I'm comparing different hyperparameters: https://wandb.ai/jadens_team/vid-signs?workspace=user-jaden-lorenc rn it's looking like a lower downsample_factor is the direction.

Should I try training on images alone?
Maybe I'd make better use of my resources with a larger unet? I've seen that both slow down and speed up training on other projects.
I'm honestly really confused that larger batch size seems to slow down the training, I though the opposite was generally true.

alif-munim commented 1 year ago

Hey everyone! I've just gotten my training loop running on a medical video dataset (downsampled to 64x64), and it's been going for around 3 days now (over 100k steps, and a dataset of ~1000 videos). It seems like it's learning something, but the results aren't all that great:

imagenvideo_sample-0000117500

I was wondering how many training steps it took for you all until you got good results. Any insight would be greatly appreciated!

Ytimed2020 commented 1 year ago

@alif-munim Hello, I am currently working on a project similar to yours. May I know how your situation is now

lucidrains / imagen-pytorch

text2video - Comparing Notes #275