lucidrains / phenaki-pytorch

Implementation of Phenaki Video, which uses MaskGIT to produce text-guided videos of up to 2 minutes in length, in PyTorch

Running out of CUDA/GPU memory #14

Open gmegh opened 1 year ago

gmegh commented 1 year ago

I have a GPU with 15 GB of memory, and it runs out of space when I try to train the network on 50 videos at a time. Do you think it would be better to compute the loss and backpropagate video by video, instead of on all the videos at once?
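One common workaround for this (not specific to phenaki-pytorch) is gradient accumulation: run the forward and backward pass one video, or one small chunk, at a time, and only step the optimizer once all gradients have accumulated. A minimal sketch, where `model`, `optimizer`, and `videos` are assumed placeholders rather than names from this repo:

```python
import torch

# assumptions: `model` returns a scalar loss for a batch of videos,
# `videos` is a (50, channels, frames, height, width) tensor, and
# `optimizer` is an already-constructed torch optimizer
accum_steps = videos.shape[0]

optimizer.zero_grad()
for video in videos.split(1, dim = 0):
    loss = model(video)                # forward pass on a single video
    (loss / accum_steps).backward()    # scale so gradients match the full-batch mean
optimizer.step()
```

Peak memory then scales with one video's activations instead of fifty, at the cost of a slower step.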

gmegh commented 1 year ago

Additionally, when training on 20 videos and text prompts, the model output is still just noise, which I think is the expected result given the lack of training, right?

lucidrains commented 1 year ago

@gmegh yea, training on video won't be a cakewalk

also, before the wip flag is removed, the network is still very alpha

i plan on making the network agnostic to image or video training, starting with images first. realistically, for this to be trained successfully outside of google, it would need to be pretrained on images

gmegh commented 1 year ago

Yes, that makes sense. Let me know if I can help. Do you know when you are planning on having the agnostic feature ready?

I did create some short functions to be able to use .mp4 files instead of just GIFs, and to save the output tensors to .mp4 as well. Let me know if you would like me to add them in a PR
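For anyone following along, a sketch of what saving a video tensor to .mp4 with cv2 could look like; the helper name and the assumed (frames, height, width, 3) RGB uint8 layout are illustrative and may differ from the PR:

```python
import cv2
import numpy as np

def tensor_to_mp4(video, path, fps = 24):
    # video: array-like of shape (frames, height, width, 3), RGB, uint8
    frames = np.asarray(video, dtype = np.uint8)
    height, width = frames.shape[1:3]
    writer = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*'mp4v'), fps, (width, height))
    for frame in frames:
        writer.write(cv2.cvtColor(frame, cv2.COLOR_RGB2BGR))  # cv2 expects BGR frames
    writer.release()
```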

lucidrains commented 1 year ago

@gmegh so i have to add 3d continuous relative positional bias to the maskgit embedding to allow for generalization to different sizes. i think i should be able to get it done by tomorrow evening

re: mp4 - yes! that would be super helpful!
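For context on the positional bias mentioned above: a continuous relative positional bias is typically a small MLP that maps relative (frame, height, width) offsets to a per-head attention bias, which is what lets the network generalize to unseen resolutions and video lengths. A sketch of the general idea, with the class name and MLP sizes as assumptions (the eventual implementation in this repo may differ):

```python
import torch
from torch import nn

class ContinuousPositionBias3D(nn.Module):
    def __init__(self, heads, hidden_dim = 64):
        super().__init__()
        # an MLP over continuous offsets is defined for any size, unlike a learned table
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, heads)
        )

    def forward(self, frames, height, width, device):
        # coordinates of every (frame, row, col) position, flattened to (n, 3)
        grid = torch.stack(torch.meshgrid(
            torch.arange(frames, device = device),
            torch.arange(height, device = device),
            torch.arange(width, device = device),
            indexing = 'ij'
        ), dim = -1).view(-1, 3).float()
        rel = grid[:, None, :] - grid[None, :, :]   # (n, n, 3) pairwise offsets
        bias = self.mlp(rel)                        # (n, n, heads)
        return bias.permute(2, 0, 1)                # (heads, n, n), added to attention logits
```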

gmegh commented 1 year ago

Great! I will create a PR.

Also for reference, these guys are also working on implementing it: https://github.com/LAION-AI/phenaki

I think another nice to-do would be to allow saving the trained model and loading it back later
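Until dedicated checkpointing lands, the standard PyTorch pattern should work; `phenaki` below stands for whatever top-level module gets constructed:

```python
import torch

# save
torch.save(phenaki.state_dict(), './phenaki.pt')

# load: rebuild the model with identical hyperparameters first, then
phenaki.load_state_dict(torch.load('./phenaki.pt'))
```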

lucidrains commented 1 year ago

@gmegh yup, i've been chatting with Dominic

they are planning on straying a bit farther from the paper's implementation (for example, using all convolutions in the cvivit)

but this is a joint effort; anything i develop here they are free to use

lucidrains commented 1 year ago

@gmegh yea, i'll definitely get to the training code soon, once i add a few more bells and whistles to the attention networks

gmegh commented 1 year ago

Awesome! Happy to help if you want.

lucidrains commented 1 year ago

@gmegh yea definitely welcome any help!

do you know of any good packages for processing and loading video data?

gmegh commented 1 year ago

@lucidrains Yes! I think cv2 is a good package. I made some quick functions with it that I have added to the new PR. The crop_image() function should probably be edited further
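For reference, reading an .mp4 into a training tensor with cv2, including a naive center crop, might look roughly like this (the PR's `crop_image()` may well behave differently):

```python
import cv2
import numpy as np
import torch

def crop_image(frame, size):
    # naive center crop; likely needs resizing and aspect-ratio handling on top
    h, w = frame.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return frame[top:top + size, left:left + size]

def mp4_to_tensor(path, size = 256):
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # cv2 decodes to BGR
        frames.append(crop_image(frame, size))
    cap.release()
    # (frames, height, width, 3) -> (3, frames, height, width), float in [0, 1]
    return torch.from_numpy(np.stack(frames)).permute(3, 0, 1, 2).float() / 255.
```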

gmegh commented 1 year ago

What is the status of the code right now? I think the checkboxes in the readme are outdated, right?

lucidrains commented 1 year ago

@gmegh the code will be in a very good place by the end of the week, and by end of next week, all the training code will be there

lucidrains commented 1 year ago

@gmegh usually there is some back and forth and whittling away at bugs for about a month or so after i remove the wip, but that's usually a fast process as i like to iterate quickly

lucidrains commented 1 year ago

@gmegh for training on my end, i plan to get it to a place where the framework can produce unconditional (or text conditioned) images by end of the week

that part i know very well from my other works

lucidrains commented 1 year ago

@gmegh feel free to experiment in the mean time!

gmegh commented 1 year ago

Hi @lucidrains ! Is the framework that can produce unconditional (or text-conditioned) images ready? I am experimenting with the current version, and I need a way to train in batches, because using 500 videos at a time already fills up my CUDA memory. Any idea how to go about this?
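One way to keep memory bounded, independent of whatever training code eventually ships with the repo, is a standard `Dataset`/`DataLoader` that yields small batches and moves only the current batch onto the GPU (all names below are placeholders):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class VideoDataset(Dataset):
    # assumes each clip was preprocessed and saved as its own tensor file
    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        return torch.load(self.paths[idx])  # (3, frames, height, width)

loader = DataLoader(VideoDataset(paths), batch_size = 4, shuffle = True)

for videos in loader:
    loss = model(videos.cuda())  # only this small batch lives on the GPU
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```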

cyrilzakka commented 1 year ago

> @gmegh yea definitely welcome any help!
>
> do you know of any good packages for processing and loading video data?

@lucidrains I could take care of this. Any preferences as to whether you'd like to break down each video into frames, or sample from a video directly?
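If sampling directly ends up preferable, cv2 can seek to individual frame indices without decoding the whole file, roughly like so (seek accuracy varies by codec):

```python
import cv2

def sample_frames(path, indices):
    # decode only the requested frames by seeking, instead of reading every frame
    cap = cv2.VideoCapture(path)
    frames = []
    for i in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, i)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames
```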