smvorwerk / text2video-finetuning

MIT License
1 stars 1 forks source link

Text-To-Video-Finetuning

Finetune ModelScope's Text To Video model using Diffusers 🧨

output.webm

Updates

Getting Started

Requirements & Installation

git clone https://github.com/ExponentialML/Text-To-Video-Finetuning.git
cd Text-To-Video-Finetuning
git lfs install
git clone https://huggingface.co/damo-vilab/text-to-video-ms-1.7b ./models/model_scope_diffusers/

Create Conda Environment (Optional)

It is recommended to install Anaconda.

Windows Installation: https://docs.anaconda.com/anaconda/install/windows/

Linux Installation: https://docs.anaconda.com/anaconda/install/linux/

conda create -n text2video-finetune python=3.10
conda activate text2video-finetune

Python Requirements

pip install -r requirements.txt

Hardware

All code was tested on Python 3.10.9 & Torch version 1.13.1 & 2.0.

It is highly recommended to install >= Torch 2.0. This way, you don't have to install Xformers or worry about memory performance.

If you don't have Xformers enabled, you can follow the instructions here: https://github.com/facebookresearch/xformers

Recommended to use a RTX 3090, but you should be able to train on GPUs with <= 16GB ram with:

Preprocessing your data

Using Captions

You can use caption files when training on images or video. Simply place them into a folder like so:

Images: /images/img.png /images/img.txt Videos: /videos/vid.mp4 | /videos/vid.txt

Then in your config, make sure to have -folder enabled, along with the root directory containing the files.

Process Automatically

You can automatically caption the videos using the Video-BLIP2-Preprocessor Script

Configuration

The configuration uses a YAML config borrowed from Tune-A-Video reposotories.

All configuration details are placed in configs/v2/train_config.yaml. Each parameter has a definition for what it does.

How would you recommend I proceed with making a config with my data?

I highly recommend (I did this myself) going to configs/v2/train_config.yaml. Then make a copy of it and name it whatever you wish my_train.yaml.

Then, follow each line and configure it for your specific use case.

The instructions should be clear enough to get you up and running with your dataset, but feel free to ask any questions in the discussion board.

Finetune.

python train.py --config ./configs/v2/train_config.yaml

Training Results

With a lot of data, you can expect training results to show at roughly 2500 steps at a constant learning rate of 5e-6.

When finetuning on a single video, you should see results in half as many steps.

After training, you should see your results in your output directory.

By default, it should be placed at the script root under ./outputs/train_<date>

From my testing, I recommend:

Developing

Please feel free to open a pull request if you have a feature implementation or suggesstion! I welcome all contributions.

I've tried to make the code fairly modular so you can hack away, see how the code works, and what the implementations do.

Deprecation

If you want to use the V1 repository, you can use the branch here.

Shoutouts