hpcaitech / Open-Sora

Open-Sora: Democratizing Efficient Video Production for All
https://hpcaitech.github.io/Open-Sora/
Apache License 2.0

opensora-v1.1 finetune produces abnormal video output #388

Closed: buxianggaimingzi closed this issue 4 months ago

buxianggaimingzi commented 4 months ago

Describe the bug

I tried using the Inter4K dataset to understand the data processing and training pipeline of opensora-v1.1. Following the documentation, I generated the CSV file for the Inter4K data and ran finetuning as described in the docs: https://github.com/hpcaitech/Open-Sora/blob/09e53db185efa76a658ce1e62523cba94bf7ab9b/README.md?plain=1#L301-L302 The base model is OpenSora-STDiT-v2-stage3, but I don't know what's going on; the video generated by the finetuned model is very poor, and it looks like there are some bugs.

https://github.com/hpcaitech/Open-Sora/assets/135929703/7dc09c36-092c-4ea4-864f-8e2a9a81a116

I only trained for 1k steps, so I don't know whether that is the cause. But since Inter4K is part of your training data, I assumed that finetuning on Inter4K should not produce such abnormal results, even with very few training steps?

Reproduction

By the way, there seem to be some problems in the CSV-generation process. When I run the caption_llava script, the generated CSV only contains path, text, and num_frames, so I manually merged height, width, and the other fields into the final CSV. You can ignore the CSV generation; I post the first few rows of the CSV here:

path,text,num_frames,height,width,aspect_ratio,fps,resolution,aes
/path/to/Inter4K/clips/99_scene-2.mp4,"a vibrant and dynamic performance of a band on stage. the stage is illuminated with bright blue and red lights, creating a striking contrast against the black background. the band members are actively engaged in their performance, with one member playing the drums, another on the bass, and a third on the keyboard. the stage is adorned with a large, intricate metal structure that adds to the visual appeal of the performance. the band members are dressed in colorful outfits, further enhancing the lively atmosphere of the concert.",60,720,1280,0.5625,30.0,921600,5.195915222167969
/path/to/Inter4K/clips/749_scene-0.mp4,"a close-up shot of a pasta maker in action. the pasta maker is silver and has a cylindrical shape. it is filled with white dough, which is being extruded through a small opening at the bottom. the dough is being cut into long, thin strands, which are being deposited onto a plate. the plate is round and has a silver rim. the background is black, which contrasts with the white dough and silver objects. the style of the video is realistic and it appears to be a still image rather than a moving one.",60,720,1280,0.5625,30.0,921600,5.536831855773926
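
For reference, a minimal pandas sketch of the manual merge described above, assuming the caption CSV from caption_llava and a separate per-clip metadata CSV that both contain a path column. Both input file names are hypothetical; the output name matches the CSV used in the training command below.

import pandas as pd

# Caption CSV produced by caption_llava: path, text, num_frames
captions = pd.read_csv("meta_clips_caption_cleaned.csv")
# Hypothetical metadata CSV: path, height, width, aspect_ratio, fps, resolution, aes
info = pd.read_csv("meta_clips_info.csv")

# Join on the clip path and write the final training CSV
merged = captions.merge(info, on="path", how="inner")
merged.to_csv("meta_clips_caption_cleaned_final.csv", index=False)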

All commands are run from inside ./scripts. Training:

torchrun --standalone --nproc_per_node 4 train.py ../configs/opensora-v1-1/train/stage3.py --ckpt-path /path/to/OpenSora-STDiT-v2-stage3/ --data-path /path/to/Inter4K/meta_clips_caption_cleaned_final.csv

Inference:

python inference-long.py ../configs/opensora-v1-1/inference/sample.py --ckpt-path ./outputs/007-STDiT2-XL-2/epoch6-global_step1000/model/ --num-frames 32 --image-size 240 426 --sample-name image-inter-fintune0510-wave-240-426 --prompt '{"reference_path": "../assets/images/condition/wave.png","mask_strategy": "0"}'
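
As a side note, a small sketch showing how the --prompt JSON string above could be built with json.dumps instead of typing it by hand, which avoids shell-quoting mistakes; the keys reference_path and mask_strategy are taken directly from the command above:

import json

# Build the image-conditioning prompt passed to inference-long.py above
prompt = json.dumps({
    "reference_path": "../assets/images/condition/wave.png",
    "mask_strategy": "0",
})
print(prompt)  # pass this string as the value of --prompt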

Logs

There are a lot of inference logs and no errors are reported, so the logs are probably not helpful. I can provide log snippets if you need them.

System Info

There are a lot of packages and a lot of irrelevant information, so I only post the versions of the key libraries:

accelerate 0.21.0
apex 0.1
clip 1.0
colossalai 0.3.7
einops 0.6.1
mmengine 0.10.4
oauthlib 3.2.2
opencv-python 4.9.0.80
pandas 2.2.2
PyYAML 6.0.1
safetensors 0.4.3
scenedetect 0.6.3
scikit-learn 1.2.2
scipy 1.13.0
setproctitle 1.3.3
setuptools 68.2.2
six 1.16.0
tensorboard 2.16.2
tensorboard-data-server 0.7.2
timm 0.6.13
tokenizers 0.15.1
torch 2.2.2
torchaudio 2.2.2+cu121
torchvision 0.17.2+cu121
tqdm 4.66.2
transformers 4.37.2
wandb 0.16.6
xformers 0.0.25.post1

buxianggaimingzi commented 4 months ago

Oh sorry... it was indeed due to the small number of training steps. After training to 3k steps, the generated video looks much more normal...

https://github.com/hpcaitech/Open-Sora/assets/135929703/9a80a05a-de83-4c02-9c8f-799d4f65fc7e

tonney007 commented 4 months ago

What is the minimum amount of GPU memory needed for training? I keep hitting OOM.

Weixiang-Sun commented 4 months ago

What is the minimum amount of GPU memory needed for training? I keep hitting OOM.

I successfully trained on 2 A100-80G

fenghe12 commented 4 months ago

What is the minimum amount of GPU memory needed for training? I keep hitting OOM.

I successfully trained on 2 A100-80G

Have you encountered any colossalai errors?

Weixiang-Sun commented 4 months ago

What is the minimum amount of GPU memory needed for training? I keep hitting OOM.

I successfully trained on 2 A100-80G

Have you encountered any colossalai errors?

Please be more specific.

fenghe12 commented 4 months ago

This is the error I'm getting: No module named 'colossalai._C.fused_optim_cuda'

Weixiang-Sun commented 4 months ago

This is the error I'm getting: No module named 'colossalai._C.fused_optim_cuda'

Maybe it's a CUDA version problem.

fenghe12 commented 4 months ago

cu 11.8

fenghe12 commented 4 months ago

It's probably not a CUDA problem.

fenghe12 commented 4 months ago

Or rather, it just keeps running this step: [extension] Compiling the JIT cpu_adam_x86 kernel during runtime now
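
For reference, a minimal check, using only the module name from the error reported above, to see whether the prebuilt extension is importable in the current environment; if it is not, the kernels are JIT-compiled at runtime, which appears consistent with the compiling message here and can take a while on the first run:

import importlib

try:
    # Module name taken from the error message reported above
    importlib.import_module("colossalai._C.fused_optim_cuda")
    print("Prebuilt fused_optim_cuda extension found.")
except ImportError as err:
    print("Prebuilt extension missing; kernels will be JIT-compiled at runtime:", err)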

fenghe12 commented 4 months ago

I'm running the following training command:

torchrun --standalone --nproc_per_node 8 scripts/train.py configs/opensora-v1-1/train/stage1.py --data-path -

syc11-25 commented 3 months ago

Or rather, it just keeps running this step: [extension] Compiling the JIT cpu_adam_x86 kernel during runtime now

Have you solved this problem?

GautamV234 commented 3 days ago

Hi @buxianggaimingzi ,

I intend to finetune Open-Sora for one of my research projects. Could you please guide me through the exact steps? I keep getting lost in the documentation and have had a hard time figuring it out. Also, could you specify the GPU specs you used for finetuning? I have 4 NVIDIA 4090s at my disposal. Thanks a lot!