hpcaitech / Open-Sora

Open-Sora: Democratizing Efficient Video Production for All
https://hpcaitech.github.io/Open-Sora/
Apache License 2.0

opensora-v1.1 finetune produces abnormal video output #388

Closed: buxianggaimingzi closed this issue 4 months ago

buxianggaimingzi commented 4 months ago

Describe the bug

I tried using the Inter4K dataset to understand the data processing and training pipeline of opensora-v1.1. Following the documentation, I generated the CSV file for the Inter4K data and ran finetuning as described in the docs: https://github.com/hpcaitech/Open-Sora/blob/09e53db185efa76a658ce1e62523cba94bf7ab9b/README.md?plain=1#L301-L302 The base model is OpenSora-STDiT-v2-stage3, but I don't know what's going on; the video generated by the finetuned model is very poor, and it looks like there are some bugs.

https://github.com/hpcaitech/Open-Sora/assets/135929703/7dc09c36-092c-4ea4-864f-8e2a9a81a116

I only trained for 1k steps, so I don't know whether that is the cause. But since Inter4K is part of your training data, I assumed that finetuning on Inter4K should not produce such abnormal results, even with very few training steps?

Reproduction

By the way, there seem to be some problems in the CSV-generation process. When I run the caption_llava script, the generated CSV only contains path, text, and num_frames, so I manually merged height, width, and the other fields into the final CSV. You can ignore the CSV generation; I post the first few rows of the CSV here:

path,text,num_frames,height,width,aspect_ratio,fps,resolution,aes
/path/to/Inter4K/clips/99_scene-2.mp4,"a vibrant and dynamic performance of a band on stage. the stage is illuminated with bright blue and red lights, creating a striking contrast against the black background. the band members are actively engaged in their performance, with one member playing the drums, another on the bass, and a third on the keyboard. the stage is adorned with a large, intricate metal structure that adds to the visual appeal of the performance. the band members are dressed in colorful outfits, further enhancing the lively atmosphere of the concert.",60,720,1280,0.5625,30.0,921600,5.195915222167969
/path/to/Inter4K/clips/749_scene-0.mp4,"a close-up shot of a pasta maker in action. the pasta maker is silver and has a cylindrical shape. it is filled with white dough, which is being extruded through a small opening at the bottom. the dough is being cut into long, thin strands, which are being deposited onto a plate. the plate is round and has a silver rim. the background is black, which contrasts with the white dough and silver objects. the style of the video is realistic and it appears to be a still image rather than a moving one.",60,720,1280,0.5625,30.0,921600,5.536831855773926
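
For reference, a minimal pandas sketch of the manual merge described above, assuming the caption CSV from caption_llava and a separate per-clip metadata CSV that both contain a path column. Both input file names are hypothetical; the output name matches the CSV used in the training command below.

import pandas as pd

# Caption CSV produced by caption_llava: path, text, num_frames
captions = pd.read_csv("meta_clips_caption_cleaned.csv")
# Hypothetical metadata CSV: path, height, width, aspect_ratio, fps, resolution, aes
info = pd.read_csv("meta_clips_info.csv")

# Join on the clip path and write the final training CSV
merged = captions.merge(info, on="path", how="inner")
merged.to_csv("meta_clips_caption_cleaned_final.csv", index=False)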

All commands are run from inside ./scripts. Training:

torchrun --standalone --nproc_per_node 4 train.py ../configs/opensora-v1-1/train/stage3.py --ckpt-path /path/to/OpenSora-STDiT-v2-stage3/ --data-path /path/to/Inter4K/meta_clips_caption_cleaned_final.csv

Inference:

python inference-long.py ../configs/opensora-v1-1/inference/sample.py --ckpt-path ./outputs/007-STDiT2-XL-2/epoch6-global_step1000/model/ --num-frames 32 --image-size 240 426 --sample-name image-inter-fintune0510-wave-240-426 --prompt '{"reference_path": "../assets/images/condition/wave.png","mask_strategy": "0"}'
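
As a side note, a small sketch showing how the --prompt JSON string above could be built with json.dumps instead of typing it by hand, which avoids shell-quoting mistakes; the keys reference_path and mask_strategy are taken directly from the command above:

import json

# Build the image-conditioning prompt passed to inference-long.py above
prompt = json.dumps({
    "reference_path": "../assets/images/condition/wave.png",
    "mask_strategy": "0",
})
print(prompt)  # pass this string as the value of --prompt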

Logs

There are a lot of inference logs and no errors are reported, so the logs are probably not helpful. I can provide log snippets if you need them.

System Info

There are a lot of packages and a lot of irrelevant information, so I only post the versions of the key libraries:

accelerate 0.21.0
apex 0.1
clip 1.0
colossalai 0.3.7
einops 0.6.1
mmengine 0.10.4
oauthlib 3.2.2
opencv-python 4.9.0.80
pandas 2.2.2
PyYAML 6.0.1
safetensors 0.4.3
scenedetect 0.6.3
scikit-learn 1.2.2
scipy 1.13.0
setproctitle 1.3.3
setuptools 68.2.2
six 1.16.0
tensorboard 2.16.2
tensorboard-data-server 0.7.2
timm 0.6.13
tokenizers 0.15.1
torch 2.2.2
torchaudio 2.2.2+cu121
torchvision 0.17.2+cu121
tqdm 4.66.2
transformers 4.37.2
wandb 0.16.6
xformers 0.0.25.post1

buxianggaimingzi commented 4 months ago

Oh sorry... it was indeed due to the small number of training steps. After training to 3k steps, the generated video looks much more normal...

https://github.com/hpcaitech/Open-Sora/assets/135929703/9a80a05a-de83-4c02-9c8f-799d4f65fc7e

tonney007 commented 4 months ago

What is the minimum amount of GPU memory needed for training? I keep hitting OOM.

Weixiang-Sun commented 4 months ago

What is the minimum amount of GPU memory needed for training? I keep hitting OOM.

I successfully trained on 2 A100-80G

fenghe12 commented 4 months ago

What is the minimum amount of GPU memory needed for training? I keep hitting OOM.

I successfully trained on 2 A100-80G

Have you encountered any colossalai errors?

Weixiang-Sun commented 4 months ago

What is the minimum amount of GPU memory needed for training? I keep hitting OOM.

I successfully trained on 2 A100-80G

Have you encountered any colossalai errors?

Please be more specific.

fenghe12 commented 4 months ago

This is the error I'm getting: No module named 'colossalai._C.fused_optim_cuda'

Weixiang-Sun commented 4 months ago

This is the error I'm getting: No module named 'colossalai._C.fused_optim_cuda'

Maybe it's a CUDA version problem.

fenghe12 commented 4 months ago

cu 11.8

fenghe12 commented 4 months ago

It's probably not a CUDA problem.

fenghe12 commented 4 months ago

Or rather, it just keeps running this step: [extension] Compiling the JIT cpu_adam_x86 kernel during runtime now
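
For reference, a minimal check, using only the module name from the error reported above, to see whether the prebuilt extension is importable in the current environment; if it is not, the kernels are JIT-compiled at runtime, which appears consistent with the compiling message here and can take a while on the first run:

import importlib

try:
    # Module name taken from the error message reported above
    importlib.import_module("colossalai._C.fused_optim_cuda")
    print("Prebuilt fused_optim_cuda extension found.")
except ImportError as err:
    print("Prebuilt extension missing; kernels will be JIT-compiled at runtime:", err)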

fenghe12 commented 4 months ago

I'm running the following training command:

torchrun --standalone --nproc_per_node 8 scripts/train.py configs/opensora-v1-1/train/stage1.py --data-path -

syc11-25 commented 3 months ago

Or rather, it just keeps running this step: [extension] Compiling the JIT cpu_adam_x86 kernel during runtime now

Have you solved this problem?

GautamV234 commented 3 days ago

Hi @buxianggaimingzi ,

I intend to finetune Open-Sora for one of my research projects. Could you please guide me through the exact steps? I keep getting lost in the documentation and have had a hard time figuring it out. Also, could you specify the GPU specs you used for finetuning? I have 4 NVIDIA 4090s at my disposal. Thanks a lot!