Closed buxianggaimingzi closed 4 months ago
Oh, sorry... it was indeed due to the small number of training steps. When I trained to 3k steps, the generated video looked more normal.
https://github.com/hpcaitech/Open-Sora/assets/135929703/9a80a05a-de83-4c02-9c8f-799d4f65fc7e
What is the minimum GPU memory required for training? I keep getting OOM errors.
I successfully trained on 2 A100-80G
Have you ever run into errors from colossalai?
Please be more specific.
This is the error I got: No module named 'colossalai._C.fused_optim_cuda'
Maybe it's a CUDA version problem.
cu 11.8
It shouldn't be a CUDA problem.
Or rather, it stays stuck at this step: [extension] Compiling the JIT cpu_adam_x86 kernel during runtime now
I ran the following training command: torchrun --standalone --nproc_per_node 8 scripts/train.py configs/opensora-v1-1/train/stage1.py --data-path -
Have you solved this problem?
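For anyone else hitting the `No module named 'colossalai._C.fused_optim_cuda'` error, one quick thing to check is whether the compiled extension can be located at all in the current environment. Below is a minimal sketch using only the standard library. The helper name `module_available` is made up for illustration, and the check only reports presence; it does not say why the extension is missing.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be located by the import system.

    Note: for a dotted name, find_spec imports the parent packages.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or is not a package.
        return False


if __name__ == "__main__":
    # Module name taken from the error message in this thread.
    for mod in ("colossalai", "colossalai._C.fused_optim_cuda"):
        status = "found" if module_available(mod) else "missing"
        print(f"{mod}: {status}")
```

If `colossalai` is reported as found but the `_C.fused_optim_cuda` submodule is missing, the Python package is installed while its prebuilt CUDA extension is not, which would match the error above. What to do next depends on how colossalai was installed.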
Hi @buxianggaimingzi ,
I intend to fine-tune Open-Sora for one of my research projects. Could you please walk me through the exact steps? I got lost in the documentation and had a hard time figuring it out. Also, what GPU specs did you use for fine-tuning? I have 4 NVIDIA 4090s at my disposal. Thanks a lot!
Describe the bug
I tried to use the Inter4k dataset to understand the data processing and training pipeline of opensora-v1.1. Following the documentation, I generated the csv file for the Inter4k data and ran fine-tuning as described in the docs: https://github.com/hpcaitech/Open-Sora/blob/09e53db185efa76a658ce1e62523cba94bf7ab9b/README.md?plain=1#L301-L302 The base model is OpenSora-STDiT-v2-stage3. However, the videos generated by the fine-tuned model are very poor, and it looks like there may be a bug somewhere.
https://github.com/hpcaitech/Open-Sora/assets/135929703/7dc09c36-092c-4ea4-864f-8e2a9a81a116
I only trained for 1k steps, and I don't know whether that is the cause. But since Inter4k is part of your training data, I expected that fine-tuning on Inter4k would not produce such abnormal results, even with very few training steps.
Reproduction
By the way, there seems to be a problem in the csv generation process. When I run the caption_llava script, the generated csv only has the columns path, text, and num_frames. I manually merged height, width, and the other information into the final csv. You can ignore the csv generation; here are the first few rows of the csv:
All instructions are executed in ./scripts
train:
run inference:
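The manual merge mentioned above (adding height and width to a caption csv that only has path, text, and num_frames) could be sketched roughly as follows. This is not part of the Open-Sora tooling. It assumes the extra columns live in a second, hypothetical metadata csv that shares the `path` column, and the function names are invented for illustration.

```python
import csv


def merge_metadata(caption_rows, meta_rows, key="path"):
    """Attach metadata columns (e.g. height, width) to caption rows by `key`.

    Both inputs are lists of dicts, as produced by csv.DictReader.
    Caption rows with no matching metadata entry are dropped.
    """
    meta_by_key = {row[key]: row for row in meta_rows}
    merged = []
    for row in caption_rows:
        extra = meta_by_key.get(row[key])
        if extra is None:
            continue
        merged.append({**row, **extra})
    return merged


def merge_csv_files(caption_csv, meta_csv, out_csv):
    """Read two csv files, merge them on `path`, and write the result."""
    with open(caption_csv, newline="") as f:
        captions = list(csv.DictReader(f))
    with open(meta_csv, newline="") as f:
        meta = list(csv.DictReader(f))
    merged = merge_metadata(captions, meta)
    if not merged:
        raise ValueError("no rows matched on 'path'")
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(merged[0].keys()))
        writer.writeheader()
        writer.writerows(merged)
```

Dropping unmatched rows is a design choice for this sketch; it keeps videos without known dimensions out of the training csv instead of writing empty cells.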
Logs
There are a lot of inference logs, and no errors are reported. Maybe the logs are not helpful. I can provide more log snippets if you need them.
System Info
There are a lot of pkgs and a lot of useless information, so I only post the versions of key libraries.
accelerate 0.21.0
apex 0.1
clip 1.0
colossalai 0.3.7
einops 0.6.1
mmengine 0.10.4
oauthlib 3.2.2
opencv-python 4.9.0.80
pandas 2.2.2
PyYAML 6.0.1
safetensors 0.4.3
scenedetect 0.6.3
scikit-learn 1.2.2
scipy 1.13.0
setproctitle 1.3.3
setuptools 68.2.2
six 1.16.0
tensorboard 2.16.2
tensorboard-data-server 0.7.2
timm 0.6.13
tokenizers 0.15.1
torch 2.2.2
torchaudio 2.2.2+cu121
torchvision 0.17.2+cu121
tqdm 4.66.2
transformers 4.37.2
wandb 0.16.6
xformers 0.0.25.post1
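A short list of key library versions like the one above can be collected with the standard library instead of being trimmed by hand from the full package listing. This is a minimal sketch; the helper name `collect_versions` and the particular selection of packages are illustrative assumptions.

```python
from importlib import metadata


def collect_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Package names copied from the list above; extend as needed.
    key_packages = ["torch", "torchvision", "colossalai", "xformers", "transformers"]
    for name, version in collect_versions(key_packages).items():
        print(name, version if version is not None else "(not installed)")
```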