data process for pre-training and fine-tuning

hpcaitech / Open-Sora

Open-Sora: Democratizing Efficient Video Production for All

https://hpcaitech.github.io/Open-Sora/

Apache License 2.0

20.77k stars, 1.97k forks

data process for pre-training and fine-tuning #393

Closed liuheng0111 closed 1 month ago

liuheng0111 commented 2 months ago

Here you say to prepare a 10M dataset. What is it composed of: panda-10m and HD-VG-130M? How much of the HD-VG dataset was used? Pre-training uses 9.7M videos. Does this mean the processing pipeline filtered out only 3% of the videos? Which processing steps were applied for pre-training, and which for fine-tuning? What filtering thresholds were used for each?
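For reference, the 3% figure in the question follows from simple arithmetic. The sketch below assumes "10M" and "9.7M" mean exactly 10,000,000 and 9,700,000 videos (the thread gives only rounded numbers), and `filtered_fraction` is a hypothetical helper written for illustration, not part of Open-Sora.

```python
def filtered_fraction(raw_count: int, kept_count: int) -> float:
    """Return the fraction of videos removed between the raw and kept sets."""
    if raw_count <= 0:
        raise ValueError("raw_count must be positive")
    if not 0 <= kept_count <= raw_count:
        raise ValueError("kept_count must be between 0 and raw_count")
    return (raw_count - kept_count) / raw_count


# Assumed exact counts; the thread only states the rounded "10M" and "9.7M".
RAW_VIDEOS = 10_000_000
PRETRAIN_VIDEOS = 9_700_000

removed = RAW_VIDEOS - PRETRAIN_VIDEOS
fraction = filtered_fraction(RAW_VIDEOS, PRETRAIN_VIDEOS)
print(f"{removed:,} videos removed ({fraction:.1%})")  # 300,000 videos removed (3.0%)
```

Under these assumptions, about 300,000 videos would have been dropped before pre-training. The real fraction depends on the exact dataset sizes, which the thread does not state.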

handsomeZhuang commented 2 months ago

Is the data processing performed at the same time as training? Why not preprocess the data offline in advance?

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] commented 1 month ago

This issue was closed because it has been inactive for 7 days since being marked as stale.