hpcaitech / Open-Sora

Open-Sora: Democratizing Efficient Video Production for All
https://hpcaitech.github.io/Open-Sora/
Apache License 2.0
20.1k stars 1.91k forks

Why does each epoch finish so quickly during training? It looks as if the data is not being loaded correctly. #562

Open xbyym opened 3 days ago

xbyym commented 3 days ago

[2024-06-29 05:36:29] Beginning epoch 0...
Epoch 0: 0it [00:00, ?it/s]
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
[2024-06-29 05:36:30] Building buckets...
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
[2024-06-29 05:36:31] Bucket Info:
[2024-06-29 05:36:31] Bucket [#sample, #batch] by aspect ratio: {'0.56': [160, 3]}
[2024-06-29 05:36:31] Image Bucket [#sample, #batch] by HxWxT: {}
[2024-06-29 05:36:31] Video Bucket [#sample, #batch] by HxWxT: {('144p', 51): [160, 3]}
[2024-06-29 05:36:31] #training batch: 3, #training sample: 160, #non empty bucket: 1
[2024-06-29 05:36:31] Beginning epoch 1...
Epoch 1: 0it [00:00, ?it/s]
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
Epoch 1: 0it [00:00, ?it/s]
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.

Here is the CSV file (I created a few hundred videos for fine-tuning):

path,text,id,relpath,num_frames,height,width,aspect_ratio,fps,resolution
/home/yy/Open-Sora/clips/sample_0_scene-0.mp4,a dog is running,sample_0_scene-0,sample_0_scene-0.mp4,96.0,144.0,256.0,0.5625,24.0,36864.0
/home/yy/Open-Sora/clips/sample_1_scene-0.mp4,a dog is running,sample_1_scene-0,sample_1_scene-0.mp4,96.0,144.0,256.0,0.5625,24.0,36864.0
/home/yy/Open-Sora/clips/sample_2_scene-0.mp4,a dog is running,sample_2_scene-0,sample_2_scene-0.mp4,96.0,144.0,256.0,0.5625,24.0,36864.0
/home/yy/Open-Sora/clips/sample_3_scene-0.mp4,a dog is running,sample_3_scene-0,sample_3_scene-0.mp4,96.0,144.0,256.0,0.5625,24.0,36864.0
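One way to rule out a malformed metadata file is to check that the derived columns agree with the raw ones. The sketch below is not part of Open-Sora; `check_row` is a hypothetical helper, and it assumes that `aspect_ratio` means height / width and `resolution` means height * width, which is what the rows posted above suggest (144 / 256 = 0.5625, 144 * 256 = 36864).

```python
import csv
import io

# Two rows copied from the CSV posted in this issue.
SAMPLE = """
path,text,id,relpath,num_frames,height,width,aspect_ratio,fps,resolution
/home/yy/Open-Sora/clips/sample_0_scene-0.mp4,a dog is running,sample_0_scene-0,sample_0_scene-0.mp4,96.0,144.0,256.0,0.5625,24.0,36864.0
/home/yy/Open-Sora/clips/sample_1_scene-0.mp4,a dog is running,sample_1_scene-0,sample_1_scene-0.mp4,96.0,144.0,256.0,0.5625,24.0,36864.0
""".strip()


def check_row(row, tol=1e-6):
    """Hypothetical helper: list inconsistencies between raw and derived columns."""
    height = float(row["height"])
    width = float(row["width"])
    problems = []
    # Assumption: aspect_ratio is stored as height / width.
    if abs(height / width - float(row["aspect_ratio"])) > tol:
        problems.append("aspect_ratio does not equal height / width")
    # Assumption: resolution is stored as the pixel count height * width.
    if abs(height * width - float(row["resolution"])) > tol:
        problems.append("resolution does not equal height * width")
    return problems


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
issues = {row["id"]: check_row(row) for row in rows}
print(issues)
```

For the posted rows both checks pass, so the metadata is at least internally consistent. This matches the log, where all 160 samples land in the '0.56' aspect-ratio bucket.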

Did I miss something? The training does not seem to have run successfully.

CIntellifusion commented 3 days ago

I have the same problem.

xbyym commented 3 days ago

> I have the same problem.

The last batch is not full, and drop_last discards it by default. Setting drop_last to false fixes it.

CIntellifusion commented 2 days ago

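> > I have the same problem.

> The last batch is not full, and drop_last discards it by default. Setting drop_last to false fixes it.

Thanks, but I have two hundred samples with batch_size=4. I currently suspect a mismatch between the bucket and the video resolution.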
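The drop_last suggestion and the objection to it can both be checked with simple arithmetic. The sketch below models a plain, single-process batch sampler; `num_batches` is a hypothetical helper, not an Open-Sora function. The real bucket sampler and any multi-GPU sharding may count batches differently.

```python
def num_batches(num_samples: int, batch_size: int, drop_last: bool) -> int:
    """Count the batches a simple sampler would yield (illustrative model only)."""
    full, remainder = divmod(num_samples, batch_size)
    # A trailing partial batch is kept only when drop_last is False.
    if remainder and not drop_last:
        return full + 1
    return full


# Fewer samples than one batch: with drop_last=True nothing is left to iterate.
print(num_batches(3, 4, drop_last=True))     # 0
print(num_batches(3, 4, drop_last=False))    # 1

# The case described in the reply: 200 samples, batch_size=4.
print(num_batches(200, 4, drop_last=True))   # 50
print(num_batches(200, 4, drop_last=False))  # 50
```

Under this simplified model, drop_last only matters when the sample count is not a multiple of the batch size. With 200 samples and batch_size=4 it makes no difference, which is consistent with looking for another cause, such as the bucket configuration.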