Total training samples is always 0 when training v1.1 on my own dataset

wonder-hy commented 3 months ago

Everything looks fine at the start, as I followed all of the training instructions:

[2024-06-19 10:38:12] Experiment directory created at outputs/016-STDiT2-XL-2
[2024-06-19 10:38:12] Dataset contains 180 samples.
Number of buckets: 729
Number of buckets: 729
Number of buckets: 729
Number of buckets: 729
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:32<00:00, 16.14s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:32<00:00, 16.18s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:33<00:00, 16.98s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:34<00:00, 17.40s/it]
Missing keys: []
Unexpected keys: []
Missing keys: []
Unexpected keys: []
Missing keys: []
Unexpected keys: []
[2024-06-19 10:39:00] Trainable model params: 731.90 M, Total model params: 731.90 M
Missing keys: []
Unexpected keys: []
/root/anaconda3/envs/newsora/lib/python3.10/site-packages/colossalai/kernel/extensions/utils.py:96: UserWarning: [extension] The CUDA version on the system (12.2) does not match with the version (12.1) torch was compiled with. The mismatch is found in the minor version. As the APIs are compatible, we will allow compilation to proceed. If you encounter any issue when using the built kernel, please try to build it again with fully matched CUDA versions
  warnings.warn(
[extension] Compiling the JIT cpu_adam_x86 kernel during runtime now

But when training starts, it looks like no data is actually being trained on: the total training samples is always 0, and each epoch takes very little time:

mask ratios: {'mask_no': 0.75, 'mask_quarter_random': 0.025, 'mask_quarter_head': 0.025, 'mask_quarter_tail': 0.025, 'mask_quarter_head_tail': 0.05, 'mask_image_random': 0.025, 'mask_image_head': 0.025, 'mask_image_tail': 0.025, 'mask_image_head_tail': 0.05}
Total training samples: 0, num buckets: 0
Bucket samples: {}
Bucket samples by aspect ratio: defaultdict(<class 'int'>, {})
Bucket samples by HxWxT: defaultdict(<class 'int'>, {})
Number of batches: 0
[2024-06-19 10:39:06] Training for 1000 epochs with 0 steps per epoch
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
[2024-06-19 10:39:06] Beginning epoch 0...
Epoch 0: 0it [00:00, ?it/s]
Epoch done, recomputing batch sampler
Epoch done, recomputing batch sampler
Epoch done, recomputing batch sampler
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
Epoch done, recomputing batch sampler
Total training samples: 0, num buckets: 0
Bucket samples: {}
Bucket samples by aspect ratio: defaultdict(<class 'int'>, {})
Bucket samples by HxWxT: defaultdict(<class 'int'>, {})
Number of batches: 0
[2024-06-19 10:39:09] Beginning epoch 1...
Epoch 1: 0it [00:00, ?it/s]
Epoch done, recomputing batch sampler
Epoch 1: 0it [00:00, ?it/s]
Epoch done, recomputing batch sampler
Epoch done, recomputing batch sampler
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
Epoch done, recomputing batch sampler
Epoch done, recomputing batch sampler
Epoch done, recomputing batch sampler
Total training samples: 0, num buckets: 0
Bucket samples: {}
Bucket samples by aspect ratio: defaultdict(<class 'int'>, {})
Bucket samples by HxWxT: defaultdict(<class 'int'>, {})
Number of batches: 0

Why does this happen? I really need help. My dataset CSV looks like this:

path,text,num_frames,height,width,aspect_ratio,fps
/data1/why/Open-Sora/test_dataset/train_videos/person2_11.mp4,CT,11,512,512,1.0,11
/data1/why/Open-Sora/test_dataset/train_videos/person14_19.mp4,CT,11,512,512,1.0,11
/data1/why/Open-Sora/test_dataset/train_videos/person5_1.mp4,CT,11,512,512,1.0,11

zhengzangw commented 3 months ago

The problem is that the videos in your dataset have too few frames (see the num_frames column):

path,text,num_frames,height,width,aspect_ratio,fps
/data1/why/Open-Sora/test_dataset/train_videos/person2_11.mp4,CT,**11**,512,512,1.0,11
/data1/why/Open-Sora/test_dataset/train_videos/person14_19.mp4,CT,**11**,512,512,1.0,11
/data1/why/Open-Sora/test_dataset/train_videos/person5_1.mp4,CT,**11**,512,512,1.0,11
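
A quick way to confirm this from the metadata CSV is to list every row whose num_frames falls below the minimum the buckets accept. The following is a minimal sketch, not part of Open-Sora: `short_videos` is a hypothetical helper, and the default threshold of 50 is an assumption taken from the figure quoted in this reply. The right value depends on your bucket config and frame_interval.

```python
import csv
import io


def short_videos(csv_text, min_frames=50):
    """Return (path, num_frames) for every row with fewer than min_frames frames.

    min_frames=50 is an assumed threshold, not a value read from Open-Sora.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["path"], int(row["num_frames"]))
        for row in reader
        if int(row["num_frames"]) < min_frames
    ]


# The three rows posted in this issue.
sample = """path,text,num_frames,height,width,aspect_ratio,fps
/data1/why/Open-Sora/test_dataset/train_videos/person2_11.mp4,CT,11,512,512,1.0,11
/data1/why/Open-Sora/test_dataset/train_videos/person14_19.mp4,CT,11,512,512,1.0,11
/data1/why/Open-Sora/test_dataset/train_videos/person5_1.mp4,CT,11,512,512,1.0,11
"""

for path, n in short_videos(sample):
    print(f"{n:>4} frames  {path}")
```

If every row is listed, as with the sample above, no video can land in any bucket, which would match the "Total training samples: 0" line in the log.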

Our bucket in 1.1 requires at least 50 frames. There are two suggestions:

Check the actual frame count of your videos. A video usually contains many more than 11 frames.
Otherwise, change the bucket config and frame_interval so that the model accepts shorter videos.
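
To see how bucket length and frame_interval interact, here is a rough sketch under a simplifying assumption: sampling T frames with a stride of frame_interval needs about T * frame_interval source frames. The real bucket-matching logic in Open-Sora may differ in detail, and `frames_needed`, `usable_lengths`, and the bucket lengths below are hypothetical, chosen only for illustration.

```python
def frames_needed(bucket_frames, frame_interval):
    """Approximate source frames needed to sample bucket_frames frames
    at a stride of frame_interval (simplified assumption)."""
    return bucket_frames * frame_interval


def usable_lengths(video_frames, bucket_lengths, frame_interval):
    """Bucket lengths that a video with video_frames frames could fill."""
    return [
        t for t in bucket_lengths
        if frames_needed(t, frame_interval) <= video_frames
    ]


# An 11-frame video with illustrative long buckets and a stride of 3: nothing fits.
print(usable_lengths(11, [16, 32, 64], frame_interval=3))  # []

# Shorter illustrative buckets with a stride of 1: some lengths fit.
print(usable_lengths(11, [4, 8, 16], frame_interval=1))  # [4, 8]
```

Under this assumption, an 11-frame clip only becomes usable once the config contains a bucket whose length times frame_interval is at most 11.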

wonder-hy commented 3 months ago

Thank you for your reply! I'll try that.

hpcaitech / Open-Sora, issue #468