hpcaitech / Open-Sora

Open-Sora: Democratizing Efficient Video Production for All
https://hpcaitech.github.io/Open-Sora/
Apache License 2.0
21.54k stars, 2.06k forks

The training is consistently getting stuck and is not proceeding. #608

Open gracezhao1997 opened 1 month ago

gracezhao1997 commented 1 month ago

The training is consistently getting stuck and is not proceeding.

[2024-07-15 13:32:09] Preparing for distributed training...
[2024-07-15 13:32:09] Boosting model for distributed training
[2024-07-15 13:32:09] Training for 1000 epochs with 32425 steps per epoch
[2024-07-15 13:32:11] Beginning epoch 0...
Epoch 0: 0%| | 0/32425 [00:00<?, ?it/s]
/mnt/vepfs/zhaomin/anaconda3/envs/ckh/lib/python3.9/site-packages/colossalai/nn/optimizer/nvme_optimizer.py:55: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  numel += p.storage().size()

JThh commented 1 month ago

How long was it stuck? Can you try reducing the batch size, or adding intermediate print() calls to confirm whether it is proceeding?

For reference, see our training report: https://github.com/hpcaitech/Open-Sora/blob/main/docs/report_03.md#more-data-and-better-multi-stage-training
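One way to follow the suggestion above about intermediate print() calls is sketched below. It is only an illustration, not code from Open-Sora: `StepTimer`, `arm_watchdog`, and `disarm_watchdog` are hypothetical helper names, and the 600-second timeout is an arbitrary assumption. The sketch uses only the Python standard library. `StepTimer.mark` prints the time elapsed since the previous mark, and `faulthandler.dump_traceback_later` dumps the stack of every thread if the process is still running after the timeout, which shows where it is blocked.

```python
import faulthandler
import sys
import time


class StepTimer:
    """Hypothetical helper: print the time elapsed between labelled marks."""

    def __init__(self, stream=None):
        self.stream = sys.stderr if stream is None else stream
        self._last = time.perf_counter()

    def mark(self, label):
        # Report the time since the previous mark, flushing so the line
        # appears even if the process hangs immediately afterwards.
        now = time.perf_counter()
        elapsed = now - self._last
        self._last = now
        print(f"[debug] {label}: {elapsed:.3f}s", file=self.stream, flush=True)
        return elapsed


def arm_watchdog(timeout_s=600, file=None):
    # Dump all thread tracebacks if this call is not cancelled or re-armed
    # within timeout_s seconds. Calling it again resets the timer.
    faulthandler.dump_traceback_later(
        timeout_s, repeat=True, file=file if file is not None else sys.stderr
    )


def disarm_watchdog():
    # Cancel the pending traceback dump.
    faulthandler.cancel_dump_traceback_later()
```

In a training loop, one would call `arm_watchdog()` at the start of each step and `timer.mark("forward")`, `timer.mark("backward")`, and so on after each stage. In a multi-process run, each rank writes its own dump, so comparing them may show which rank is waiting.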

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 7 days with no activity.

gracezhao1997 commented 1 month ago

The training phase is stuck here:

[extension] Compiling the JIT cpu_adam_x86 kernel during runtime now
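A possible check for a hang at a JIT compile step, offered as an assumption rather than a confirmed diagnosis of this issue: PyTorch's `torch.utils.cpp_extension` serialises builds with a file named `lock` in the build directory, and a lock left behind by an interrupted build can make later runs wait indefinitely. The sketch below lists old `lock` files under a directory you pass in. `find_stale_locks` is a hypothetical helper, and the thread does not say which cache directory ColossalAI uses, so the path is left to the caller. PyTorch's own default is `~/.cache/torch_extensions`, or the value of `TORCH_EXTENSIONS_DIR` if set.

```python
import os
import time


def find_stale_locks(root, max_age_s=600, now=None):
    """Return sorted paths of files named 'lock' under root whose
    modification time is more than max_age_s seconds old."""
    now = time.time() if now is None else now
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name != "lock":
                continue
            path = os.path.join(dirpath, name)
            # A lock older than the threshold is a candidate for manual
            # inspection; this function never deletes anything.
            if now - os.path.getmtime(path) > max_age_s:
                stale.append(path)
    return sorted(stale)
```

For example, `find_stale_locks(os.path.expanduser("~/.cache/torch_extensions"))` would report candidates in PyTorch's default location. Whether removing a reported file is safe depends on no other process currently building the extension.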

github-actions[bot] commented 5 days ago

This issue is stale because it has been open for 7 days with no activity.