hpcaitech / Open-Sora

Open-Sora: Democratizing Efficient Video Production for All
https://hpcaitech.github.io/Open-Sora/
Apache License 2.0
21.55k stars · 2.07k forks

[Help] Training problem (how to fix this problem?) #636

Closed KihongK closed 4 weeks ago

KihongK commented 1 month ago

I tried a test training run with sample data.

My dataset:

path,id,relpath,num_frames,height,width,aspect_ratio,fps,resolution,aes,text,num_frames_y
/home/hed/Open-Sora/sample_data/sample_data_split/big_buck_bunny_240p_1mb_scene-0.mp4,big_buck_bunny_240p_1mb_scene-0,big_buck_bunny_240p_1mb_scene-0.mp4,126,240,320,0.75,15.0,76800,4.688505172729492,bunny,126
/home/hed/Open-Sora/sample_data/sample_data_split/big_buck_bunny_240p_5mb_scene-1.mp4,big_buck_bunny_240p_5mb_scene-1,big_buck_bunny_240p_5mb_scene-1.mp4,69,240,320,0.75,15.0,76800,4.362994194030762,big bunny,69
/home/hed/Open-Sora/sample_data/sample_data_split/big_buck_bunny_240p_5mb_scene-5.mp4,big_buck_bunny_240p_5mb_scene-5,big_buck_bunny_240p_5mb_scene-5.mp4,32,240,320,0.75,15.0,76800,4.103738784790039,bunny,32
/home/hed/Open-Sora/sample_data/sample_data_split/big_buck_bunny_240p_5mb_scene-8.mp4,big_buck_bunny_240p_5mb_scene-8,big_buck_bunny_240p_5mb_scene-8.mp4,147,240,320,0.75,15.0,76800,4.822460174560547,bunny,147
...

I downloaded the videos from https://sample-videos.com/ and followed the data_processing guide up to step 3.2 (Filter by aesthetic scores), because I ran into an issue with caption generation.

Since I only want to test training, I wrote the captions myself.

Then I ran the training script: torchrun --standalone --nproc_per_node 1 scripts/train.py configs/opensora-v1-2/train/stage1.py --data-path /home/hed/Open-Sora/merged_file.csv

I don't think it's training normally. Could you kindly advise me on how to fix this issue?

(opensora-inf) hed@test-opensora-a100-spot-roy:~/Open-Sora$ torchrun --standalone --nproc_per_node 1     scripts/train.py     configs/opensora-v1-2/train/stage1.py     --data-path /home/hed/Open-Sora/merged_file.csv
/home/hed/miniconda3/envs/opensora-inf/lib/python3.9/site-packages/colossalai/pipeline/schedule/_utils.py:19: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _register_pytree_node(OrderedDict, _odict_flatten, _odict_unflatten)
/home/hed/miniconda3/envs/opensora-inf/lib/python3.9/site-packages/torch/utils/_pytree.py:254: UserWarning: <class 'collections.OrderedDict'> is already registered as pytree node. Overwriting the previous registration.
  warnings.warn(
[2024-07-24 07:57:17] Experiment directory created at outputs/005-STDiT3-XL-2
[2024-07-24 07:57:17] Training configuration:
 {'adam_eps': 1e-15,
 'bucket_config': {'1024': {1: (0.05, 36)},
                   '1080p': {1: (0.1, 5)},
                   '144p': {1: (1.0, 475),
                            51: (1.0, 51),
                            102: ((1.0, 0.33), 27),
                            204: ((1.0, 0.1), 13),
                            408: ((1.0, 0.1), 6)},
                   '2048': {1: (0.1, 5)},
                   '240p': {1: (0.3, 297),
                            51: (0.4, 20),
                            102: ((0.4, 0.33), 10),
                            204: ((0.4, 0.1), 5),
                            408: ((0.4, 0.1), 2)},
                   '256': {1: (0.4, 297),
                           51: (0.5, 20),
                           102: ((0.5, 0.33), 10),
                           204: ((0.5, 0.1), 5),
                           408: ((0.5, 0.1), 2)},
                   '360p': {1: (0.2, 141),
                            51: (0.15, 8),
                            102: ((0.15, 0.33), 4),
                            204: ((0.15, 0.1), 2),
                            408: ((0.15, 0.1), 1)},
                   '480p': {1: (0.1, 89)},
                   '512': {1: (0.1, 141)},
                   '720p': {1: (0.05, 36)}},
 'ckpt_every': 2,
 'config': 'configs/opensora-v1-2/train/stage1.py',
 'dataset': {'data_path': '/home/hed/Open-Sora/merged_file.csv',
             'transform_name': 'resize_crop',
             'type': 'VariableVideoTextDataset'},
 'dtype': 'bf16',
 'ema_decay': 0.99,
 'epochs': 10,
 'grad_checkpoint': True,
 'grad_clip': 1.0,
 'load': None,
 'log_every': 1,
 'lr': 0.0001,
 'mask_ratios': {'image_head': 0.05,
                 'image_head_tail': 0.025,
                 'image_random': 0.025,
                 'image_tail': 0.025,
                 'intepolate': 0.005,
                 'quarter_head': 0.005,
                 'quarter_head_tail': 0.005,
                 'quarter_random': 0.005,
                 'quarter_tail': 0.005,
                 'random': 0.05},
 'model': {'enable_flash_attn': True,
           'enable_layernorm_kernel': True,
           'freeze_y_embedder': True,
           'from_pretrained': None,
           'qk_norm': True,
           'type': 'STDiT3-XL/2'},
 'num_bucket_build_workers': 8,
 'num_workers': 1,
 'outputs': 'outputs',
 'plugin': 'zero2',
 'record_time': False,
 'scheduler': {'sample_method': 'logit-normal',
               'type': 'rflow',
               'use_timestep_transform': True},
 'seed': 42,
 'start_from_scratch': False,
 'text_encoder': {'from_pretrained': 'DeepFloyd/t5-v1_1-xxl',
                  'model_max_length': 300,
                  'shardformer': True,
                  'type': 't5'},
 'vae': {'from_pretrained': 'hpcai-tech/OpenSora-VAE-v1.2',
         'micro_batch_size': 1,
         'micro_frame_size': 17,
         'type': 'OpenSoraVAE_V1_2'},
 'wandb': False,
 'warmup_steps': 1000}
[2024-07-24 07:57:18] Building dataset...
[2024-07-24 07:57:18] Dataset contains 26 samples.
[2024-07-24 07:57:18] Number of buckets: 626
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
[2024-07-24 07:57:18] Building buckets...
[2024-07-24 07:57:18] Bucket Info:
[2024-07-24 07:57:18] Bucket [#sample, #batch] by aspect ratio:
{'0.56': [2, 0], '0.57': [1, 0], '0.72': [8, 0], '0.75': [5, 0]}
[2024-07-24 07:57:18] Image Bucket [#sample, #batch] by HxWxT:
{}
[2024-07-24 07:57:18] Video Bucket [#sample, #batch] by HxWxT:
{('240p', 51): [1, 0],
 ('256', 102): [1, 0],
 ('256', 51): [8, 0],
 ('144p', 102): [2, 0],
 ('144p', 51): [4, 0]}
[2024-07-24 07:57:18] #training batch: 0, #training sample: 16, #non empty bucket: 7
[2024-07-24 07:57:18] Building models...
/home/hed/miniconda3/envs/opensora-inf/lib/python3.9/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Loading checkpoint shards: 100% 2/2 [00:26<00:00, 13.03s/it]
[2024-07-24 07:58:06] [Diffusion] Trainable model params: 1.12 B, Total model params: 1.12 B
[extension] Compiling the JIT cpu_adam_x86 kernel during runtime now
[extension] Time taken to compile cpu_adam_x86 op: 0.6594498157501221 seconds
[extension] Compiling the JIT fused_optim_cuda kernel during runtime now
[extension] Time taken to compile fused_optim_cuda op: 0.5570011138916016 seconds
[2024-07-24 07:58:08] mask ratios: {'random': 0.05, 'intepolate': 0.005, 'quarter_random': 0.005, 'quarter_head': 0.005, 'quarter_tail': 0.005, 'quarter_head_tail': 0.005, 'image_random': 0.025, 'image_head': 0.05, 'image_tail': 0.025, 'image_head_tail': 0.025, 'identity': 0.8}
[2024-07-24 07:58:08] Preparing for distributed training...
[2024-07-24 07:58:08] Boosting model for distributed training
[2024-07-24 07:58:08] Training for 10 epochs with 0 steps per epoch
[2024-07-24 07:58:08] Beginning epoch 0...
Epoch 0: 0it [00:00, ?it/s]
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
[2024-07-24 07:58:08] Building buckets...
[2024-07-24 07:58:09] Bucket Info:
[2024-07-24 07:58:09] Bucket [#sample, #batch] by aspect ratio:
{'0.56': [2, 0], '0.57': [1, 0], '0.72': [9, 0], '0.75': [4, 0]}
[2024-07-24 07:58:09] Image Bucket [#sample, #batch] by HxWxT:
{}
[2024-07-24 07:58:09] Video Bucket [#sample, #batch] by HxWxT:
{('240p', 51): [1, 0],
 ('256', 102): [1, 0],
 ('256', 51): [9, 0],
 ('144p', 102): [1, 0],
 ('144p', 51): [4, 0]}
[2024-07-24 07:58:09] #training batch: 0, #training sample: 16, #non empty bucket: 7
[2024-07-24 07:58:09] Beginning epoch 1...
Epoch 1: 0it [00:00, ?it/s]
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
[2024-07-24 07:58:09] Building buckets...
[2024-07-24 07:58:09] Bucket Info:
[2024-07-24 07:58:09] Bucket [#sample, #batch] by aspect ratio:
{'0.56': [2, 0], '0.57': [1, 0], '0.72': [4, 0], '0.75': [9, 0]}
[2024-07-24 07:58:09] Image Bucket [#sample, #batch] by HxWxT:
{}
[2024-07-24 07:58:09] Video Bucket [#sample, #batch] by HxWxT:
{('360p', 51): [1, 0],
 ('240p', 51): [1, 0],
 ('256', 102): [2, 0],
 ('256', 51): [3, 0],
 ('144p', 51): [9, 0]}
[2024-07-24 07:58:09] #training batch: 0, #training sample: 16, #non empty bucket: 6
[2024-07-24 07:58:09] Beginning epoch 2...
Epoch 2: 0it [00:00, ?it/s]
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
[2024-07-24 07:58:09] Building buckets...
[2024-07-24 07:58:10] Bucket Info:
[2024-07-24 07:58:10] Bucket [#sample, #batch] by aspect ratio:
{'0.56': [2, 0], '0.57': [1, 0], '0.72': [5, 0], '0.75': [8, 0]}
[2024-07-24 07:58:10] Image Bucket [#sample, #batch] by HxWxT:
{}
[2024-07-24 07:58:10] Video Bucket [#sample, #batch] by HxWxT:
{('240p', 51): [1, 0],
 ('256', 51): [6, 0],
 ('144p', 102): [2, 0],
 ('144p', 51): [7, 0]}
[2024-07-24 07:58:10] #training batch: 0, #training sample: 16, #non empty bucket: 6
[2024-07-24 07:58:10] Beginning epoch 3...
Epoch 3: 0it [00:00, ?it/s]
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
[2024-07-24 07:58:10] Building buckets...
[2024-07-24 07:58:11] Bucket Info:
[2024-07-24 07:58:11] Bucket [#sample, #batch] by aspect ratio:
{'0.57': [3, 0], '0.72': [10, 0], '0.75': [3, 0]}
[2024-07-24 07:58:11] Image Bucket [#sample, #batch] by HxWxT:
{}
[2024-07-24 07:58:11] Video Bucket [#sample, #batch] by HxWxT:
{('256', 102): [2, 0], ('256', 51): [11, 0], ('144p', 51): [3, 0]}
[2024-07-24 07:58:11] #training batch: 0, #training sample: 16, #non empty bucket: 4
[2024-07-24 07:58:11] Beginning epoch 4...
Epoch 4: 0it [00:00, ?it/s]
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
[2024-07-24 07:58:11] Building buckets...
[2024-07-24 07:58:12] Bucket Info:
[2024-07-24 07:58:12] Bucket [#sample, #batch] by aspect ratio:
{'0.56': [1, 0], '0.57': [2, 0], '0.72': [9, 0], '0.75': [4, 0]}
[2024-07-24 07:58:12] Image Bucket [#sample, #batch] by HxWxT:
{}
[2024-07-24 07:58:12] Video Bucket [#sample, #batch] by HxWxT:
{('240p', 51): [1, 0],
 ('256', 102): [3, 0],
 ('256', 51): [8, 0],
 ('144p', 102): [1, 0],
 ('144p', 51): [3, 0]}
[2024-07-24 07:58:12] #training batch: 0, #training sample: 16, #non empty bucket: 6
[2024-07-24 07:58:12] Beginning epoch 5...
Epoch 5: 0it [00:00, ?it/s]
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
[2024-07-24 07:58:12] Building buckets...
[2024-07-24 07:58:13] Bucket Info:
[2024-07-24 07:58:13] Bucket [#sample, #batch] by aspect ratio:
{'0.56': [1, 0], '0.57': [2, 0], '0.72': [9, 0], '0.75': [4, 0]}
[2024-07-24 07:58:13] Image Bucket [#sample, #batch] by HxWxT:
{}
[2024-07-24 07:58:13] Video Bucket [#sample, #batch] by HxWxT:
{('256', 102): [3, 0],
 ('256', 51): [8, 0],
 ('144p', 102): [1, 0],
 ('144p', 51): [4, 0]}
[2024-07-24 07:58:13] #training batch: 0, #training sample: 16, #non empty bucket: 7
[2024-07-24 07:58:13] Beginning epoch 6...
Epoch 6: 0it [00:00, ?it/s]
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
[2024-07-24 07:58:13] Building buckets...
[2024-07-24 07:58:13] Bucket Info:
[2024-07-24 07:58:13] Bucket [#sample, #batch] by aspect ratio:
{'0.56': [3, 0], '0.72': [5, 0], '0.75': [8, 0]}
[2024-07-24 07:58:13] Image Bucket [#sample, #batch] by HxWxT:
{}
[2024-07-24 07:58:13] Video Bucket [#sample, #batch] by HxWxT:
{('256', 102): [1, 0],
 ('256', 51): [4, 0],
 ('144p', 102): [3, 0],
 ('144p', 51): [8, 0]}
[2024-07-24 07:58:13] #training batch: 0, #training sample: 16, #non empty bucket: 6
[2024-07-24 07:58:13] Beginning epoch 7...
Epoch 7: 0it [00:00, ?it/s]
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
[2024-07-24 07:58:14] Building buckets...
[2024-07-24 07:58:14] Bucket Info:
[2024-07-24 07:58:14] Bucket [#sample, #batch] by aspect ratio:
{'0.56': [3, 0], '0.72': [7, 0], '0.75': [6, 0]}
[2024-07-24 07:58:14] Image Bucket [#sample, #batch] by HxWxT:
{}
[2024-07-24 07:58:14] Video Bucket [#sample, #batch] by HxWxT:
{('360p', 51): [1, 0],
 ('240p', 51): [1, 0],
 ('256', 102): [1, 0],
 ('256', 51): [6, 0],
 ('144p', 102): [1, 0],
 ('144p', 51): [6, 0]}
[2024-07-24 07:58:14] #training batch: 0, #training sample: 16, #non empty bucket: 7
[2024-07-24 07:58:14] Beginning epoch 8...
Epoch 8: 0it [00:00, ?it/s]
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
[2024-07-24 07:58:14] Building buckets...
[2024-07-24 07:58:15] Bucket Info:
[2024-07-24 07:58:15] Bucket [#sample, #batch] by aspect ratio:
{'0.56': [3, 0], '0.72': [9, 0], '0.75': [4, 0]}
[2024-07-24 07:58:15] Image Bucket [#sample, #batch] by HxWxT:
{}
[2024-07-24 07:58:15] Video Bucket [#sample, #batch] by HxWxT:
{('240p', 51): [3, 0],
 ('256', 51): [9, 0],
 ('144p', 102): [1, 0],
 ('144p', 51): [3, 0]}
[2024-07-24 07:58:15] #training batch: 0, #training sample: 16, #non empty bucket: 4
[2024-07-24 07:58:15] Beginning epoch 9...
Epoch 9: 0it [00:00, ?it/s]
(opensora-inf) hed@test-opensora-a100-spot-roy:~/Open-Sora$
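Reading the log: every epoch ends immediately ("Epoch N: 0it") because every bucket reports 0 batches. Each bucket_config entry maps num_frames to (keep_prob, batch_size), and a bucket can only yield full batches, so with 16 samples spread across buckets whose batch sizes are 8 to 51, no batch is ever formed. A minimal sketch of that arithmetic (this is illustrative, not the actual Open-Sora bucketing code; the batch sizes are read off the bucket_config printed above):

```python
# Hedged sketch of why the log shows "#training batch: 0".
# Each bucket yields floor(num_samples / batch_size) full batches; if every
# bucket holds fewer samples than its batch_size, zero batches are built and
# each epoch finishes immediately with no optimizer steps.

bucket_batch_size = {   # (resolution, num_frames) -> batch_size, from the config above
    ("240p", 51): 20,
    ("256", 51): 20,
    ("256", 102): 10,
    ("144p", 51): 51,
    ("144p", 102): 27,
}

bucket_samples = {      # sample counts from the first "Bucket Info" log
    ("240p", 51): 1,
    ("256", 102): 1,
    ("256", 51): 8,
    ("144p", 102): 2,
    ("144p", 51): 4,
}

total_batches = sum(n // bucket_batch_size[b] for b, n in bucket_samples.items())
print(total_batches)  # 0 -> matches "#training batch: 0" in the log
```

So the dataset is simply too small for the stage1 bucket batch sizes; reducing the per-bucket batch_size values (or adding more data) lets batches form.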
281LinChenjian commented 1 month ago

Hello, I've run into some problems and need to discuss them with you. Do you have WeChat?

KihongK commented 1 month ago

Hello, I've run into some problems and need to discuss them with you. Do you have WeChat?

Sorry, I don't have a WeChat account.

281LinChenjian commented 1 month ago

Were you able to train normally after reducing the batch_size as I suggested yesterday? Does your trained model produce normal results at inference?

KihongK commented 1 month ago

Were you able to train normally after reducing the batch_size as I suggested yesterday? Does your trained model produce normal results at inference?

This issue was created before we communicated 😀

But another error has occurred, which I'm still working on resolving (I haven't trained the model yet 😇)

[2024-07-26 08:21:23] #training batch: 16, #training sample: 16, #non empty bucket: 7
[2024-07-26 08:21:23] Building models...
/home/ac01-kkhong/miniconda3/envs/opensora-inf/lib/python3.9/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Loading checkpoint shards: 100%
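For anyone hitting the same "#training batch: 0" symptom with a tiny test dataset: the follow-up log above now shows "#training batch: 16", consistent with shrinking the per-bucket batch sizes so each non-empty bucket can form at least one batch. A hedged sketch of such an override (the exact numbers are illustrative, following the num_frames -> (keep_prob, batch_size) convention from the printed config; this is not the repo's stage1.py):

```python
# Illustrative bucket_config override for a tiny test dataset (not repo code).
# Each entry: num_frames -> (keep_prob, batch_size). batch_size is set to 1
# for the video buckets so even a single sample can form a batch.
bucket_config = {
    "144p": {1: (1.0, 4), 51: (1.0, 1), 102: (1.0, 1)},
    "240p": {1: (1.0, 4), 51: (1.0, 1)},
    "256":  {1: (1.0, 4), 51: (1.0, 1), 102: (1.0, 1)},
    "360p": {1: (1.0, 2), 51: (1.0, 1)},
}
```

With batch_size 1 in every video bucket, 16 samples yield 16 batches, which matches the "#training batch: 16" line in the follow-up log.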
github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] commented 4 weeks ago

This issue was closed because it has been inactive for 7 days since being marked as stale.