hpcaitech / Open-Sora

Open-Sora: Democratizing Efficient Video Production for All
https://hpcaitech.github.io/Open-Sora/
Apache License 2.0

Training gets a KeyError 'Height' #665

Closed lweingart closed 2 weeks ago

lweingart commented 1 month ago

Hello guys,

I just followed your process to prepare my own dataset as described here, and I must admit it went impressively well, with no errors whatsoever.

Then I went on to check the training part and ran this command:

!torchrun --standalone --nproc_per_node 1 -m scripts.train \
  configs/opensora-v1-2/train/stage1.py \
  --data-path {ROOT_META}/meta_clips_caption.csv \
  --ckpt-path {MODEL_OUTPUT}/my_sora.pt

but it ends with a KeyError: 'height'.

Could you please help me identify a way to fix this? Any help would be greatly appreciated.

Thank you very much in advance.

Cheers

Here is the full log trace:

/usr/local/lib/python3.10/dist-packages/colossalai/pipeline/schedule/_utils.py:19: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _register_pytree_node(OrderedDict, _odict_flatten, _odict_unflatten)
/usr/local/lib/python3.10/dist-packages/torch/utils/_pytree.py:300: UserWarning: <class 'collections.OrderedDict'> is already registered as pytree node. Overwriting the previous registration.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/layer/normalization.py:45: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel
  warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel")
[2024-08-11 10:54:15] Experiment directory created at outputs/001-STDiT3-XL-2
[2024-08-11 10:54:15] Training configuration:
 {'adam_eps': 1e-15,
 'bucket_config': {'1024': {1: (0.05, 36)},
                   '1080p': {1: (0.1, 5)},
                   '144p': {1: (1.0, 475),
                            51: (1.0, 51),
                            102: ((1.0, 0.33), 27),
                            204: ((1.0, 0.1), 13),
                            408: ((1.0, 0.1), 6)},
                   '2048': {1: (0.1, 5)},
                   '240p': {1: (0.3, 297),
                            51: (0.4, 20),
                            102: ((0.4, 0.33), 10),
                            204: ((0.4, 0.1), 5),
                            408: ((0.4, 0.1), 2)},
                   '256': {1: (0.4, 297),
                           51: (0.5, 20),
                           102: ((0.5, 0.33), 10),
                           204: ((0.5, 0.1), 5),
                           408: ((0.5, 0.1), 2)},
                   '360p': {1: (0.2, 141),
                            51: (0.15, 8),
                            102: ((0.15, 0.33), 4),
                            204: ((0.15, 0.1), 2),
                            408: ((0.15, 0.1), 1)},
                   '480p': {1: (0.1, 89)},
                   '512': {1: (0.1, 141)},
                   '720p': {1: (0.05, 36)}},
 'ckpt_every': 200,
 'config': 'configs/opensora-v1-2/train/stage1.py',
 'dataset': {'data_path': '/content/drive/MyDrive/Open-Sora/opensora/data/meta/meta_clips_caption.csv',
             'transform_name': 'resize_crop',
             'type': 'VariableVideoTextDataset'},
 'dtype': 'bf16',
 'ema_decay': 0.99,
 'epochs': 1000,
 'grad_checkpoint': True,
 'grad_clip': 1.0,
 'load': None,
 'log_every': 10,
 'lr': 0.0001,
 'mask_ratios': {'image_head': 0.05,
                 'image_head_tail': 0.025,
                 'image_random': 0.025,
                 'image_tail': 0.025,
                 'intepolate': 0.005,
                 'quarter_head': 0.005,
                 'quarter_head_tail': 0.005,
                 'quarter_random': 0.005,
                 'quarter_tail': 0.005,
                 'random': 0.05},
 'model': {'enable_flash_attn': True,
           'enable_layernorm_kernel': True,
           'freeze_y_embedder': True,
           'from_pretrained': '/content/drive/MyDrive/Open-Sora/opensora/output/my_sora.pt',
           'qk_norm': True,
           'type': 'STDiT3-XL/2'},
 'num_bucket_build_workers': 16,
 'num_workers': 8,
 'outputs': 'outputs',
 'plugin': 'zero2',
 'record_time': False,
 'scheduler': {'sample_method': 'logit-normal',
               'type': 'rflow',
               'use_timestep_transform': True},
 'seed': 42,
 'start_from_scratch': False,
 'text_encoder': {'from_pretrained': 'DeepFloyd/t5-v1_1-xxl',
                  'model_max_length': 300,
                  'shardformer': True,
                  'type': 't5'},
 'vae': {'from_pretrained': 'hpcai-tech/OpenSora-VAE-v1.2',
         'micro_batch_size': 4,
         'micro_frame_size': 17,
         'type': 'OpenSoraVAE_V1_2'},
 'wandb': False,
 'warmup_steps': 1000}
2024-08-11 10:54:16.325108: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-11 10:54:16.358730: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-11 10:54:16.371316: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-11 10:54:17.866056: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[2024-08-11 10:54:18] Building dataset...
[2024-08-11 10:54:18] Dataset contains 954 samples.
[2024-08-11 10:54:18] Number of buckets: 626
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:558: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
INFO: Pandarallel will run on 16 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
[2024-08-11 10:54:18] Building buckets...
/usr/lib/python3.10/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
  self.pid = os.fork()
[rank0]: multiprocessing.pool.RemoteTraceback: 
[rank0]: """
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py", line 3791, in get_loc
[rank0]:     return self._engine.get_loc(casted_key)
[rank0]:   File "index.pyx", line 152, in pandas._libs.index.IndexEngine.get_loc
[rank0]:   File "index.pyx", line 181, in pandas._libs.index.IndexEngine.get_loc
[rank0]:   File "pandas/_libs/hashtable_class_helper.pxi", line 7080, in pandas._libs.hashtable.PyObjectHashTable.get_item
[rank0]:   File "pandas/_libs/hashtable_class_helper.pxi", line 7088, in pandas._libs.hashtable.PyObjectHashTable.get_item
[rank0]: KeyError: 'height'

[rank0]: The above exception was the direct cause of the following exception:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
[rank0]:     result = (True, func(*args, **kwds))
[rank0]:   File "/usr/lib/python3.10/multiprocessing/pool.py", line 51, in starmapstar
[rank0]:     return list(itertools.starmap(args[0], args[1]))
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pandarallel/core.py", line 95, in __call__
[rank0]:     result = self.work_function(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pandarallel/data_types/dataframe.py", line 32, in work
[rank0]:     return data.apply(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py", line 10034, in apply
[rank0]:     return op.apply().__finalize__(self, method="apply")
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pandas/core/apply.py", line 837, in apply
[rank0]:     return self.apply_standard()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pandas/core/apply.py", line 965, in apply_standard
[rank0]:     results, res_index = self.apply_series_generator()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pandas/core/apply.py", line 981, in apply_series_generator
[rank0]:     results[i] = self.func(v, *self.args, **self.kwargs)
[rank0]:   File "/content/drive/MyDrive/Open-Sora/opensora/opensora/datasets/sampler.py", line 22, in apply
[rank0]:     data["height"],
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pandas/core/series.py", line 1040, in __getitem__
[rank0]:     return self._get_value(key)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pandas/core/series.py", line 1156, in _get_value
[rank0]:     loc = self.index.get_loc(label)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py", line 3798, in get_loc
[rank0]:     raise KeyError(key) from err
[rank0]: KeyError: 'height'
[rank0]: """

[rank0]: The above exception was the direct cause of the following exception:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/content/drive/MyDrive/Open-Sora/opensora/scripts/train.py", line 412, in <module>
[rank0]:     main()
[rank0]:   File "/content/drive/MyDrive/Open-Sora/opensora/scripts/train.py", line 111, in main
[rank0]:     num_steps_per_epoch = len(dataloader)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 484, in __len__
[rank0]:     return len(self._index_sampler)
[rank0]:   File "/content/drive/MyDrive/Open-Sora/opensora/opensora/datasets/sampler.py", line 191, in __len__
[rank0]:     return self.get_num_batch() // dist.get_world_size()
[rank0]:   File "/content/drive/MyDrive/Open-Sora/opensora/opensora/datasets/sampler.py", line 221, in get_num_batch
[rank0]:     bucket_sample_dict = self.group_by_bucket()
[rank0]:   File "/content/drive/MyDrive/Open-Sora/opensora/opensora/datasets/sampler.py", line 200, in group_by_bucket
[rank0]:     bucket_ids = self.dataset.data.parallel_apply(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pandarallel/core.py", line 333, in closure
[rank0]:     results_promise.get()
[rank0]:   File "/usr/lib/python3.10/multiprocessing/pool.py", line 774, in get
[rank0]:     raise self._value
[rank0]:   File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
[rank0]:     result = (True, func(*args, **kwds))
[rank0]:   File "/usr/lib/python3.10/multiprocessing/pool.py", line 51, in starmapstar
[rank0]:     return list(itertools.starmap(args[0], args[1]))
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pandarallel/core.py", line 95, in __call__
[rank0]:     result = self.work_function(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pandarallel/data_types/dataframe.py", line 32, in work
[rank0]:     return data.apply(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pandas/core/frame.py", line 10034, in apply
[rank0]:     return op.apply().__finalize__(self, method="apply")
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pandas/core/apply.py", line 837, in apply
[rank0]:     return self.apply_standard()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pandas/core/apply.py", line 965, in apply_standard
[rank0]:     results, res_index = self.apply_series_generator()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pandas/core/apply.py", line 981, in apply_series_generator
[rank0]:     results[i] = self.func(v, *self.args, **self.kwargs)
[rank0]:   File "/content/drive/MyDrive/Open-Sora/opensora/opensora/datasets/sampler.py", line 22, in apply
[rank0]:     data["height"],
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pandas/core/series.py", line 1040, in __getitem__
[rank0]:     return self._get_value(key)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pandas/core/series.py", line 1156, in _get_value
[rank0]:     loc = self.index.get_loc(label)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/pandas/core/indexes/base.py", line 3798, in get_loc
[rank0]:     raise KeyError(key) from err
[rank0]: KeyError: 'height'
E0811 10:54:24.903000 134491422757504 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 13304) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
scripts.train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-11_10:54:24
  host      : 162ac051b6a2
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 13304)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
g-jing commented 1 month ago

I found the same issue. Do you have any solution yet?

lweingart commented 1 month ago

Hi, no, I don't yet. I'm stuck for now.

lweingart commented 1 month ago

@g-jing, have you been able to find a solution on your side? I still haven't.

lweingart commented 1 month ago

Hi again. For some reason, the CSV created at data processing step 3.2:

# 3.2 Filter by aesthetic scores. This should output ${ROOT_META}/meta_clips_info_fmin1_aes_aesmin5.csv
python -m tools.datasets.datautil ${ROOT_META}/meta_clips_info_fmin1_aes.csv --aesmin 5

has this first line after being generated:

path,id,relpath,num_frames,height,width,aspect_ratio,fps,resolution,aes

but for some reason those headers were lost at step 4.1, and the next CSV file, named meta_clips_info_fmin1_aes_aesmin5_caption_part*.csv and produced by this step:

# 4.1 Generate caption. This should output ${ROOT_META}/meta_clips_info_fmin1_aes_aesmin5_caption_part*.csv
torchrun --nproc_per_node 8 --standalone -m tools.caption.caption_llava \
  ${ROOT_META}/meta_clips_info_fmin1_aes_aesmin5.csv \
  --dp-size 8 \
  --tp-size 1 \
  --model-path /path/to/llava-v1.6-mistral-7b \
  --prompt video

only has:

path,text,num_frames

I don't know why most columns were lost, but reintegrating the width and height columns should do the trick. I'm rerunning the data processing to check whether I missed some errors in the logs, and I'll get back here.

lweingart commented 1 month ago

So, the code in tools.caption.caption_llava at line 209 has this:

dp_writer.writerow(["path", "text", "num_frames"])

As can be seen, this only writes the three columns 'path', 'text', and 'num_frames'. However, manually adding the height and width columns to my CSV file fixed the KeyError. Unfortunately, I don't have an automated way to reintroduce these columns from the previous CSV file. Luckily, in this case I had resized all my videos to 256x144, so it was easily done by hand, but in a more typical situation where videos come in multiple resolutions, I don't have a ready-made solution to propose; merging the two CSVs, as sketched below, might be a starting point.
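As a rough, untested sketch (the file names are illustrative, and it assumes the caption CSV and the step 3.2 CSV share the same path values), the missing columns could be merged back with pandas:

import pandas as pd

# CSV from step 3.2: still contains height, width, fps, etc.
info = pd.read_csv("meta_clips_info_fmin1_aes_aesmin5.csv")

# Caption CSV used for training: only has path, text, num_frames.
captions = pd.read_csv("meta_clips_caption.csv")

# Re-attach the metadata columns by joining on the clip path.
merged = captions.merge(
    info[["path", "height", "width", "aspect_ratio", "fps", "resolution", "aes"]],
    on="path",
    how="left",
)

# Drop clips whose metadata could not be matched, then save.
merged = merged.dropna(subset=["height", "width"])
merged.to_csv("meta_clips_caption_fixed.csv", index=False)

Then point --data-path at meta_clips_caption_fixed.csv when launching scripts.train.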

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] commented 2 weeks ago

This issue was closed because it has been inactive for 7 days since being marked as stale.

kehuanfeng commented 2 weeks ago

I am facing the same issue. Is there any official fix yet?