hpcaitech / Open-Sora

Open-Sora: Democratizing Efficient Video Production for All
https://hpcaitech.github.io/Open-Sora/
Apache License 2.0
22.22k stars 2.17k forks

root cause: no __init__.py file under shardformer folder. #251

Closed Get-David closed 7 months ago

Get-David commented 7 months ago
root cause: no __init__.py file under shardformer folder.

Originally posted by @BountyMage in https://github.com/hpcaitech/Open-Sora/issues/232#issuecomment-2024569674

I get the same error. I added an __init__.py and it still fails: /home/zdw/Open-Sora/opensora/acceleration/shardformer/__init__.py

(opensora) zdw@ai-gpu-server149:~/Open-Sora$ torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py configs/opensora/train/16x512x512.py --data-path /home/zdw/Open-Sora/pre_datasets/datasets1/datasets1.csv
/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/colossalai/pipeline/schedule/_utils.py:19: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _register_pytree_node(OrderedDict, _odict_flatten, _odict_unflatten)
/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/utils/_pytree.py:254: UserWarning: <class 'collections.OrderedDict'> is already registered as pytree node. Overwriting the previous registration.
  warnings.warn(
/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:45: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel
  warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel")
Config (path: configs/opensora/train/16x512x512.py): {'num_frames': 16, 'frame_interval': 3, 'image_size': (512, 512), 'root': None, 'data_path': '/home/zdw/Open-Sora/pre_datasets/datasets1/datasets1.csv', 'use_image_transform': False, 'num_workers': 4, 'dtype': 'bf16', 'grad_checkpoint': False, 'plugin': 'zero2', 'sp_size': 1, 'model': {'type': 'STDiT-XL/2', 'space_scale': 1.0, 'time_scale': 1.0, 'from_pretrained': '/home/zdw/Open-Sora/pre_training/Open-Sora/OpenSora-v1-HQ-16x512x512.pth', 'enable_flashattn': False, 'enable_layernorm_kernel': False}, 'vae': {'type': 'VideoAutoencoderKL', 'from_pretrained': '/home/zdw/Open-Sora/pre_training/sd-vae-ft-ema', 'micro_batch_size': 128}, 'text_encoder': {'type': 't5', 'from_pretrained': '/home/zdw/Open-Sora/pre_training/t5-v1_1-xxl', 'model_max_length': 120, 'shardformer': True}, 'scheduler': {'type': 'iddpm', 'timestep_respacing': ''}, 'seed': 42, 'outputs': 'outputs', 'wandb': False, 'epochs': 1000, 'log_every': 10, 'ckpt_every': 500, 'load': None, 'batch_size': 8, 'lr': 2e-05, 'grad_clip': 1.0, 'multi_resolution': False}
/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/colossalai/initialize.py:48: UserWarning: `config` is deprecated and will be removed soon.
  warnings.warn("`config` is deprecated and will be removed soon.")
[04/02/24 15:13:50] INFO     colossalai - colossalai - INFO:                                                                                
                             /data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/colossalai/initialize.py:67 launch      
                    INFO     colossalai - colossalai - INFO: Distributed environment is initialized, world size: 1                          
[2024-04-02 15:13:50] Experiment directory created at outputs/010-F16S3-STDiT-XL-2
[2024-04-02 15:13:50] Dataset contains 1 videos (/home/zdw/Open-Sora/pre_datasets/datasets1/datasets1.csv)
[2024-04-02 15:13:50] Total batch size: 8
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████| 2/2 [00:45<00:00, 22.79s/it]
Traceback (most recent call last):
  File "/home/zdw/Open-Sora/scripts/train.py", line 287, in <module>
    main()
  File "/home/zdw/Open-Sora/scripts/train.py", line 132, in main
    text_encoder = build_module(cfg.text_encoder, MODELS, device=device)  # T5 must be fp32
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/opensora/registry.py", line 22, in build_module
    return builder.build(cfg)
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/opensora/models/text_encoder/t5.py", line 287, in __init__
    self.shardformer_t5()
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/opensora/models/text_encoder/t5.py", line 292, in shardformer_t5
    from opensora.acceleration.shardformer.policy.t5_encoder import T5EncoderPolicy
ModuleNotFoundError: No module named 'opensora.acceleration.shardformer'
[2024-04-02 15:14:41,917] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2216794) of binary: /data/share8/zdw/miniconda3/envs/opensora/bin/python
Traceback (most recent call last):
  File "/data/share8/zdw/miniconda3/envs/opensora/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
scripts/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-02_15:14:41
  host      : ai-gpu-server149
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2216794)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
(opensora) zdw@ai-gpu-server149:~/Open-Sora$ 
BountyMage commented 7 months ago

Did you run the install again afterwards?

Weixiang-Sun commented 7 months ago

Create one and reinstall it.

Get-David commented 7 months ago

Did you run the install again afterwards?

Reinstall what?

Get-David commented 7 months ago

Create one and reinstall it.

Reinstall what, the conda environment, or what?

Weixiang-Sun commented 7 months ago

Create one and reinstall it.

Reinstall what, the conda environment, or what?

only opensora

Get-David commented 7 months ago

Create one and reinstall it.

Reinstall what, the conda environment, or what?

only opensora

I ran the following commands again, but I still get the error:

git clone https://github.com/hpcaitech/Open-Sora
cd Open-Sora
pip install -v .

The error output is as follows:

(opensora) zdw@ai-gpu-server149:~/opensora2/Open-Sora$ torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py configs/opensora/train/16x512x512.py --data-path /home/zdw/Open-Sora/pre_datasets/datasets1/datasets1.csv

/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/colossalai/pipeline/schedule/_utils.py:19: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _register_pytree_node(OrderedDict, _odict_flatten, _odict_unflatten)
/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/utils/_pytree.py:254: UserWarning: <class 'collections.OrderedDict'> is already registered as pytree node. Overwriting the previous registration.
  warnings.warn(
/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:45: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel
  warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel")
Config (path: configs/opensora/train/16x512x512.py): {'num_frames': 16, 'frame_interval': 3, 'image_size': (512, 512), 'root': None, 'data_path': '/home/zdw/Open-Sora/pre_datasets/datasets1/datasets1.csv', 'use_image_transform': False, 'num_workers': 4, 'dtype': 'bf16', 'grad_checkpoint': False, 'plugin': 'zero2', 'sp_size': 1, 'model': {'type': 'STDiT-XL/2', 'space_scale': 1.0, 'time_scale': 1.0, 'from_pretrained': '/home/zdw/Open-Sora/pre_training/Open-Sora/OpenSora-v1-HQ-16x512x512.pth', 'enable_flashattn': True, 'enable_layernorm_kernel': True}, 'vae': {'type': 'VideoAutoencoderKL', 'from_pretrained': '/home/zdw/Open-Sora/pre_training/sd-vae-ft-ema', 'micro_batch_size': 128}, 'text_encoder': {'type': 't5', 'from_pretrained': '/home/zdw/Open-Sora/pre_training/t5-v1_1-xxl', 'model_max_length': 120, 'shardformer': True}, 'scheduler': {'type': 'iddpm', 'timestep_respacing': ''}, 'seed': 42, 'outputs': 'outputs', 'wandb': False, 'epochs': 1000, 'log_every': 10, 'ckpt_every': 500, 'load': None, 'batch_size': 4, 'lr': 2e-05, 'grad_clip': 1.0, 'local_rank': 0, 'multi_resolution': False}
/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/colossalai/initialize.py:48: UserWarning: `config` is deprecated and will be removed soon.
  warnings.warn("`config` is deprecated and will be removed soon.")
[04/02/24 17:47:00] INFO     colossalai - colossalai - INFO:                                                                                             
                             /data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/colossalai/initialize.py:67 launch                   
                    INFO     colossalai - colossalai - INFO: Distributed environment is initialized, world size: 1                                       
[2024-04-02 17:47:00] Experiment directory created at outputs/008-F16S3-STDiT-XL-2
[2024-04-02 17:47:00] Dataset contains 1 videos (/home/zdw/Open-Sora/pre_datasets/datasets1/datasets1.csv)
[2024-04-02 17:47:00] Total batch size: 4
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:17<00:00,  8.71s/it]
Traceback (most recent call last):
  File "/home/zdw/opensora2/Open-Sora/scripts/train.py", line 287, in <module>
    main()
  File "/home/zdw/opensora2/Open-Sora/scripts/train.py", line 132, in main
    text_encoder = build_module(cfg.text_encoder, MODELS, device=device)  # T5 must be fp32
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/opensora/registry.py", line 22, in build_module
    return builder.build(cfg)
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/opensora/models/text_encoder/t5.py", line 287, in __init__
    self.shardformer_t5()
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/opensora/models/text_encoder/t5.py", line 292, in shardformer_t5
    from opensora.acceleration.shardformer.policy.t5_encoder import T5EncoderPolicy
ModuleNotFoundError: No module named 'opensora.acceleration.shardformer'
[2024-04-02 17:47:22,023] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2289970) of binary: /data/share8/zdw/miniconda3/envs/opensora/bin/python
Traceback (most recent call last):
  File "/data/share8/zdw/miniconda3/envs/opensora/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
scripts/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-02_17:47:22
  host      : ai-gpu-server149
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2289970)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
(opensora) zdw@ai-gpu-server149:~/opensora2/Open-Sora$ 
BountyMage commented 7 months ago

I copied the empty __init__.py from the acceleration directory into the shardformer directory.

tiancaitzp commented 7 months ago

I copied the empty __init__.py from the acceleration directory into the shardformer directory.

I followed your steps, but I still get No module named 'opensora.acceleration.shardformer'.
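One thing worth checking here: both tracebacks above load opensora from .../site-packages/opensora/..., not from the cloned source tree, so a file added to the clone has no effect until the package is installed again. The sketch below is only a diagnostic, not part of Open-Sora; the helper names `installed_package_dir` and `has_subpackage` are made up for illustration. It prints where `opensora` resolves from and whether `acceleration/shardformer` is present there as a regular package.

```python
import importlib.util
from pathlib import Path


def installed_package_dir(name):
    """Return the directory a top-level package resolves to, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(next(iter(spec.submodule_search_locations)))


def has_subpackage(package_dir, relative_parts):
    """True if every directory along relative_parts contains an __init__.py."""
    current = Path(package_dir)
    for part in relative_parts:
        current = current / part
        if not (current / "__init__.py").is_file():
            return False
    return True


if __name__ == "__main__":
    # Run this inside the same environment that runs scripts/train.py.
    location = installed_package_dir("opensora")
    print("opensora resolves to:", location)
    if location is not None:
        found = has_subpackage(location, ["acceleration", "shardformer"])
        print("acceleration/shardformer installed as a package:", found)
```

If the second line prints False, the installed copy is still missing the subpackage, regardless of what the source checkout contains.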

Aziily commented 7 months ago

It looks like the cause is the missing __init__.py: find_packages skips shardformer when it scans for packages. Add an empty __init__.py under opensora/acceleration/shardformer and then rerun pip install -v .
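setuptools' `find_packages` only collects directories that contain an `__init__.py`, so it can help to scan the source tree for directories it would skip before reinstalling. The following is a minimal standard-library sketch, not a script from the Open-Sora repository; the function name `dirs_missing_init` is hypothetical. Since the failing import is `opensora.acceleration.shardformer.policy.t5_encoder`, the `policy` subdirectory would need an `__init__.py` as well if it lacks one, which the thread does not confirm.

```python
from pathlib import Path


def dirs_missing_init(package_root):
    """List subdirectories that hold .py files (at any depth) but have no __init__.py."""
    root = Path(package_root)
    if not root.is_dir():
        return []
    missing = []
    for directory in sorted(p for p in root.rglob("*") if p.is_dir()):
        contains_python = any(directory.rglob("*.py"))
        if contains_python and not (directory / "__init__.py").is_file():
            missing.append(directory.relative_to(root).as_posix())
    return missing


if __name__ == "__main__":
    # Assumes the current directory is the root of the Open-Sora checkout.
    for relative_dir in dirs_missing_init("opensora"):
        print("missing __init__.py:", relative_dir)
```

After creating an empty `__init__.py` in each reported directory, rerunning pip install -v . from the repository root should copy those directories into site-packages.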

github-actions[bot] commented 7 months ago

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] commented 7 months ago

This issue was closed because it has been inactive for 7 days since being marked as stale.