Closed Get-David closed 7 months ago
Did you reinstall it?
Create one and reinstall it.
Reinstall what?
Reinstall what, the conda environment, or what?
Only opensora.
I ran the following commands again, but I still get an error:
git clone https://github.com/hpcaitech/Open-Sora
cd Open-Sora
pip install -v .
The error output is as follows:
(opensora) zdw@ai-gpu-server149:~/opensora2/Open-Sora$ torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py configs/opensora/train/16x512x512.py --data-path /home/zdw/Open-Sora/pre_datasets/datasets1/datasets1.csv
/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/colossalai/pipeline/schedule/_utils.py:19: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_register_pytree_node(OrderedDict, _odict_flatten, _odict_unflatten)
/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/utils/_pytree.py:254: UserWarning: <class 'collections.OrderedDict'> is already registered as pytree node. Overwriting the previous registration.
warnings.warn(
/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/colossalai/shardformer/layer/normalization.py:45: UserWarning: Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel
warnings.warn("Please install apex from source (https://github.com/NVIDIA/apex) to use the fused layernorm kernel")
Config (path: configs/opensora/train/16x512x512.py): {'num_frames': 16, 'frame_interval': 3, 'image_size': (512, 512), 'root': None, 'data_path': '/home/zdw/Open-Sora/pre_datasets/datasets1/datasets1.csv', 'use_image_transform': False, 'num_workers': 4, 'dtype': 'bf16', 'grad_checkpoint': False, 'plugin': 'zero2', 'sp_size': 1, 'model': {'type': 'STDiT-XL/2', 'space_scale': 1.0, 'time_scale': 1.0, 'from_pretrained': '/home/zdw/Open-Sora/pre_training/Open-Sora/OpenSora-v1-HQ-16x512x512.pth', 'enable_flashattn': True, 'enable_layernorm_kernel': True}, 'vae': {'type': 'VideoAutoencoderKL', 'from_pretrained': '/home/zdw/Open-Sora/pre_training/sd-vae-ft-ema', 'micro_batch_size': 128}, 'text_encoder': {'type': 't5', 'from_pretrained': '/home/zdw/Open-Sora/pre_training/t5-v1_1-xxl', 'model_max_length': 120, 'shardformer': True}, 'scheduler': {'type': 'iddpm', 'timestep_respacing': ''}, 'seed': 42, 'outputs': 'outputs', 'wandb': False, 'epochs': 1000, 'log_every': 10, 'ckpt_every': 500, 'load': None, 'batch_size': 4, 'lr': 2e-05, 'grad_clip': 1.0, 'local_rank': 0, 'multi_resolution': False}
/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/colossalai/initialize.py:48: UserWarning: `config` is deprecated and will be removed soon.
warnings.warn("`config` is deprecated and will be removed soon.")
[04/02/24 17:47:00] INFO colossalai - colossalai - INFO:
/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/colossalai/initialize.py:67 launch
INFO colossalai - colossalai - INFO: Distributed environment is initialized, world size: 1
[2024-04-02 17:47:00] Experiment directory created at outputs/008-F16S3-STDiT-XL-2
[2024-04-02 17:47:00] Dataset contains 1 videos (/home/zdw/Open-Sora/pre_datasets/datasets1/datasets1.csv)
[2024-04-02 17:47:00] Total batch size: 4
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:17<00:00, 8.71s/it]
Traceback (most recent call last):
File "/home/zdw/opensora2/Open-Sora/scripts/train.py", line 287, in <module>
main()
File "/home/zdw/opensora2/Open-Sora/scripts/train.py", line 132, in main
text_encoder = build_module(cfg.text_encoder, MODELS, device=device) # T5 must be fp32
File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/opensora/registry.py", line 22, in build_module
return builder.build(cfg)
File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args) # type: ignore
File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/opensora/models/text_encoder/t5.py", line 287, in __init__
self.shardformer_t5()
File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/opensora/models/text_encoder/t5.py", line 292, in shardformer_t5
from opensora.acceleration.shardformer.policy.t5_encoder import T5EncoderPolicy
ModuleNotFoundError: No module named 'opensora.acceleration.shardformer'
[2024-04-02 17:47:22,023] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 2289970) of binary: /data/share8/zdw/miniconda3/envs/opensora/bin/python
Traceback (most recent call last):
File "/data/share8/zdw/miniconda3/envs/opensora/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/share8/zdw/miniconda3/envs/opensora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-04-02_17:47:22
host : ai-gpu-server149
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2289970)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
(opensora) zdw@ai-gpu-server149:~/opensora2/Open-Sora$
I copied the empty __init__.py from the acceleration directory into the shardformer directory.
I followed your steps, but it still reports No module named 'opensora.acceleration.shardformer'.
It looks like the cause is the missing __init__.py file: find_packages skips shardformer when it discovers packages. Add an empty __init__.py under opensora/acceleration/shardformer, then run pip install -v . again.
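The skipping behavior described above can be reproduced on a throwaway directory tree. The sketch below is only an illustration: `find_regular_packages` is a hypothetical, simplified stand-in written with the standard library, not setuptools itself, and it assumes the rule that a directory counts as a package only when it contains an __init__.py. The directory names mirror the ones in this thread, but the tree is created in a temporary folder and has nothing to do with the real Open-Sora checkout.

```python
import os
import tempfile
from pathlib import Path


def find_regular_packages(where: str) -> list[str]:
    """Hypothetical stand-in for setuptools.find_packages.

    A directory is reported as a package only if it contains __init__.py,
    and directories without one are not descended into.
    """
    found = []
    for dirpath, dirnames, _ in os.walk(where):
        keep = []
        for name in sorted(dirnames):
            full = os.path.join(dirpath, name)
            if os.path.isfile(os.path.join(full, "__init__.py")):
                rel = os.path.relpath(full, where)
                found.append(rel.replace(os.sep, "."))
                keep.append(name)
        dirnames[:] = keep  # prune subtrees that are not packages
    return sorted(found)


def discover(with_init: bool) -> list[str]:
    """Build a small package tree and list what discovery would pick up."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        sub = root / "opensora" / "acceleration" / "shardformer"
        sub.mkdir(parents=True)
        # The parent packages always have an __init__.py in this sketch.
        (root / "opensora" / "__init__.py").touch()
        (root / "opensora" / "acceleration" / "__init__.py").touch()
        if with_init:
            (sub / "__init__.py").touch()
        return find_regular_packages(str(root))


print(discover(with_init=False))
# ['opensora', 'opensora.acceleration']
print(discover(with_init=True))
# ['opensora', 'opensora.acceleration', 'opensora.acceleration.shardformer']
```

If setuptools is available in the environment, calling `setuptools.find_packages(where=...)` on the same tree should give a comparable result. Either way, the file has to exist in the checkout before pip install -v . is run, because a regular (non-editable) install copies only the packages that were discovered at build time.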
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 7 days since being marked as stale.
Originally posted by @BountyMage in https://github.com/hpcaitech/Open-Sora/issues/232#issuecomment-2024569674: I have the same error. I added the __init__.py and it still fails.
/home/zdw/Open-Sora/opensora/acceleration/shardformer/__init__.py
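One thing that may be worth checking: the traceback imports opensora from the conda environment's site-packages directory, while the __init__.py above was added under /home/zdw/Open-Sora and the install commands were run in ~/opensora2/Open-Sora. If those are different checkouts, the installed copy may still lack the file. The sketch below is a small diagnostic under that assumption. The helper name `locate` is hypothetical; it only uses `importlib.util.find_spec` from the standard library to report where a package is imported from and whether a subpackage directory with an __init__.py exists there.

```python
import importlib.util
from pathlib import Path


def locate(package: str, subpackage: str) -> dict:
    """Report where `package` is imported from and whether `subpackage`
    exists there as a directory containing __init__.py."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return {"installed": False, "location": None, "has_subpackage": False}
    # First directory on the package's search path (site-packages copy,
    # or the source tree for an editable install).
    location = Path(list(spec.submodule_search_locations)[0])
    sub_dir = location.joinpath(*subpackage.split("."))
    return {
        "installed": True,
        "location": str(location),
        "has_subpackage": (sub_dir / "__init__.py").is_file(),
    }


# Run this inside the same environment that torchrun uses.
print(locate("opensora", "acceleration.shardformer"))
```

If `location` points into site-packages and `has_subpackage` is False, the installed copy was built without the shardformer package, so the __init__.py needs to be added in the checkout that pip installs from before reinstalling.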