RuntimeError: Failed to replace layer_norm of type T5LayerNorm with T5LayerNorm with the exception

rlsu9 commented 1 week ago

Hi, thanks for your amazing work. When I run the training script with the following command:

CUDA_VISIBLE_DEVICES=1 /home/user/anaconda3/envs/opensorav1.2/bin/torchrun --standalone --nproc_per_node 1 scripts/train.py configs/opensora-v1-2/train/stage1.py --data-path ./single_video_caption.csv

I got the following error:

[rank0]:   File "/home/user/anaconda3/envs/opensorav1.2/lib/python3.9/site-packages/colossalai/shardformer/shard/sharder.py", line 201, in _replace_sub_module
[rank0]:     raise RuntimeError(
[rank0]: RuntimeError: Failed to replace layer_norm of type T5LayerNorm with T5LayerNorm with the exception: Recovering T5LayerNorm requires the original layer to be apex's Fused RMS Norm.Apex's fused norm is automatically used by Hugging Face Transformers https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L265C5-L265C48. Please check your model configuration or sharding policy, you can set up an issue for us to help you as well.

Do you have any idea what causes this error? Thanks in advance for your help.

syc11-25 commented 1 week ago

I ran into the same problem. The only workaround I found was to set shardformer=false for T5.
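A minimal sketch of that workaround, assuming the training config (configs/opensora-v1-2/train/stage1.py in the command above) defines the T5 text encoder as a Python dict with a `shardformer` key. The field names and values other than `shardformer` are assumptions for illustration and may differ in your checkout:

```python
# Hypothetical excerpt of the training config; only the `shardformer`
# flag is taken from this thread, the other keys are assumed.
text_encoder = dict(
    type="t5",
    from_pretrained="DeepFloyd/t5-v1_1-xxl",  # assumed checkpoint name
    model_max_length=300,                     # assumed value
    # Setting this to False skips the ColossalAI shardformer layer
    # replacement for T5, which is the step that raises the
    # "Failed to replace layer_norm" RuntimeError.
    shardformer=False,
)
```

This only avoids the failing layer replacement. It does not make the shardformer policy work with the T5LayerNorm that is loaded when apex's fused RMS norm is not in use.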

rlsu9 commented 1 week ago

Thanks a lot for your help.

hpcaitech / Open-Sora, issue #521