Hi, thanks for your amazing work. When I run the training script with the following command:
CUDA_VISIBLE_DEVICES=1 /home/user/anaconda3/envs/opensorav1.2/bin/torchrun --standalone --nproc_per_node 1 scripts/train.py configs/opensora-v1-2/train/stage1.py --data-path ./single_video_caption.csv
I got the following error:
[rank0]: File "/home/user/anaconda3/envs/opensorav1.2/lib/python3.9/site-packages/colossalai/shardformer/shard/sharder.py", line 201, in _replace_sub_module
[rank0]: raise RuntimeError(
[rank0]: RuntimeError: Failed to replace layer_norm of type T5LayerNorm with T5LayerNorm with the exception: Recovering T5LayerNorm requires the original layer to be apex's Fused RMS Norm.Apex's fused norm is automatically used by Hugging Face Transformers https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L265C5-L265C48. Please check your model configuration or sharding policy, you can set up an issue for us to help you as well.
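The message suggests that the sharding policy expected the T5 layer norm to be apex's fused RMS norm. A minimal sketch for checking whether apex's `FusedRMSNorm` can be imported in this environment might look like the following. The helper name `apex_fused_rms_norm_available` is hypothetical, and the check only tests importability; it does not confirm that this is the actual cause of the failure.

```python
def apex_fused_rms_norm_available() -> bool:
    """Return True if apex's FusedRMSNorm can be imported in this environment.

    Hypothetical diagnostic helper: it only tests importability and says
    nothing about whether the sharding step would then succeed.
    """
    try:
        # Same import that Hugging Face's modeling_t5.py attempts at load time.
        from apex.normalization import FusedRMSNorm  # noqa: F401
    except Exception:
        # ImportError if apex is missing; other errors if the build is broken.
        return False
    return True


if __name__ == "__main__":
    print("apex FusedRMSNorm importable:", apex_fused_rms_norm_available())
```

Running this with the same Python interpreter as the training environment would show whether apex is visible to it.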
Do you have any idea what might be causing this? Thanks in advance for your help.