linkedin / Liger-Kernel

Efficient Triton Kernels for LLM Training
BSD 2-Clause "Simplified" License

Encountered errors when reproducing lightning training example #271

Open ReginaZh opened 4 days ago

ReginaZh commented 4 days ago

🐛 Describe the bug

Tried to reproduce the Liger Kernel optimization with the Lightning trainer and DeepSpeed ZeRO-3, but encountered several errors.



cd /examples/lightning/
python --model Qwen/Qwen2-0.5B-Instruct --num_gpu 1 --max_length 1024 --strategy deepspeed


[INFO] [] Setting ds_accelerator to cuda (auto detect)
/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/capi/ UserWarning: onnxruntime training package info: package_name: onnxruntime-training
  warnings.warn("onnxruntime training package info: package_name: %s" % package_name)
/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/capi/ UserWarning: onnxruntime training package info: __version__: 1.18.0
  warnings.warn("onnxruntime training package info: __version__: %s" % version)
/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/capi/ UserWarning: onnxruntime training package info: cuda_version: 12.2
  warnings.warn("onnxruntime training package info: cuda_version: %s" % cuda_version)
/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/capi/ UserWarning: onnxruntime build info: cudart_version: 12020
  warnings.warn("onnxruntime build info: cudart_version: %s" % cudart_version)
/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/capi/ UserWarning: WARNING: failed to find cudart version that matches onnxruntime build info
  warnings.warn("WARNING: failed to find cudart version that matches onnxruntime build info")
/opt/conda/envs/ptca/lib/python3.10/site-packages/onnxruntime/capi/ UserWarning: WARNING: found cudart versions: [12010]
  warnings.warn("WARNING: found cudart versions: %s" % local_cudart_versions)
2024-09-26 03:11:07.596978: E external/local_xla/xla/stream_executor/cuda/] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-09-26 03:11:07.611316: E external/local_xla/xla/stream_executor/cuda/] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-09-26 03:11:07.615979: E external/local_xla/xla/stream_executor/cuda/] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-09-26 03:11:07.627834: I tensorflow/core/platform/] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-09-26 03:11:08.472073: W tensorflow/compiler/tf2tensorrt/utils/] TF-TRT Warning: Could not find TensorRT
Seed set to 42
2024-09-26 03:11:09,359 root [WARNING] - Cannot import JIT optimized kernels. CUDA extension will be disabled.
Traceback (most recent call last):
  File "/Liger-Kernel/examples/lightning/", line 289, in <module>
  File "/Liger-Kernel/examples/lightning/", line 257, in train
    strategy = DeepSpeedStrategy(stage=3)
  File "/opt/conda/envs/ptca/lib/python3.10/site-packages/lightning/pytorch/strategies/", line 305, in __init__
AttributeError: module 'deepspeed.utils' has no attribute 'logging'

I fixed the above error by adding "import deepspeed" in, but after that another error was raised:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/Liger-Kernel/examples/lightning/", line 289, in <module>
[rank0]:     train()
[rank0]:   File "/Liger-Kernel/examples/lightning/", line 285, in train
[rank0]:, datamodule=data_module)
[rank0]:   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/lightning/pytorch/trainer/", line 538, in fit
[rank0]:     call._call_and_handle_interrupt(
[rank0]:   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/lightning/pytorch/trainer/", line 46, in _call_and_handle_interrupt
[rank0]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]:   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/lightning/pytorch/strategies/launchers/", line 105, in launch
[rank0]:     return function(*args, **kwargs)
[rank0]:   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/lightning/pytorch/trainer/", line 574, in _fit_impl
[rank0]:     self._run(model, ckpt_path=ckpt_path)
[rank0]:   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/lightning/pytorch/trainer/", line 945, in _run
[rank0]:     call._call_configure_model(self)
[rank0]:   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/lightning/pytorch/trainer/", line 119, in _call_configure_model
[rank0]:     _call_lightning_module_hook(trainer, "configure_model")
[rank0]:   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/lightning/pytorch/trainer/", line 167, in _call_lightning_module_hook
[rank0]:     output = fn(*args, **kwargs)
[rank0]:   File "/Liger-Kernel/examples/lightning/", line 76, in configure_model
[rank0]:     self.model = AutoLigerKernelForCausalLM.from_pretrained(
[rank0]:   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/liger_kernel/transformers/", line 31, in from_pretrained
[rank0]:     return super().from_pretrained(
[rank0]:   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/transformers/models/auto/", line 564, in from_pretrained
[rank0]:     return model_class.from_pretrained(
[rank0]:   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/transformers/", line 3838, in from_pretrained
[rank0]:     ) = cls._load_pretrained_model(
[rank0]:   File "/opt/conda/envs/ptca/lib/python3.10/site-packages/transformers/", line 4349, in _load_pretrained_model
[rank0]:     raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
[rank0]: RuntimeError: Error(s) in loading state_dict for Qwen2ForCausalLM:
[rank0]:        size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([151936, 896]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]:        size mismatch for model.layers.0.self_attn.q_proj.weight: copying a param with shape torch.Size([896, 896]) from checkpoint, the shape in current model is torch.Size([0]).
[rank0]:        size mismatch for model.layers.0.self_attn.q_proj.bias: copying a param with shape torch.Size([896]) from checkpoint, the shape in current model is torch.Size([0]).


Environment Report:

Operating System: Linux-6.5.0-1025-azure-x86_64-with-glibc2.31
Python version: 3.10.14
PyTorch version: 2.4.1+cu121
CUDA version: 12.1
Triton version: 3.0.0
Transformers version: 4.42.4
deepspeed version: 0.15.0
liger_kernel version: 0.3.0

yundai424 commented 1 day ago

I think it's related to the DeepSpeed model init method. When using DeepSpeed, the model should be initialized in a context where every newly created tensor has shape 0, and the sharding and broadcast are implemented inside the DeepSpeed source. Something may have broken either in recent Liger changes or in a new DeepSpeed/HF release. I will take a look and get back to this issue ASAP.
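
The zero-shape initialization described above would explain the "size mismatch" lines in the traceback: the checkpoint carries full shapes while every local parameter is an empty placeholder. A minimal sketch of that comparison, using plain tuples instead of tensors; `find_size_mismatches` is a hypothetical helper written for illustration, not a DeepSpeed or Transformers function:

```python
# Illustrative only: shapes are plain tuples, no torch or deepspeed required.

def find_size_mismatches(checkpoint_shapes, model_shapes):
    """Return one message per parameter whose checkpoint shape differs from the model's."""
    messages = []
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)
        if model_shape is not None and model_shape != ckpt_shape:
            messages.append(
                f"size mismatch for {name}: checkpoint {ckpt_shape}, model {model_shape}"
            )
    return messages


# Shapes taken from the traceback in the report.
checkpoint_shapes = {
    "model.embed_tokens.weight": (151936, 896),
    "model.layers.0.self_attn.q_proj.weight": (896, 896),
    "model.layers.0.self_attn.q_proj.bias": (896,),
}

# Assumption: under a ZeRO-3 style init, each local parameter is an empty placeholder.
placeholder_shapes = {name: (0,) for name in checkpoint_shapes}

mismatches = find_size_mismatches(checkpoint_shapes, placeholder_shapes)
for line in mismatches:
    print(line)
```

With placeholders every parameter is reported, which is why a strict shape check cannot be used on this code path.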

yundai424 commented 1 day ago

So it was that ignore_mismatched_sizes=True was accidentally dropped, and it has been fixed very recently in 😄 @ReginaZh you can try installing liger-kernel-nightly and it should fix your issue. @shimizust do you think we can make a quick patch release for it 🤔 ?
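
For readers wondering how a flag can get dropped on the way to the loader, here is a rough sketch of the general pattern. It assumes the flag is the one spelled `ignore_mismatched_sizes` in the Transformers `from_pretrained` API. Every other name (`base_from_pretrained`, `WRAPPER_OPTIONS`, `some_wrapper_option`, both wrappers) is hypothetical, and this is not the actual Liger code or the actual fix:

```python
def base_from_pretrained(model_path, **kwargs):
    """Hypothetical stand-in for the underlying loader; returns what it received."""
    return {"model_path": model_path, "kwargs": dict(kwargs)}


# Hypothetical option that the wrapper consumes itself.
WRAPPER_OPTIONS = {"some_wrapper_option"}


def dropping_wrapper(model_path, **kwargs):
    """Buggy pattern: forwards only recognised options, silently dropping the rest."""
    kept = {k: v for k, v in kwargs.items() if k in WRAPPER_OPTIONS}
    return base_from_pretrained(model_path, **kept)


def passthrough_wrapper(model_path, **kwargs):
    """Safer pattern: consume the wrapper's own options, forward everything else."""
    forwarded = {k: v for k, v in kwargs.items() if k not in WRAPPER_OPTIONS}
    return base_from_pretrained(model_path, **forwarded)


call_kwargs = {"ignore_mismatched_sizes": True, "some_wrapper_option": True}
buggy = dropping_wrapper("Qwen/Qwen2-0.5B-Instruct", **call_kwargs)
fixed = passthrough_wrapper("Qwen/Qwen2-0.5B-Instruct", **call_kwargs)

print("ignore_mismatched_sizes" in buggy["kwargs"])  # False
print("ignore_mismatched_sizes" in fixed["kwargs"])  # True
```

If the flag never reaches the loader, the strict shape check runs against the zero-shape placeholders and produces the RuntimeError shown in the report.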