tdrussell / qlora-pipe

A pipeline parallel training script for LLMs.
MIT License

Training crashes with Accelerate 0.33.0 #22

Open rationalism opened 3 weeks ago

After upgrading to Accelerate 0.33.0, bitsandbytes (BNB) QLoRA training crashes while loading checkpoint weights into the pipeline layers, with this stack trace:

loading checkpoint file model-00001-of-00030.safetensors
load params into module <class 'llama_pipe.LlamaDecoderLayerPipe'>
Traceback (most recent call last):
  File "/home/alyssa/lm_fun/qlora-pipe/train.py", line 418, in <module>
    pipeline_model, lora_model, lora_config = load_pipeline_model_with_lora(config, model_type)
  File "/home/alyssa/lm_fun/qlora-pipe/train.py", line 279, in load_pipeline_model_with_lora
    pipeline_model = engine.CustomPipelineModule(
  File "/home/alyssa/lm_fun/qlora-pipe/engine.py", line 274, in __init__
    super().__init__(layers, **kwargs)
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/deepspeed/runtime/pipe/module.py", line 212, in __init__
    self._build()
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/deepspeed/runtime/pipe/module.py", line 268, in _build
    module = layer.build()
  File "/home/alyssa/lm_fun/qlora-pipe/pipeline_model.py", line 75, in build
    return self.typename(*self.module_args, **self.module_kwargs)
  File "/home/alyssa/lm_fun/qlora-pipe/llama_pipe.py", line 113, in __init__
    loader_util.load_state_dict_into_module(self)
  File "/home/alyssa/lm_fun/qlora-pipe/pipeline_model.py", line 316, in load_state_dict_into_module
    transformers.modeling_utils._load_state_dict_into_meta_model(
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/transformers/modeling_utils.py", line 961, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 436, in set_module_tensor_to_device
    new_value = param_cls(new_value, requires_grad=old_value.requires_grad, **kwargs).to(device)
TypeError: Params4bit.__new__() got an unexpected keyword argument 'original_name'
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/alyssa/lm_fun/qlora-pipe/train.py", line 418, in <module>
[rank0]:     pipeline_model, lora_model, lora_config = load_pipeline_model_with_lora(config, model_type)
[rank0]:   File "/home/alyssa/lm_fun/qlora-pipe/train.py", line 279, in load_pipeline_model_with_lora
[rank0]:     pipeline_model = engine.CustomPipelineModule(
[rank0]:   File "/home/alyssa/lm_fun/qlora-pipe/engine.py", line 274, in __init__
[rank0]:     super().__init__(layers, **kwargs)
[rank0]:   File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/deepspeed/runtime/pipe/module.py", line 212, in __init__
[rank0]:     self._build()
[rank0]:   File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/deepspeed/runtime/pipe/module.py", line 268, in _build
[rank0]:     module = layer.build()
[rank0]:   File "/home/alyssa/lm_fun/qlora-pipe/pipeline_model.py", line 75, in build
[rank0]:     return self.typename(*self.module_args, **self.module_kwargs)
[rank0]:   File "/home/alyssa/lm_fun/qlora-pipe/llama_pipe.py", line 113, in __init__
[rank0]:     loader_util.load_state_dict_into_module(self)
[rank0]:   File "/home/alyssa/lm_fun/qlora-pipe/pipeline_model.py", line 316, in load_state_dict_into_module
[rank0]:     transformers.modeling_utils._load_state_dict_into_meta_model(
[rank0]:   File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/transformers/modeling_utils.py", line 961, in _load_state_dict_into_meta_model
[rank0]:     set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
[rank0]:   File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 436, in set_module_tensor_to_device
[rank0]:     new_value = param_cls(new_value, requires_grad=old_value.requires_grad, **kwargs).to(device)
[rank0]: TypeError: Params4bit.__new__() got an unexpected keyword argument 'original_name'
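The failing line in the trace rebuilds the parameter by calling `param_cls(new_value, requires_grad=..., **kwargs)`. Below is a minimal sketch of how that call pattern can raise this kind of `TypeError`. It assumes the extra keyword arguments are collected from attributes on the old parameter object, which the trace alone does not confirm. `FakeParams4bit` and `rebuild` are hypothetical stand-ins, not the bitsandbytes or Accelerate code.

```python
class FakeParams4bit:
    """Hypothetical stand-in for a parameter class whose constructor
    accepts only a fixed set of keyword arguments."""

    def __new__(cls, data, requires_grad=False, quant_type="nf4"):
        obj = super().__new__(cls)
        obj.data = data
        obj.requires_grad = requires_grad
        obj.quant_type = quant_type
        return obj


def rebuild(old_value, new_data):
    """Re-create a parameter using the same call shape as the line in the trace.

    Assumption: kwargs are taken from the old object's attribute dict.
    """
    kwargs = dict(old_value.__dict__)
    kwargs.pop("data")
    kwargs.pop("requires_grad")
    return type(old_value)(new_data, requires_grad=old_value.requires_grad, **kwargs)


old = FakeParams4bit([1, 2, 3])

# With only the attributes the constructor knows about, rebuilding works.
rebuilt = rebuild(old, [4, 5, 6])

# Once some other code attaches an attribute the constructor does not accept,
# the same call fails with an "unexpected keyword argument" TypeError.
old.original_name = "weight"
try:
    rebuild(old, [7, 8, 9])
    message = None
except TypeError as exc:
    message = str(exc)
```

If this is what happens here, the fix would be either for the parameter class to accept the new keyword or for the caller to filter it out before the call.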

I suspect this PR is the cause:

https://github.com/huggingface/accelerate/pull/2934

This PR might also be relevant:

https://github.com/huggingface/accelerate/pull/2986

Reverting to Accelerate 0.32.0 resolves the crash. Thank you!
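Until this is fixed, a startup check along these lines could flag the affected version before training begins. This is only a sketch: `KNOWN_BAD`, `check_accelerate`, and `installed_accelerate_version` are hypothetical names, not part of qlora-pipe. The only versions it knows about are the two reported here (0.33.0 crashes, 0.32.0 works), so it makes no claim about other releases.

```python
from importlib import metadata

# Versions taken from this report only; other releases are untested here.
KNOWN_BAD = {"0.33.0"}
KNOWN_GOOD_PIN = "0.32.0"


def check_accelerate(installed):
    """Return a warning string if `installed` is a version reported to crash,
    otherwise None."""
    if installed in KNOWN_BAD:
        return (
            f"accelerate {installed} is reported to crash BNB QLoRA loading; "
            f"try: pip install accelerate=={KNOWN_GOOD_PIN}"
        )
    return None


def installed_accelerate_version():
    """Return the installed accelerate version string, or None if it is absent."""
    try:
        return metadata.version("accelerate")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    warning = check_accelerate(installed_accelerate_version())
    if warning:
        print(warning)
```

Called near the top of a training script, this would print the pin suggestion instead of failing deep inside model loading.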