**Open** · rationalism opened this issue 3 weeks ago
Upgrading to the new accelerate release, 0.33.0, makes BNB QLoRA training crash with this stack trace:
```
loading checkpoint file model-00001-of-00030.safetensors
load params into module <class 'llama_pipe.LlamaDecoderLayerPipe'>
Traceback (most recent call last):
  File "/home/alyssa/lm_fun/qlora-pipe/train.py", line 418, in <module>
    pipeline_model, lora_model, lora_config = load_pipeline_model_with_lora(config, model_type)
  File "/home/alyssa/lm_fun/qlora-pipe/train.py", line 279, in load_pipeline_model_with_lora
    pipeline_model = engine.CustomPipelineModule(
  File "/home/alyssa/lm_fun/qlora-pipe/engine.py", line 274, in __init__
    super().__init__(layers, **kwargs)
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/deepspeed/runtime/pipe/module.py", line 212, in __init__
    self._build()
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/deepspeed/runtime/pipe/module.py", line 268, in _build
    module = layer.build()
  File "/home/alyssa/lm_fun/qlora-pipe/pipeline_model.py", line 75, in build
    return self.typename(*self.module_args, **self.module_kwargs)
  File "/home/alyssa/lm_fun/qlora-pipe/llama_pipe.py", line 113, in __init__
    loader_util.load_state_dict_into_module(self)
  File "/home/alyssa/lm_fun/qlora-pipe/pipeline_model.py", line 316, in load_state_dict_into_module
    transformers.modeling_utils._load_state_dict_into_meta_model(
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/transformers/modeling_utils.py", line 961, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/home/alyssa/anaconda3/envs/lm_fun/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 436, in set_module_tensor_to_device
    new_value = param_cls(new_value, requires_grad=old_value.requires_grad, **kwargs).to(device)
TypeError: Params4bit.__new__() got an unexpected keyword argument 'original_name'
```

(Rank 0 prints the same traceback again with a `[rank0]:` prefix.)
Suspect it's because of this PR:
https://github.com/huggingface/accelerate/pull/2934
This PR might also be relevant:
https://github.com/huggingface/accelerate/pull/2986
Reverting to Accelerate 0.32.0 resolves the crash. Thank you!
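Until this is fixed upstream, a small version guard can make the failure explicit at startup instead of surfacing as a `TypeError` deep inside `set_module_tensor_to_device`. This is only a hedged sketch, not qlora-pipe code: the helper names and the assumption that every release at or above 0.33.0 is affected are mine.

```python
# Hedged sketch: fail fast on the accelerate release this issue reports as
# broken for BNB QLoRA loading. Not part of qlora-pipe; the ">= 0.33.0"
# bound is an assumption (later releases may ship a fix).

def parse_version(v: str) -> tuple[int, ...]:
    """Parse a simple 'X.Y.Z' version string into a comparable tuple."""
    return tuple(int(part) for part in v.split(".")[:3])

def is_affected_accelerate(v: str) -> bool:
    """True for accelerate versions at or above 0.33.0 (heuristic)."""
    return parse_version(v) >= (0, 33, 0)

if __name__ == "__main__":
    import importlib.metadata

    installed = importlib.metadata.version("accelerate")
    if is_affected_accelerate(installed):
        raise RuntimeError(
            f"accelerate {installed} crashes BNB QLoRA loading "
            "(Params4bit.__new__ rejects 'original_name'); "
            "pin accelerate==0.32.0 as a workaround."
        )
```

A check like this could sit at the top of `train.py` so multi-rank runs abort with one clear message rather than a per-rank traceback.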