unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

failed finetune qwen32b_awq_int4 using lora with llama-factory #1314

Open Daya-Jin opened 20 hours ago

Daya-Jin commented 20 hours ago

I want to LoRA-finetune the Qwen2.5-32B-Instruct-AWQ model (already 4-bit quantized) through llama-factory, but an error occurred.

```
[INFO|configuration_utils.py:677] 2024-11-21 19:44:25,957 >> loading configuration file /home/jovyan/models/Qwen2.5-32B-Instruct-AWQ/config.json
[INFO|configuration_utils.py:746] 2024-11-21 19:44:25,960 >> Model config Qwen2Config { "_name_or_path": "/home/jovyan/models/Qwen2.5-32B-Instruct-AWQ", "architectures": [ "Qwen2ForCausalLM" ], "attention_dropout": 0.0, "bos_token_id": 151643, "eos_token_id": 151645, "hidden_act": "silu", "hidden_size": 5120, "initializer_range": 0.02, "intermediate_size": 27648, "max_position_embeddings": 32768, "max_window_layers": 70, "model_type": "qwen2", "num_attention_heads": 40, "num_hidden_layers": 64, "num_key_value_heads": 8, "quantization_config": { "bits": 4, "group_size": 128, "modules_to_not_convert": null, "quant_method": "awq", "version": "gemm", "zero_point": true }, "rms_norm_eps": 1e-06, "rope_scaling": null, "rope_theta": 1000000.0, "sliding_window": null, "tie_word_embeddings": false, "torch_dtype": "float16", "transformers_version": "4.46.1", "use_cache": true, "use_sliding_window": false, "vocab_size": 152064 }
[INFO|tokenization_utils_base.py:2209] 2024-11-21 19:44:26,772 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2209] 2024-11-21 19:44:26,772 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2209] 2024-11-21 19:44:26,772 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2209] 2024-11-21 19:44:26,772 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2209] 2024-11-21 19:44:26,772 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2209] 2024-11-21 19:44:26,772 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2475] 2024-11-21 19:44:26,973 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Converting format of dataset (num_proc=8): 100%|██████████| 50/50 [00:00<00:00, 112.68 examples/s]
Running tokenizer on dataset (num_proc=8): 100%|██████████| 50/50 [00:01<00:00, 30.23 examples/s]
[INFO|configuration_utils.py:677] 2024-11-21 19:44:30,365 >> loading configuration file /home/jovyan/models/Qwen2.5-32B-Instruct-AWQ/config.json
[INFO|configuration_utils.py:746] 2024-11-21 19:44:30,366 >> Model config Qwen2Config {
[INFO|configuration_utils.py:677] 2024-11-21 19:44:36,337 >> loading configuration file /home/jovyan/models/Qwen2.5-32B-Instruct-AWQ/config.json
[INFO|configuration_utils.py:746] 2024-11-21 19:44:36,338 >> Model config Qwen2Config {
[INFO|configuration_utils.py:677] 2024-11-21 19:44:47,798 >> loading configuration file /home/jovyan/models/Qwen2.5-32B-Instruct-AWQ/config.json
[INFO|configuration_utils.py:746] 2024-11-21 19:44:47,801 >> Model config Qwen2Config {
[WARNING|logging.py:168] 2024-11-21 19:44:47,803 >> Unsloth: /home/jovyan/models/Qwen2.5-32B-Instruct-AWQ can only handle sequence lengths of at most 32768. But with kaiokendev's RoPE scaling of 2.0, it can be magically be extended to 65535!
[INFO|configuration_utils.py:677] 2024-11-21 19:44:47,875 >> loading configuration file /home/jovyan/models/Qwen2.5-32B-Instruct-AWQ/config.json
[INFO|configuration_utils.py:746] 2024-11-21 19:44:47,877 >> Model config Qwen2Config { "max_position_embeddings": 65535, "rope_scaling": { "factor": 1.999969482421875, "type": "linear"
[INFO|modeling_utils.py:3934] 2024-11-21 19:44:48,668 >> loading weights file /home/jovyan/models/Qwen2.5-32B-Instruct-AWQ/model.safetensors.index.json
[INFO|modeling_utils.py:1670] 2024-11-21 19:44:48,693 >> Instantiating Qwen2ForCausalLM model under default dtype torch.float16.
[INFO|configuration_utils.py:1096] 2024-11-21 19:44:48,696 >> Generate config GenerationConfig {
Loading checkpoint shards: 100%|██████████| 5/5 [01:55<00:00, 23.19s/it]
[INFO|modeling_utils.py:4800] 2024-11-21 19:46:51,741 >> All model checkpoint weights were used when initializing Qwen2ForCausalLM.
[INFO|modeling_utils.py:4808] 2024-11-21 19:46:51,741 >> All the weights of Qwen2ForCausalLM were initialized from the model checkpoint at /home/jovyan/models/Qwen2.5-32B-Instruct-AWQ. If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2ForCausalLM for predictions without further training.
[INFO|configuration_utils.py:1049] 2024-11-21 19:46:51,776 >> loading configuration file /home/jovyan/models/Qwen2.5-32B-Instruct-AWQ/generation_config.json
[INFO|configuration_utils.py:1096] 2024-11-21 19:46:51,776 >> Generate config GenerationConfig { "do_sample": true, "eos_token_id": [ 151645, 151643 ], "pad_token_id": 151643, "repetition_penalty": 1.05, "temperature": 0.7, "top_k": 20, "top_p": 0.8 }
[WARNING|logging.py:168] 2024-11-21 19:47:12,702 >> Unsloth 2024.10.7 patched 64 layers with 0 QKV layers, 64 O layers and 64 MLP layers.
/home/jovyan/xxx/LLaMA-Factory-main/src/llamafactory/train/sft/trainer.py:54: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `CustomSeq2SeqTrainer.__init__`. Use `processing_class` instead.
  super().__init__(**kwargs)
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO|trainer.py:698] 2024-11-21 19:47:14,443 >> Using auto half precision backend
[WARNING|:208] 2024-11-21 19:47:19,827 >>
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 50 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 2
\        /    Total batch size = 2 | Total steps = 25
 "-____-"     Number of trainable parameters = 134,217,728
[rank0]:     loss = super().compute_loss(model, inputs, return_outputs, **kwargs)
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/unsloth/models/_utils.py", line 1183, in _unsloth_pre_compute_loss
[rank0]:     return self._old_compute_loss(model, inputs, *args, **kwargs)
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/transformers/trainer.py", line 3625, in compute_loss
[rank0]:     outputs = model(**inputs)
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1643, in forward
[rank0]:     else self._run_ddp_forward(*inputs, **kwargs)
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1459, in _run_ddp_forward
[rank0]:     return self.module(*inputs, **kwargs)  # type: ignore[index]
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/accelerate/utils/operations.py", line 820, in forward
[rank0]:     return model_forward(*args, **kwargs)
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/accelerate/utils/operations.py", line 808, in __call__
[rank0]:     return convert_to_fp32(self.model_forward(*args, **kwargs))
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/torch/_compile.py", line 32, in inner
[rank0]:     return disable_fn(*args, **kwargs)
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/unsloth/models/llama.py", line 1044, in PeftModelForCausalLM_fast_forward
[rank0]:     return self.base_model(
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 188, in forward
[rank0]:     return self.model.forward(*args, **kwargs)
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/unsloth/models/llama.py", line 942, in _CausalLM_fast_forward
[rank0]:     outputs = self.model(
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/unsloth/models/llama.py", line 776, in LlamaModel_fast_forward
[rank0]:     hidden_states = Unsloth_Offloaded_Gradient_Checkpointer.apply(
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
[rank0]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 465, in decorate_fwd
[rank0]:     return fwd(*args, **kwargs)
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/unsloth/models/_utils.py", line 807, in forward
[rank0]:     output = forward_function(hidden_states, *args)
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/unsloth/models/llama.py", line 491, in LlamaDecoderLayer_fast_forward
[rank0]:     hidden_states, self_attn_weights, present_key_value = self.self_attn(
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/unsloth/models/llama.py", line 436, in LlamaAttention_fast_forward
[rank0]:     attn_output = self.apply_o(self, attn_output)
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/unsloth/kernels/fast_lora.py", line 409, in apply_lora_o
[rank0]:     OW, OW_quant, OA, OB, OS = get_lora_parameters(self.o_proj)
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/unsloth/kernels/utils.py", line 78, in get_lora_parameters
[rank0]:     W = base_layer.weight
[rank0]:   File "/home/jovyan/.aip_conda/xxx/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1931, in __getattr__
[rank0]:     raise AttributeError(
[rank0]: AttributeError: 'WQLinear_GEMM' object has no attribute 'weight'. Did you mean: 'qweight'?
```

I tried various combinations of transformers + torch + unsloth versions. So what's the problem here? Or is LoRA finetuning of qwen2_awq_int4 actually not supported? Thanks!

Daya-Jin commented 20 hours ago

My last conda env: torch 2.5.0, transformers 4.46.1, unsloth 2024.10.7

Daya-Jin commented 20 hours ago

It's weird! I found it runs normally after I add the argument --lora_dropout 0.05, apart from a performance warning.

```
[WARNING|logging.py:168] 2024-11-21 20:15:52,531 >> Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05. Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
[WARNING|logging.py:168] 2024-11-21 20:16:12,201 >> Unsloth 2024.10.7 patched 64 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
```

I noticed it patched nothing, but it does run and exit normally.

Daya-Jin commented 18 hours ago

Alright, I reproduced this error in a notebook, so I suppose this feature is not supported yet.

Erland366 commented 16 hours ago

I don't think you're supposed to further finetune a model that has already been quantized with AWQ, since AWQ packs and unpacks its weights -> it now has a somewhat different architecture than the original Qwen.
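For what it's worth, you can see the mismatch the traceback complains about directly: AWQ checkpoints replace nn.Linear with WQLinear_GEMM modules, which store packed qweight / qzeros / scales tensors instead of a plain .weight, which is what Unsloth's fast-LoRA path reads via base_layer.weight. A minimal sketch (assumes autoawq is installed and a CUDA GPU is available; the local path is the one from the log above):

```python
# Sketch: inspect one projection layer of the AWQ checkpoint to see why
# Unsloth's get_lora_parameters() fails with AttributeError on .weight.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/home/jovyan/models/Qwen2.5-32B-Instruct-AWQ",  # path from the log above
    device_map="auto",
)

o_proj = model.model.layers[0].self_attn.o_proj
print(type(o_proj).__name__)       # WQLinear_GEMM, not nn.Linear
print(hasattr(o_proj, "weight"))   # False -> the AttributeError in the traceback
print(hasattr(o_proj, "qweight"))  # True  -> the 4-bit packed weights live here
```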

But you can use Unsloth's one tho -> https://huggingface.co/unsloth/Qwen2.5-32B-bnb-4bit
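A rough sketch of that route with Unsloth's API (the hyperparameters here are illustrative, not taken from the run above):

```python
# Sketch: LoRA on Unsloth's bitsandbytes 4-bit Qwen2.5-32B instead of the AWQ checkpoint.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-32B-bnb-4bit",  # bitsandbytes 4-bit, not AWQ
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,  # dropout = 0 keeps Unsloth's fast LoRA patching (see the warning above)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```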

danielhanchen commented 8 hours ago

Oh yep, for now use the original 16-bit weights or bitsandbytes - AWQ has a different quantization pathway
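Independent of llama-factory, the bitsandbytes route looks roughly like this with plain transformers + peft (a sketch only; the model name and LoRA settings are illustrative):

```python
# Sketch: QLoRA on the original 16-bit Qwen2.5-32B-Instruct weights,
# quantized on the fly with bitsandbytes instead of using an AWQ checkpoint.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-32B-Instruct",   # original 16-bit checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
lora_config = LoraConfig(
    r=16, lora_alpha=16, lora_dropout=0.0, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
# bitsandbytes Linear4bit layers still expose .weight, so LoRA attaches cleanly.
model = get_peft_model(model, lora_config)
```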