Open Daya-Jin opened 20 hours ago
My last conda env: torch 2.5.0, transformers 4.46.1, unsloth 2024.10.7
It's weird! I found it runs normally after I add the argument `--lora_dropout 0.05`, except for a performance warning.
```
[WARNING|logging.py:168] 2024-11-21 20:15:52,531 >> Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05. Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
[WARNING|logging.py:168] 2024-11-21 20:16:12,201 >> Unsloth 2024.10.7 patched 64 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
```
I noticed it patched nothing, but it did run and exit normally.
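For context, here's roughly what that flag maps to on the PEFT side - just a sketch with placeholder r / alpha / target_modules, not LLaMA-Factory's actual code:

```python
from peft import LoraConfig

# Rough equivalent of `--lora_dropout 0.05`; r / lora_alpha / target_modules
# are placeholders, not LLaMA-Factory's defaults.
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,  # any non-zero value makes Unsloth skip its fast LoRA patching
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
# With lora_dropout=0 the warning above should go away and the layers get fast-patched.
```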
Alright, I reproduced this error in a notebook, so I suppose this feature is not supported yet.
I don't think you're supposed to further finetune a model that has already been quantized with AWQ, since AWQ packs and unpacks its weights -> it now effectively has a different architecture than the original Qwen.
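You can see this from the checkpoint's config alone - a quick check with plain transformers (the local path is the one from the log below):

```python
from transformers import AutoConfig

# Local path taken from the training log; any copy of the AWQ checkpoint behaves the same.
cfg = AutoConfig.from_pretrained("/home/jovyan/models/Qwen2.5-32B-Instruct-AWQ")

# The AWQ checkpoint ships a quantization_config with quant_method == "awq",
# so its nn.Linear layers are replaced by packed AWQ modules at load time.
print(getattr(cfg, "quantization_config", None))
# e.g. {'bits': 4, 'group_size': 128, 'quant_method': 'awq', 'version': 'gemm', ...}
```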
But you can use Unsloth's version though -> https://huggingface.co/unsloth/Qwen2.5-32B-bnb-4bit
Oh yep, for now use the original 16-bit weights or bitsandbytes - AWQ has a different quantization pathway.
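Something like this works as a starting point - a rough sketch using Unsloth's usual API; the r / alpha / max_seq_length values are just examples:

```python
from unsloth import FastLanguageModel

# Either Unsloth's pre-quantized bnb-4bit upload, or the original 16-bit weights
# with load_in_4bit=True so bitsandbytes quantizes on the fly.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-32B-bnb-4bit",  # or "Qwen/Qwen2.5-32B-Instruct"
    max_seq_length=2048,  # example value
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,  # 0 keeps the fast-patching path
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```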
I want to LoRA finetune the Qwen2.5-32B-Instruct-AWQ model (already 4-bit quantized) through llama-factory, but an error occurred.
```
[INFO|configuration_utils.py:677] 2024-11-21 19:44:25,957 >> loading configuration file /home/jovyan/models/Qwen2.5-32B-Instruct-AWQ/config.json
[INFO|configuration_utils.py:746] 2024-11-21 19:44:25,960 >> Model config Qwen2Config {
  "_name_or_path": "/home/jovyan/models/Qwen2.5-32B-Instruct-AWQ",
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 5120,
  "initializer_range": 0.02,
  "intermediate_size": 27648,
  "max_position_embeddings": 32768,
  "max_window_layers": 70,
  "model_type": "qwen2",
  "num_attention_heads": 40,
  "num_hidden_layers": 64,
  "num_key_value_heads": 8,
  "quantization_config": {
    "bits": 4,
    "group_size": 128,
    "modules_to_not_convert": null,
    "quant_method": "awq",
    "version": "gemm",
    "zero_point": true
  },
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.46.1",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 152064
}

[INFO|tokenization_utils_base.py:2209] 2024-11-21 19:44:26,772 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2209] 2024-11-21 19:44:26,772 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2209] 2024-11-21 19:44:26,772 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2209] 2024-11-21 19:44:26,772 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2209] 2024-11-21 19:44:26,772 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2209] 2024-11-21 19:44:26,772 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2475] 2024-11-21 19:44:26,973 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Converting format of dataset (num_proc=8): 100%|██████████| 50/50 [00:00<00:00, 112.68 examples/s]
Running tokenizer on dataset (num_proc=8): 100%|██████████| 50/50 [00:01<00:00, 30.23 examples/s]
[INFO|configuration_utils.py:677] 2024-11-21 19:44:30,365 >> loading configuration file /home/jovyan/models/Qwen2.5-32B-Instruct-AWQ/config.json
[INFO|configuration_utils.py:746] 2024-11-21 19:44:30,366 >> Model config Qwen2Config {
[INFO|configuration_utils.py:677] 2024-11-21 19:44:36,337 >> loading configuration file /home/jovyan/models/Qwen2.5-32B-Instruct-AWQ/config.json
[INFO|configuration_utils.py:746] 2024-11-21 19:44:36,338 >> Model config Qwen2Config {
[INFO|configuration_utils.py:677] 2024-11-21 19:44:47,798 >> loading configuration file /home/jovyan/models/Qwen2.5-32B-Instruct-AWQ/config.json
[INFO|configuration_utils.py:746] 2024-11-21 19:44:47,801 >> Model config Qwen2Config {
[WARNING|logging.py:168] 2024-11-21 19:44:47,803 >> Unsloth: /home/jovyan/models/Qwen2.5-32B-Instruct-AWQ can only handle sequence lengths of at most 32768. But with kaiokendev's RoPE scaling of 2.0, it can be magically be extended to 65535!
[INFO|configuration_utils.py:677] 2024-11-21 19:44:47,875 >> loading configuration file /home/jovyan/models/Qwen2.5-32B-Instruct-AWQ/config.json
[INFO|configuration_utils.py:746] 2024-11-21 19:44:47,877 >> Model config Qwen2Config {
  "max_position_embeddings": 65535,
  "rope_scaling": {
    "factor": 1.999969482421875,
    "type": "linear"
[INFO|modeling_utils.py:3934] 2024-11-21 19:44:48,668 >> loading weights file /home/jovyan/models/Qwen2.5-32B-Instruct-AWQ/model.safetensors.index.json
[INFO|modeling_utils.py:1670] 2024-11-21 19:44:48,693 >> Instantiating Qwen2ForCausalLM model under default dtype torch.float16.
[INFO|configuration_utils.py:1096] 2024-11-21 19:44:48,696 >> Generate config GenerationConfig {
Loading checkpoint shards: 100%|██████████| 5/5 [01:55<00:00, 23.19s/it]
[INFO|modeling_utils.py:4800] 2024-11-21 19:46:51,741 >> All model checkpoint weights were used when initializing Qwen2ForCausalLM.
[INFO|modeling_utils.py:4808] 2024-11-21 19:46:51,741 >> All the weights of Qwen2ForCausalLM were initialized from the model checkpoint at /home/jovyan/models/Qwen2.5-32B-Instruct-AWQ. If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2ForCausalLM for predictions without further training.
[INFO|configuration_utils.py:1049] 2024-11-21 19:46:51,776 >> loading configuration file /home/jovyan/models/Qwen2.5-32B-Instruct-AWQ/generation_config.json
[INFO|configuration_utils.py:1096] 2024-11-21 19:46:51,776 >> Generate config GenerationConfig {
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "pad_token_id": 151643,
  "repetition_penalty": 1.05,
  "temperature": 0.7,
  "top_k": 20,
  "top_p": 0.8
}

[WARNING|logging.py:168] 2024-11-21 19:47:12,702 >> Unsloth 2024.10.7 patched 64 layers with 0 QKV layers, 64 O layers and 64 MLP layers.
/home/jovyan/xxx/LLaMA-Factory-main/src/llamafactory/train/sft/trainer.py:54: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `CustomSeq2SeqTrainer.__init__`. Use `processing_class` instead.
  super().__init__(**kwargs)
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO|trainer.py:698] 2024-11-21 19:47:14,443 >> Using auto half precision backend
[WARNING|
```

I tried various combinations of transformers + torch + unsloth versions. So what's the problem here? Or is LoRA finetuning of qwen2_awq_int4 actually not supported? Thanks!