hiyouga / LLaMA-Factory

Efficiently Fine-Tune 100+ LLMs in WebUI (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

LLaMA-Factory: error when exporting (export) the model after fine-tuning Qwen2-7B-Instruct-GPTQ-Int8 #4647

Closed caijx168 closed 2 months ago

caijx168 commented 2 months ago

Reminder

System Info

(LLaMA-Factory) root@root1-System-Product-Name:/home/LLaMA-Factory# pip list
Package Version


accelerate 0.29.2 aiofiles 23.2.1 aiohttp 3.9.4 aiosignal 1.3.1 altair 5.3.0 annotated-types 0.6.0 anyio 4.3.0 async-timeout 4.0.3 attrs 23.2.0 auto_gptq 0.7.1 autoawq 0.2.4 autoawq_kernels 0.0.6 bitsandbytes 0.43.1 certifi 2024.2.2 charset-normalizer 3.3.2 click 8.1.7 cloudpickle 3.0.0 cmake 3.29.2 coloredlogs 15.0.1 contourpy 1.2.1 cycler 0.12.1 datasets 2.18.0 dill 0.3.8 diskcache 5.6.3 docstring_parser 0.16 einops 0.7.0 exceptiongroup 1.2.0 fastapi 0.110.1 ffmpy 0.3.2 filelock 3.13.4 fire 0.6.0 fonttools 4.51.0 frozenlist 1.4.1 fsspec 2024.2.0 gekko 1.1.1 gradio 4.21.0 gradio_client 0.12.0 h11 0.14.0 httpcore 1.0.5 httptools 0.6.1 httpx 0.27.0 huggingface-hub 0.22.2 humanfriendly 10.0 idna 3.7 importlib_resources 6.4.0 interegular 0.3.3 Jinja2 3.1.3 joblib 1.4.0 jsonschema 4.21.1 jsonschema-specifications 2023.12.1 kiwisolver 1.4.5 lark 1.1.9 llvmlite 0.42.0 markdown-it-py 3.0.0 MarkupSafe 2.1.5 matplotlib 3.8.4 mdurl 0.1.2 mpmath 1.3.0 msgpack 1.0.8 multidict 6.0.5 multiprocess 0.70.16 nest-asyncio 1.6.0 networkx 3.3 ninja 1.11.1.1 numba 0.59.1 numpy 1.26.4 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-nccl-cu12 2.18.1 nvidia-nvjitlink-cu12 12.4.127 nvidia-nvtx-cu12 12.1.105 optimum 1.18.1 orjson 3.10.0 outlines 0.0.34 packaging 24.0 pandas 2.2.2 peft 0.10.0 pillow 10.3.0 pip 23.3.1 prometheus_client 0.20.0 protobuf 5.26.1 psutil 5.9.8 py-cpuinfo 9.0.0 pyarrow 15.0.2 pyarrow-hotfix 0.6 pydantic 2.7.0 pydantic_core 2.18.1 pydub 0.25.1 Pygments 2.17.2 pynvml 11.5.0 pyparsing 3.1.2 python-dateutil 2.9.0.post0 python-dotenv 1.0.1 python-multipart 0.0.9 pytz 2024.1 PyYAML 6.0.1 ray 2.11.0 referencing 0.34.0 regex 2023.12.25 requests 2.31.0 rich 13.7.1 rouge 1.0.1 rpds-py 0.18.0 ruff 0.3.7 safetensors 0.4.2 scipy 1.13.0 semantic-version 2.10.0 sentencepiece 0.2.0 setuptools 68.2.2 shellingham 1.5.4 shtab 1.7.1 six 1.16.0 sniffio 1.3.1 sse-starlette 2.1.0 starlette 0.37.2 sympy 1.12 termcolor 2.4.0 tiktoken 0.6.0 tokenizers 0.19.1 tomlkit 0.12.0 toolz 0.12.1 torch 2.1.2 tqdm 4.66.2 transformers 4.40.0 transformers-stream-generator 0.0.5 triton 2.1.0 trl 0.8.2 typer 0.12.3 typing_extensions 4.11.0 tyro 0.8.3 tzdata 2024.1 urllib3 2.2.1 uvicorn 0.29.0 uvloop 0.19.0 vllm 0.4.0.post1 watchfiles 0.21.0 websockets 11.0.3 wheel 0.41.2 xformers 0.0.23.post1 xxhash 3.4.1 yarl 1.9.4 zstandard 0.22.0

Reproduction

(LLaMA-Factory) root@root1-System-Product-Name:/home/LLaMA-Factory# CUDA_VISIBLE_DEVICES=0 USE_MODELSCOPE_HUB=1 python src/train_web.py
Running on local URL: http://0.0.0.0:7860

To create a public link, set share=True in launch(). IMPORTANT: You are using gradio version 4.21.0, however version 4.29.0 is available, please upgrade.

[INFO|tokenization_utils_base.py:2085] 2024-07-02 16:01:59,025 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2085] 2024-07-02 16:01:59,025 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2085] 2024-07-02 16:01:59,025 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2085] 2024-07-02 16:01:59,025 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2085] 2024-07-02 16:01:59,025 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2085] 2024-07-02 16:01:59,025 >> loading file tokenizer.json
[WARNING|logging.py:314] 2024-07-02 16:01:59,110 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:724] 2024-07-02 16:01:59,111 >> loading configuration file /home/qwen/Qwen2-7B-Instruct-GPTQ-Int8/config.json
[INFO|configuration_utils.py:789] 2024-07-02 16:01:59,111 >> Model config Qwen2Config {
  "_name_or_path": "/home/qwen/Qwen2-7B-Instruct-GPTQ-Int8",
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "quantization_config": {
    "batch_size": 1,
    "bits": 8,
    "block_name_to_quantize": null,
    "cache_block_outputs": true,
    "damp_percent": 0.1,
    "dataset": null,
    "desc_act": false,
    "exllama_config": {
      "version": 1
    },
    "group_size": 128,
    "max_input_length": null,
    "model_seqlen": null,
    "module_name_preceding_first_block": null,
    "modules_in_block_to_quantize": null,
    "pad_token_id": null,
    "quant_method": "gptq",
    "sym": true,
    "tokenizer": null,
    "true_sequential": true,
    "use_cuda_fp16": false,
    "use_exllama": true
  },
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": 131072,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.40.0",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 152064
}

07/02/2024 16:01:59 - INFO - llmtuner.model.patcher - Loading 8-bit GPTQ-quantized model.
07/02/2024 16:01:59 - INFO - llmtuner.model.patcher - Using KV cache for faster generation.
[INFO|quantizer_gptq.py:68] 2024-07-02 16:01:59,139 >> We suggest you to set torch_dtype=torch.float16 for better efficiency with GPTQ.
[INFO|modeling_utils.py:3426] 2024-07-02 16:01:59,139 >> loading weights file /home/qwen/Qwen2-7B-Instruct-GPTQ-Int8/model.safetensors.index.json
[INFO|modeling_utils.py:1494] 2024-07-02 16:01:59,139 >> Instantiating Qwen2ForCausalLM model under default dtype torch.float32.
[INFO|configuration_utils.py:928] 2024-07-02 16:01:59,139 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "eos_token_id": 151645
}

/root/anaconda3/envs/LLaMA-Factory/lib/python3.10/site-packages/transformers/modeling_utils.py:4371: FutureWarning: _is_quantized_training_enabled is going to be deprecated in transformers 4.39.0. Please use model.hf_quantizer.is_trainable instead
  warnings.warn(
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00, 2.18it/s]
[INFO|modeling_utils.py:4170] 2024-07-02 16:02:00,963 >> All model checkpoint weights were used when initializing Qwen2ForCausalLM.

[INFO|modeling_utils.py:4178] 2024-07-02 16:02:00,963 >> All the weights of Qwen2ForCausalLM were initialized from the model checkpoint at /home/qwen/Qwen2-7B-Instruct-GPTQ-Int8.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Qwen2ForCausalLM for predictions without further training.
[INFO|modeling_utils.py:3719] 2024-07-02 16:02:00,964 >> Generation config file not found, using a generation config created from the model config.
07/02/2024 16:02:01 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
07/02/2024 16:02:01 - INFO - llmtuner.model.adapter - Loaded adapter(s): saves/Custom/lora/train_2024-07-02-15-10-29
07/02/2024 16:02:01 - INFO - llmtuner.model.loader - all params: 1092722176
Traceback (most recent call last):
  File "/root/anaconda3/envs/LLaMA-Factory/lib/python3.10/site-packages/gradio/queueing.py", line 501, in call_prediction
    output = await route_utils.call_process_api(
  File "/root/anaconda3/envs/LLaMA-Factory/lib/python3.10/site-packages/gradio/route_utils.py", line 253, in call_process_api
    output = await app.get_blocks().process_api(
  File "/root/anaconda3/envs/LLaMA-Factory/lib/python3.10/site-packages/gradio/blocks.py", line 1695, in process_api
    result = await self.call_function(
  File "/root/anaconda3/envs/LLaMA-Factory/lib/python3.10/site-packages/gradio/blocks.py", line 1247, in call_function
    prediction = await utils.async_iteration(iterator)
  File "/root/anaconda3/envs/LLaMA-Factory/lib/python3.10/site-packages/gradio/utils.py", line 516, in async_iteration
    return await iterator.__anext__()
  File "/root/anaconda3/envs/LLaMA-Factory/lib/python3.10/site-packages/gradio/utils.py", line 509, in __anext__
    return await anyio.to_thread.run_sync(
  File "/root/anaconda3/envs/LLaMA-Factory/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "/root/anaconda3/envs/LLaMA-Factory/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2144, in run_sync_in_worker_thread
    return await future
  File "/root/anaconda3/envs/LLaMA-Factory/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 851, in run
    result = context.run(func, *args)
  File "/root/anaconda3/envs/LLaMA-Factory/lib/python3.10/site-packages/gradio/utils.py", line 492, in run_sync_iterator_async
    return next(iterator)
  File "/root/anaconda3/envs/LLaMA-Factory/lib/python3.10/site-packages/gradio/utils.py", line 675, in gen_wrapper
    response = next(iterator)
  File "/home/LLaMA-Factory/src/llmtuner/webui/components/export.py", line 71, in save_model
    export_model(args)
  File "/home/LLaMA-Factory/src/llmtuner/train/tuner.py", line 60, in export_model
    raise ValueError("Cannot merge adapters to a quantized model.")
ValueError: Cannot merge adapters to a quantized model.

Expected behavior

Please help me fix this error so that the merged model can be exported correctly after training.

Others

Fine-tuning completed successfully; the error occurs only at export, with the message ValueError: Cannot merge adapters to a quantized model. The base model I am using is Alibaba's Qwen2-7B-Instruct-GPTQ-Int8.

hiyouga commented 2 months ago

Exporting (merging) is not supported for quantized models.

citisy commented 2 months ago

> Exporting (merging) is not supported for quantized models.

Hi, I also trained a LoRA adapter on Alibaba's Qwen2-7B-Instruct-GPTQ-Int8 (int8-quantized) model. If export is not supported, how can I load the model with transformers or vLLM to run inference?
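One workaround, without merging, is to keep the adapter separate and stack it on the quantized base at load time with transformers + peft. A minimal sketch, assuming the base and adapter paths from the log above (adjust them to your own run):

# Sketch: load the GPTQ int8 base and attach the unmerged LoRA adapter with peft.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_path = "/home/qwen/Qwen2-7B-Instruct-GPTQ-Int8"           # GPTQ-quantized base model
adapter_path = "saves/Custom/lora/train_2024-07-02-15-10-29"   # LoRA adapter saved by LLaMA-Factory

tokenizer = AutoTokenizer.from_pretrained(base_path)
base_model = AutoModelForCausalLM.from_pretrained(base_path, device_map="auto")
model = PeftModel.from_pretrained(base_model, adapter_path)    # adapter stays separate; nothing is merged
model.eval()

# Quick smoke test using the Qwen2 chat template.
prompt_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "你好，请介绍一下你自己。"}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
output_ids = model.generate(prompt_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][prompt_ids.shape[-1]:], skip_special_tokens=True))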

ailungoal commented 2 months ago

Can quantized models be fine-tuned at all?

Ryan-0805 commented 1 month ago

> Exporting (merging) is not supported for quantized models.

> Hi, I also trained a LoRA adapter on Alibaba's Qwen2-7B-Instruct-GPTQ-Int8 (int8-quantized) model. If export is not supported, how can I load the model with transformers or vLLM to run inference?

vLLM can load a LoRA adapter directly.
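For reference, a rough sketch of that approach using vLLM's multi-LoRA support (names and paths are illustrative; whether a LoRA adapter can be applied on top of a GPTQ-quantized base depends on the vLLM version, so treat this as an assumption to verify):

# Sketch: serve the GPTQ base with vLLM and attach the LoRA adapter per request.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="/home/qwen/Qwen2-7B-Instruct-GPTQ-Int8",  # quantized base model
    quantization="gptq",
    enable_lora=True,                                # enable LoRA adapter support
)

# Positional args: adapter name (arbitrary), integer id, local adapter path.
lora = LoRARequest("qwen2-sft", 1, "saves/Custom/lora/train_2024-07-02-15-10-29")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["你好，请介绍一下你自己。"], params, lora_request=lora)
print(outputs[0].outputs[0].text)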

myboyliu2025 commented 1 month ago

Merge the model using the non-quantized official Llama-3-8B-Instruct model:

import json

args = dict(
    model_name_or_path="meta-llama/Meta-Llama-3-8B-Instruct",  # use the non-quantized official Llama-3-8B-Instruct model
    adapter_name_or_path="llama3_lora",            # load the previously saved LoRA adapter
    template="llama3",                             # must match the template used for training
    finetuning_type="lora",                        # must match the setting used for training
    export_dir="llama3_lora_merged",               # directory for the merged model
    export_size=2,                                 # maximum size of each weight shard of the merged model (in GB)
    export_device="cpu",                           # device used for merging: "cpu" or "cuda"
    # export_hub_model_id="your_id/your_model",    # HuggingFace model ID for uploading the merged model (optional)
)

json.dump(args, open("merge_llama3.json", "w", encoding="utf-8"), indent=2)

%cd /content/LLaMA-Factory/

!llamafactory-cli export merge_llama3.json
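A quick way to sanity-check the merged output (an illustrative sketch; the path follows the export_dir in the config above):

# Sketch: load the merged full-precision model like any regular Hugging Face checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("llama3_lora_merged")
model = AutoModelForCausalLM.from_pretrained(
    "llama3_lora_merged", device_map="auto", torch_dtype="auto"
)

# Build a chat-formatted prompt and generate a short reply.
prompt_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello!"}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
output_ids = model.generate(prompt_ids, max_new_tokens=64)
print(tokenizer.decode(output_ids[0][prompt_ids.shape[-1]:], skip_special_tokens=True))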