hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Cannot use Quantization bit 4 for prediction #1735

Closed yhyu13 closed 11 months ago

yhyu13 commented 11 months ago

Reminder

Reproduction

Here is my Training & Eval & Prediction script

The LoRA training is done with --quantization_bit 4, so it would be nice if we could run prediction on the trained model through LLaMA-Factory's pipeline with bitsandbytes 4-bit quantization right away.

#!/bin/bash

eval "$(conda shell.bash hook)"
conda activate llama_factory

MODEL_NAME=Qwen-1_8B-Chat
STAGE=sft
EPOCH=.01 #3.0
DATA=alpaca_gpt4_zh
SAVE_PATH=./models/$STAGE/$MODEL_NAME-$STAGE-$DATA-$EPOCH
SAVE_PATH_PREDICT=./models/$STAGE/$MODEL_NAME-$STAGE-$DATA-$EPOCH/Predict
MODEL_PATH=./models/$MODEL_NAME
LoRA_TARGET=c_attn #q_proj,v_proj
TEMPLATE=qwen #default

if [ ! -d $MODEL_PATH ]; then
    echo "Model not found: $MODEL_PATH"
    exit 1  # exit rather than return, since this script is executed, not sourced
fi

if [ ! -d $SAVE_PATH ]; then
    mkdir -p $SAVE_PATH
fi

if [ ! -d $SAVE_PATH_PREDICT ]; then
    mkdir -p $SAVE_PATH_PREDICT
fi

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --seed 42 \
    --stage $STAGE \
    --model_name_or_path $MODEL_PATH \
    --dataset $DATA \
    --val_size .1 \
    --template $TEMPLATE \
    --finetuning_type lora \
    --do_train \
    --lora_target $LoRA_TARGET \
    --output_dir $SAVE_PATH \
    --overwrite_output_dir \
    --overwrite_cache \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs $EPOCH \
    --do_eval \
    --evaluation_strategy epoch \
    --per_device_eval_batch_size 4 \
    --prediction_loss_only \
    --plot_loss \
    --quantization_bit 4 \
    | tee $SAVE_PATH/training_log.txt

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage $STAGE \
    --model_name_or_path $MODEL_PATH \
    --do_predict \
    --max_samples 100 \
    --predict_with_generate \
    --dataset $DATA \
    --template $TEMPLATE \
    --finetuning_type lora \
    --checkpoint_dir $SAVE_PATH \
    --output_dir $SAVE_PATH_PREDICT \
    --per_device_eval_batch_size 4 \
    --quantization_bit 4 \
    | tee $SAVE_PATH_PREDICT/predict_log.txt

The prediction command spits out the following output:

/home/hangyu5/anaconda3/envs/llama_factory/lib/python3.11/site-packages/trl/trainer/ppo_config.py:141: UserWarning: The `optimize_cuda_cache` arguement will be deprecated soon, please use `optimize_device_cache` instead.
  warnings.warn(
12/04/2023 14:00:16 - WARNING - llmtuner.model.parser - Evaluating model in 4/8-bit mode may cause lower scores.
12/04/2023 14:00:16 - WARNING - llmtuner.model.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
[INFO|training_args.py:1345] 2023-12-04 14:00:16,347 >> Found safetensors installation, but --save_safetensors=False. Safetensors should be a preferred weights saving format due to security and performance reasons. If your model cannot be saved by safetensors please feel free to open an issue at https://github.com/huggingface/safetensors!
[INFO|training_args.py:1798] 2023-12-04 14:00:16,347 >> PyTorch: setting up devices
/home/hangyu5/anaconda3/envs/llama_factory/lib/python3.11/site-packages/transformers/training_args.py:1711: FutureWarning: `--push_to_hub_token` is deprecated and will be removed in version 5 of πŸ€— Transformers. Use `--hub_token` instead.
  warnings.warn(
12/04/2023 14:00:16 - INFO - llmtuner.model.parser - Process rank: 0, device: cuda:0, n_gpu: 1
  distributed training: True, compute dtype: None
12/04/2023 14:00:16 - INFO - llmtuner.model.parser - Training/evaluation parameters Seq2SeqTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=False,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=True,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=IntervalStrategy.NO,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
generation_config=None,
generation_max_length=None,
generation_num_beams=None,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=./models/sft/Qwen-1_8B-Chat-sft-alpaca_gpt4_zh-.01/Predict/runs/Dec04_14-00-16_yhyu13fuwuqi,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=500,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
optim=OptimizerNames.ADAMW_TORCH,
optim_args=None,
output_dir=./models/sft/Qwen-1_8B-Chat-sft-alpaca_gpt4_zh-.01/Predict,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=4,
per_device_train_batch_size=8,
predict_with_generate=True,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=True,
report_to=[],
resume_from_checkpoint=None,
run_name=./models/sft/Qwen-1_8B-Chat-sft-alpaca_gpt4_zh-.01/Predict,
save_on_each_node=False,
save_safetensors=False,
save_steps=500,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=True,
sortish_sampler=False,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
12/04/2023 14:00:16 - INFO - llmtuner.data.loader - Loading dataset alpaca_gpt4_data_zh.json...
Using custom data configuration default-d0b7f73168407ceb
Loading Dataset Infos from /home/hangyu5/anaconda3/envs/llama_factory/lib/python3.11/site-packages/datasets/packaged_modules/json
Overwrite dataset info from restored data version if exists.
Loading Dataset info from /home/hangyu5/.cache/huggingface/datasets/json/default-d0b7f73168407ceb/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96
Found cached dataset json (/home/hangyu5/.cache/huggingface/datasets/json/default-d0b7f73168407ceb/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)
Loading Dataset info from /home/hangyu5/.cache/huggingface/datasets/json/default-d0b7f73168407ceb/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96
[INFO|tokenization_utils_base.py:2013] 2023-12-04 14:00:17,307 >> loading file qwen.tiktoken
[INFO|tokenization_utils_base.py:2013] 2023-12-04 14:00:17,307 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2013] 2023-12-04 14:00:17,307 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2013] 2023-12-04 14:00:17,307 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2013] 2023-12-04 14:00:17,307 >> loading file tokenizer.json
[INFO|configuration_utils.py:713] 2023-12-04 14:00:17,767 >> loading configuration file ./models/Qwen-1_8B-Chat/config.json
[INFO|configuration_utils.py:713] 2023-12-04 14:00:17,768 >> loading configuration file ./models/Qwen-1_8B-Chat/config.json
[INFO|configuration_utils.py:775] 2023-12-04 14:00:17,769 >> Model config QWenConfig {
  "_name_or_path": "./models/Qwen-1_8B-Chat",
  "architectures": [
    "QWenLMHeadModel"
  ],
  "attn_dropout_prob": 0.0,
  "auto_map": {
    "AutoConfig": "configuration_qwen.QWenConfig",
    "AutoModelForCausalLM": "modeling_qwen.QWenLMHeadModel"
  },
  "bf16": false,
  "emb_dropout_prob": 0.0,
  "fp16": false,
  "fp32": false,
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "kv_channels": 128,
  "layer_norm_epsilon": 1e-06,
  "max_position_embeddings": 8192,
  "model_type": "qwen",
  "no_bias": true,
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "onnx_safe": null,
  "rotary_emb_base": 10000,
  "rotary_pct": 1.0,
  "scale_attn_weights": true,
  "seq_length": 8192,
  "softmax_in_fp32": false,
  "tie_word_embeddings": false,
  "tokenizer_class": "QWenTokenizer",
  "transformers_version": "4.34.1",
  "use_cache": true,
  "use_cache_kernel": false,
  "use_cache_quantization": false,
  "use_dynamic_ntk": true,
  "use_flash_attn": "auto",
  "use_logn_attn": true,
  "vocab_size": 151936
}

12/04/2023 14:00:17 - INFO - llmtuner.model.loader - Quantizing model to 4 bit.
[INFO|modeling_utils.py:2990] 2023-12-04 14:00:17,792 >> loading weights file ./models/Qwen-1_8B-Chat/model.safetensors.index.json
[INFO|modeling_utils.py:1220] 2023-12-04 14:00:17,792 >> Instantiating QWenLMHeadModel model under default dtype torch.float16.
[INFO|configuration_utils.py:770] 2023-12-04 14:00:17,792 >> Generate config GenerationConfig {}

Try importing flash-attention for faster inference...
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
[INFO|modeling_utils.py:3103] 2023-12-04 14:00:18,107 >> Detected 4-bit loading: activating 4-bit loading for this model
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:01<00:00,  1.49it/s]
[INFO|modeling_utils.py:3775] 2023-12-04 14:00:19,537 >> All model checkpoint weights were used when initializing QWenLMHeadModel.

[INFO|modeling_utils.py:3783] 2023-12-04 14:00:19,538 >> All the weights of QWenLMHeadModel were initialized from the model checkpoint at ./models/Qwen-1_8B-Chat.
If your task is similar to the task the model of the checkpoint was trained on, you can already use QWenLMHeadModel for predictions without further training.
[INFO|configuration_utils.py:728] 2023-12-04 14:00:19,539 >> loading configuration file ./models/Qwen-1_8B-Chat/generation_config.json
[INFO|configuration_utils.py:770] 2023-12-04 14:00:19,540 >> Generate config GenerationConfig {
  "chat_format": "chatml",
  "do_sample": true,
  "eos_token_id": 151643,
  "max_new_tokens": 512,
  "max_window_size": 6144,
  "pad_token_id": 151643,
  "repetition_penalty": 1.1,
  "top_k": 0,
  "top_p": 0.8
}

12/04/2023 14:00:19 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
/home/hangyu5/anaconda3/envs/llama_factory/lib/python3.11/site-packages/peft/tuners/lora/bnb.py:213: UserWarning: Merge lora module to 4-bit linear may get different generations due to rounding errors.
  warnings.warn(
12/04/2023 14:00:20 - INFO - llmtuner.model.adapter - Merged 1 model checkpoint(s).
12/04/2023 14:00:20 - INFO - llmtuner.model.adapter - Loaded fine-tuned model from checkpoint(s): ./models/sft/Qwen-1_8B-Chat-sft-alpaca_gpt4_zh-.01
12/04/2023 14:00:20 - INFO - llmtuner.model.loader - trainable params: 0 || all params: 1836828672 || trainable%: 0.0000
12/04/2023 14:00:20 - INFO - llmtuner.model.loader - This IS expected that the trainable params is 0 if you are using model for inference only.
12/04/2023 14:00:20 - INFO - llmtuner.data.template - Add eos token: <|endoftext|>
12/04/2023 14:00:20 - INFO - llmtuner.data.template - Add pad token: <|endoftext|>
Loading cached processed dataset at /home/hangyu5/.cache/huggingface/datasets/json/default-d0b7f73168407ceb/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96/cache-71179182d092b457.arrow
[INFO|training_args.py:1345] 2023-12-04 14:00:21,076 >> Found safetensors installation, but --save_safetensors=False. Safetensors should be a preferred weights saving format due to security and performance reasons. If your model cannot be saved by safetensors please feel free to open an issue at https://github.com/huggingface/safetensors!
[INFO|training_args.py:1798] 2023-12-04 14:00:21,076 >> PyTorch: setting up devices
input_ids:
[151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 100662, 108136, 101124, 45139, 1773, 151645, 198, 151644, 77091, 198]
inputs:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
保持ε₯εΊ·ηš„δΈ‰δΈͺ提瀺。<|im_end|>
<|im_start|>assistant

Traceback (most recent call last):
  File "/home/hangyu5/Documents/Git-repoMy/AIResearchVault/repo/LLM-infrastructure/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/home/hangyu5/Documents/Git-repoMy/AIResearchVault/repo/LLM-infrastructure/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/home/hangyu5/Documents/Git-repoMy/AIResearchVault/repo/LLM-infrastructure/LLaMA-Factory/src/llmtuner/train/tuner.py", line 26, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/home/hangyu5/Documents/Git-repoMy/AIResearchVault/repo/LLM-infrastructure/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 50, in run_sft
    trainer = CustomSeq2SeqTrainer(
              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/hangyu5/anaconda3/envs/llama_factory/lib/python3.11/site-packages/transformers/trainer_seq2seq.py", line 56, in __init__
    super().__init__(
  File "/home/hangyu5/anaconda3/envs/llama_factory/lib/python3.11/site-packages/transformers/trainer.py", line 412, in __init__
    raise ValueError(
ValueError: You cannot perform fine-tuning on purely quantized models. Please attach trainable adapters on top of the quantized model to correctly perform fine-tuning. Please see: https://huggingface.co/docs/transformers/peft for more details

However, prediction (--predict_with_generate) and training cannot be enabled at the same time either:

Traceback (most recent call last):
  File "/home/hangyu5/Documents/Git-repoMy/AIResearchVault/repo/LLM-infrastructure/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/home/hangyu5/Documents/Git-repoMy/AIResearchVault/repo/LLM-infrastructure/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/home/hangyu5/Documents/Git-repoMy/AIResearchVault/repo/LLM-infrastructure/LLaMA-Factory/src/llmtuner/train/tuner.py", line 20, in run_exp
    model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
                                                                             ^^^^^^^^^^^^^^^^^^^^
  File "/home/hangyu5/Documents/Git-repoMy/AIResearchVault/repo/LLM-infrastructure/LLaMA-Factory/src/llmtuner/model/parser.py", line 112, in get_train_args
    raise ValueError("`predict_with_generate` cannot be set as True while training.")
ValueError: `predict_with_generate` cannot be set as True while training.

Expected behavior

We should find a way to support evaluation and prediction with quantization as well.
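
In the meantime, one workaround that appears consistent with the error above (untested on my side, and it loads the merged LoRA weights in half precision rather than 4-bit, so it needs more memory) would be to drop --quantization_bit from the prediction step of the script above, so the Trainer is not constructed around a purely quantized model:

# prediction step from the script above, with --quantization_bit removed (hypothetical workaround, not verified)
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage $STAGE \
    --model_name_or_path $MODEL_PATH \
    --do_predict \
    --max_samples 100 \
    --predict_with_generate \
    --dataset $DATA \
    --template $TEMPLATE \
    --finetuning_type lora \
    --checkpoint_dir $SAVE_PATH \
    --output_dir $SAVE_PATH_PREDICT \
    --per_device_eval_batch_size 4 \
    | tee $SAVE_PATH_PREDICT/predict_log.txt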

System Info

Ubuntu 22.04, RTX 3090
PyTorch 2.1.1, CUDA 12.1, FlashAttention 2
Latest LLaMA-Factory: https://github.com/hiyouga/LLaMA-Factory/commit/d3dccd0693ede18a99f04780f2fd6e3a89810405
Base model: https://huggingface.co/Qwen/Qwen-1_8B-Chat

Others

No response

hiyouga commented 11 months ago

See https://github.com/hiyouga/LLaMA-Factory/issues/1462. Besides, we recommend using the --fp16 flag together with the --quantization_bit argument.
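
Applied to the prediction step of the script above, that recommendation would look roughly like this (a sketch only; the exact flag combination has not been re-verified here):

# prediction with 4-bit quantization plus --fp16, per the recommendation above (sketch, not re-verified)
CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage $STAGE \
    --model_name_or_path $MODEL_PATH \
    --do_predict \
    --max_samples 100 \
    --predict_with_generate \
    --dataset $DATA \
    --template $TEMPLATE \
    --finetuning_type lora \
    --checkpoint_dir $SAVE_PATH \
    --output_dir $SAVE_PATH_PREDICT \
    --per_device_eval_batch_size 4 \
    --quantization_bit 4 \
    --fp16 \
    | tee $SAVE_PATH_PREDICT/predict_log.txt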