neuralmagic / AutoFP8

Apache License 2.0

I get an empty response after quantizing Llama 2 70B using AutoFP8 with calibration #27

Closed e3oroush closed 2 months ago

e3oroush commented 2 months ago

I'm trying to quantize Llama 2 70B using AutoFP8 with calibration samples from UltraChat, on a DGX with H100 GPUs and CUDA 12.2.

The model quantizes successfully; however, when I load it in vLLM, the generated response is always empty. The logs show it is generating something, but nothing is returned.

>>> vllm.__version__
'0.5.0.post1'
>>> requests.post("http://localhost:8080/v1/chat/completions",json={"messages":[{"content":"what is ai","role":"user"}], "model": "llama", "max_tokens": 200}).json()
{'id': 'cmpl-fc71adce4bc64df5b215b3051d1c698c',
 'object': 'chat.completion',
 'created': 1720516050,
 'model': 'llama',
 'choices': [{'index': 0,
   'message': {'role': 'assistant', 'content': '', 'tool_calls': []},
   'logprobs': None,
   'finish_reason': 'length',
   'stop_reason': None}],
 'usage': {'prompt_tokens': 11, 'total_tokens': 211, 'completion_tokens': 200}}

And here is the command I use to start the server:

python3 -m vllm.entrypoints.openai.api_server --model ./Meta-Llama-2-70B-hf-FP8-KV  --port 8080 --served-model-name llama --kv-cache-dtype fp8

Any hints would be appreciated.

mgoin commented 2 months ago

Can you share the full code you used here to produce the model and/or the checkpoint? Do you get the same result without specifying --kv-cache-dtype fp8?

We have an accurate Llama 2 7B checkpoint without KV cache quantization: https://huggingface.co/neuralmagic/Llama-2-7b-chat-hf-FP8
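For comparison, that checkpoint could be served the same way as your command above, just without --kv-cache-dtype fp8 since it has no KV cache scales (port and served model name here are only the values from your example):

python3 -m vllm.entrypoints.openai.api_server --model neuralmagic/Llama-2-7b-chat-hf-FP8 --port 8080 --served-model-name llama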

e3oroush commented 2 months ago

Thanks for your response.
I used the exact same code as I did for Llama 3 70B; I just changed the model. Here's the code:

from datasets import load_dataset
from transformers import AutoTokenizer

from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "meta-llama/Llama-2-70b-hf"
quantized_model_dir = "Meta-Llama-2-70B-hf-FP8-KV"

# Load the tokenizer and pad with the EOS token
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

# Build calibration examples from ultrachat by rendering each conversation with the chat template
ds = load_dataset("mgoin/ultrachat_2k", split="train_sft")
examples = [tokenizer.apply_chat_template(batch["messages"], tokenize=False) for batch in ds]
examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")

# Static activation scales with FP8 KV cache quantization; lm_head is left unquantized
quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="static",
    ignore_patterns=["re:.*lm_head"],
    kv_cache_quant_targets=("k_proj", "v_proj"),
)

# Load the model, quantize with the calibration examples, and save the checkpoint
model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)

Yes, I get the exact same response with or without --kv-cache-dtype fp8.

mgoin commented 2 months ago

@e3oroush I think part of the issue is that you are using a base model, not a chat-finetuned model. In your example you are applying a chat template both during calibration and in the OpenAI API. I think you want to start with https://huggingface.co/meta-llama/Llama-2-70b-chat-hf
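As a sketch of that change, only the model names at the top of your calibration script would need to be swapped (the output directory name here is just illustrative):

pretrained_model_dir = "meta-llama/Llama-2-70b-chat-hf"  # chat-tuned model, matches the chat template used for calibration
quantized_model_dir = "Meta-Llama-2-70B-chat-hf-FP8-KV"  # illustrative output name

# ... rest of the calibration script unchanged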

e3oroush commented 2 months ago

You're right, but I was expecting it to generate nonsense, not return an empty string.
I tried the /completions endpoint and the result was the same. I was able to use the dynamic method with the following code, though:

from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "meta-llama/Llama-2-70b-hf"
quantized_model_dir = "Meta-Llama-2-70B-hf-FP8-Dynamic"

# Define quantization config with dynamic activation scales
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="dynamic")
# For dynamic activation scales, there is no need for calibration examples
examples = []

# Load the model, quantize, and save checkpoint
model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)

And I got a response. However, based on the documentation, it's not the most optimized method.

mgoin commented 2 months ago

@e3oroush could you try static activation scales without quantizing the KV cache, i.e. remove kv_cache_quant_targets=("k_proj", "v_proj")?
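That would make the quantization config look something like this (just a sketch of the suggested change):

quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="static",
    ignore_patterns=["re:.*lm_head"],
    # kv_cache_quant_targets omitted, so the KV cache is left unquantized
)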

e3oroush commented 2 months ago

I'll do it and get back to you with the results. Thanks.

e3oroush commented 2 months ago

Apparently, the static method with calibration has some problem with Llama 2. I tried both meta-llama/Llama-2-70b-hf and meta-llama/Llama-2-70b-chat-hf and got the same result: the server generates something, but nothing comes back in the response. My guess is that it's because of the tokenizer and the decoding step. However, the same setup with Llama 3 works fine.

mgoin commented 2 months ago

@e3oroush I am seeing this as well; here is the checkpoint I made: https://huggingface.co/nm-testing/Llama-2-70b-chat-hf-FP8. This is the only model we've run into this issue with so far, so I'm really curious what the cause is.

e3oroush commented 2 months ago

So my guess was correct: it was because of the tokenizer. I made a small change to the script and set max_length=4096 so the tokenizer knows where to truncate the sequences. Here's the final script that worked for me:

from datasets import load_dataset
from transformers import AutoTokenizer

from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "meta-llama/Llama-2-70b-hf"
quantized_model_dir = "Meta-Llama-2-70B-hf-FP8-KV"

# Use the slow tokenizer and pad with the EOS token
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=False)
tokenizer.pad_token = tokenizer.eos_token

# Build calibration examples, truncating to the 4096-token context length
ds = load_dataset("mgoin/ultrachat_2k", split="train_sft")
examples = [tokenizer.apply_chat_template(batch["messages"], tokenize=False) for batch in ds]
examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt", max_length=4096).to("cuda")

# Static activation scales; lm_head left unquantized, no KV cache quantization
quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="static",
    ignore_patterns=["re:.*lm_head"],
)

model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
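For reference, a quick sanity check of the resulting checkpoint is to serve it as earlier in this thread (since this script no longer quantizes the KV cache, --kv-cache-dtype fp8 is omitted):

python3 -m vllm.entrypoints.openai.api_server --model ./Meta-Llama-2-70B-hf-FP8-KV --port 8080 --served-model-name llama

and then repeat the earlier request (same placeholder port, model name, and max_tokens as above):

import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"messages": [{"content": "what is ai", "role": "user"}], "model": "llama", "max_tokens": 200},
).json()
print(resp["choices"][0]["message"]["content"])  # non-empty with the max_length fix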