Closed e3oroush closed 2 months ago
Can you share the full code you used here to produce the model and/or the checkpoint? Do you get the same result without specifying --kv-cache-dtype fp8?
We have an accurate llama 2 7B checkpoint without kv cache quant: https://huggingface.co/neuralmagic/Llama-2-7b-chat-hf-FP8
Thanks for your response.
I used the exact same code as for Llama 3 70B and only changed the model. Here's the code:
from datasets import load_dataset
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig
pretrained_model_dir = "meta-llama/Llama-2-70b-hf"
quantized_model_dir = "Meta-Llama-2-70B-hf-FP8-KV"
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token
ds = load_dataset("mgoin/ultrachat_2k", split="train_sft")
examples = [tokenizer.apply_chat_template(batch["messages"], tokenize=False) for batch in ds]
examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")
quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="static",
    ignore_patterns=["re:.*lm_head"],
    kv_cache_quant_targets=("k_proj", "v_proj"),
)
model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
Yes, I get the exact same response with or without --kv-cache-dtype fp8.
@e3oroush I think part of the issue is that you are using a base model, not a chat-finetuned model. In your example you are applying a chat template both during calibration and in the OpenAI API requests. I think you want to start with https://huggingface.co/meta-llama/Llama-2-70b-chat-hf
You're right. But I was expecting it to generate nonsense, not to return an empty string.
I tried the /completions endpoint and the result was the same. I was able to use the dynamic method with the following code, though:
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig
pretrained_model_dir = "meta-llama/Llama-2-70b-hf"
quantized_model_dir = "Meta-Llama-2-70B-hf-FP8-Dynamic"
# Define quantization config with dynamic activation scales
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="dynamic")
# For dynamic activation scales, there is no need for calibration examples
examples = []
# Load the model, quantize, and save checkpoint
model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
And I got a response. However, based on the documentation, it's not the most optimized method.
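The difference between the two activation schemes can be pictured with a small plain-Python sketch. It is only an illustration under stated assumptions, not AutoFP8's or vLLM's implementation: the helper names and sample numbers are made up, scaling is per-tensor, and rounding to actual FP8 values is omitted. The one fixed fact it relies on is that 448 is the largest finite value in the FP8 E4M3 format.

```python
# Illustrative only: plain-Python stand-in for per-tensor FP8 scaling.
# All helper names and sample values below are hypothetical.
FP8_E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format


def amax(values):
    """Largest absolute value in a flat list of floats."""
    return max(abs(v) for v in values)


def static_scale(calibration_batches):
    """One scale computed ahead of time from calibration data, then reused."""
    return max(amax(batch) for batch in calibration_batches) / FP8_E4M3_MAX


def dynamic_scale(activations):
    """A fresh scale computed from the current input on every call."""
    return amax(activations) / FP8_E4M3_MAX


def quantize(values, scale):
    """Scale into the FP8 range and clamp; real kernels would also round."""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]


calibration = [[0.5, -2.0, 1.0], [3.5, -0.25, 0.75]]
runtime_input = [7.0, -1.0, 0.5]  # larger than anything seen in calibration

s_static = static_scale(calibration)      # fixed before serving
s_dynamic = dynamic_scale(runtime_input)  # recomputed for this input

clipped = quantize(runtime_input, s_static)   # the outlier saturates
fitted = quantize(runtime_input, s_dynamic)   # the outlier just fits
```

In this sketch the static scale costs nothing at inference time but clips inputs that exceed the calibration range, while the dynamic scale always fits the input at the price of an extra reduction per call. That trade-off is presumably why the static scheme is described as the more optimized one, and why it depends on good calibration data.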
@e3oroush could you try static activation scales without quantizing the kv cache, i.e. remove kv_cache_quant_targets=("k_proj", "v_proj"), ?
I'll do it and get back to you with the results. Thanks.
Apparently, the static method with calibration has a problem with Llama 2. I tried both meta-llama/Llama-2-70b-hf and meta-llama/Llama-2-70b-chat-hf and got the same result: the server generates something, but the response comes back empty. My guess is that it's because of the tokenizer and the decoding step. However, the same setup with Llama 3 works fine.
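One way to narrow down this kind of symptom is to compare the returned text with the reported token usage. The helper below is a hypothetical sketch, not part of vLLM: it assumes the response is a dict in the usual OpenAI-style /completions shape, with a `choices` list carrying `text` and a `usage` object carrying `completion_tokens`, and the sample dicts are invented for illustration rather than taken from real server output.

```python
def diagnose_completion(response):
    """Classify an OpenAI-style /completions response dict.

    Assumed shape (hypothetical sample, not real server output):
    {"choices": [{"text": "..."}], "usage": {"completion_tokens": 16}}
    """
    choices = response.get("choices") or []
    text = (choices[0].get("text") or "") if choices else ""
    generated = (response.get("usage") or {}).get("completion_tokens", 0)

    if text.strip():
        return "ok"
    if generated > 0:
        # Tokens were sampled but decoded to nothing visible, e.g. only
        # padding or special tokens; this points at the tokens being
        # generated, or at decoding, rather than at the HTTP layer.
        return "tokens generated but decoded text is empty"
    return "no tokens generated"


# Invented examples of the three cases
healthy = {"choices": [{"text": " Paris."}], "usage": {"completion_tokens": 3}}
empty_text = {"choices": [{"text": ""}], "usage": {"completion_tokens": 16}}
nothing = {"choices": [{"text": ""}], "usage": {"completion_tokens": 0}}

results = [diagnose_completion(r) for r in (healthy, empty_text, nothing)]
```

The middle case matches what is described here: the server logs show generation, yet the client sees an empty string.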
@e3oroush I am seeing this as well, here is the checkpoint I made: https://huggingface.co/nm-testing/Llama-2-70b-chat-hf-FP8 This is the only model so far where we've run into this issue, so I'm really curious about what the cause is.
So my guess was correct: it was because of the tokenizer. I made a small change to the script and set max_length=4096 so the tokenizer knows how to truncate the sequences. Here's the final script that worked for me:
from datasets import load_dataset
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig
pretrained_model_dir = "meta-llama/Llama-2-70b-hf"
quantized_model_dir = "Meta-Llama-2-70B-hf-FP8-KV"
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=False)
tokenizer.pad_token = tokenizer.eos_token
ds = load_dataset("mgoin/ultrachat_2k", split="train_sft")
examples = [tokenizer.apply_chat_template(batch["messages"], tokenize=False) for batch in ds]
examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt", max_length=4096).to("cuda")
quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="static",
    ignore_patterns=["re:.*lm_head"],
)
model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
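The effect of passing max_length can be sketched in plain Python. This is not the transformers implementation; `pad_and_truncate` and the token ids are hypothetical. It only mirrors the general behaviour that padding=True pads every sample to the longest one in the batch, and that truncation without an explicit max_length falls back to whatever limit the tokenizer has configured, which has no effect if that limit is very large or unset.

```python
def pad_and_truncate(sequences, pad_id, max_length=None):
    """Mimic pad-to-longest with optional truncation on lists of token ids."""
    if max_length is not None:
        # An explicit limit cuts every sequence before padding
        sequences = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (width - len(seq)) for seq in sequences]


# Hypothetical token ids with lengths 3, 2 and 12
samples = [[1, 2, 3], [4, 5], list(range(10, 22))]

unbounded = pad_and_truncate(samples, pad_id=0)
bounded = pad_and_truncate(samples, pad_id=0, max_length=4)

print(len(unbounded[0]), len(bounded[0]))
```

Without an effective limit, one very long calibration sample sets the width of the whole batch, and the short samples become mostly padding. With max_length set, the batch width is bounded regardless of the longest sample.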
I'm trying to quantize Llama 2 70B using AutoFP8 with calibration samples from ultrachat on a DGX with H100 GPUs and CUDA 12.2.
The model is quantized successfully; however, when I load it in vLLM, the generated response is always empty. The logs show that it's generating something, but it doesn't return anything.
And here is the command I use to start the server:
Any hints would be appreciated.