jiafuzha opened this issue 7 months ago
We have fixed it in this PR: https://github.com/intel/neural-speed/pull/202. Please try the newest branch.
@a32543254 It does get fixed for a single generate call. But for continuous batching in ModelServer, the issue still exists. Here is the log after running test_model_server.py.
=======REFERENCE RESULTS FOR COMPARISON=========
=======FOR LOOP GREEDY SEARCH GENERATION RESULTS WITH MHA==========
ARCH_REQ_XCOMP_PERM XTILE_DATA successful.
AVX:1 AVX2:1 AVX512F:1 AVX_VNNI:1 AVX512_VNNI:1 AMX_INT8:1 AMX_BF16:1 AVX512_BF16:1 AVX512_FP16:1
beam_size: 1, do_sample: 0, top_k: 40, top_p: 0.950, continuous_batching: 1, max_request_num: 1, early_stopping: 0, scratch_size_ratio: 1.000
model.cpp: loading model from runtime_outs/ne_llama_q_int4_bestla_cint8_g32.bin
Loading the bin file with NE format...
load_ne_hparams 0.hparams.n_vocab = 32000
load_ne_hparams 1.hparams.n_embd = 4096
load_ne_hparams 2.hparams.n_mult = 256
load_ne_hparams 3.hparams.n_head = 32
load_ne_hparams 4.hparams.n_head_kv = 32
load_ne_hparams 5.hparams.n_layer = 32
load_ne_hparams 6.hparams.n_rot = 128
load_ne_hparams 7.hparams.ftype = 0
load_ne_hparams 8.hparams.max_seq_len = 0
load_ne_hparams 9.hparams.alibi_bias_max = 0.000
load_ne_hparams 10.hparams.clip_qkv = 0.000
load_ne_hparams 11.hparams.par_res = 0
load_ne_hparams 12.hparams.word_embed_proj_dim = 0
load_ne_hparams 13.hparams.do_layer_norm_before = 0
load_ne_hparams 14.hparams.multi_query_group_num = 0
load_ne_hparams 15.hparams.ffn_hidden_size = 11008
load_ne_hparams 16.hparams.inner_hidden_size = 0
load_ne_hparams 17.hparams.n_experts = 0
load_ne_hparams 18.hparams.n_experts_used = 0
load_ne_hparams 19.hparams.n_embd_head_k = 0
load_ne_hparams 20.hparams.norm_eps = 0.000010
load_ne_hparams 21.hparams.freq_base = 10000.000
load_ne_hparams 22.hparams.freq_scale = 1.000
load_ne_hparams 23.hparams.rope_scaling_factor = 0.000
load_ne_hparams 24.hparams.original_max_position_embeddings = 0
load_ne_hparams 25.hparams.use_yarn = 0
load_ne_vocab 26.vocab.bos_token_id = 1
load_ne_vocab 27.vocab.eos_token_id = 2
load_ne_vocab 28.vocab.pad_token_id = 2
load_ne_vocab 29.vocab.sep_token_id = -1
init: n_vocab = 32000
init: n_ctx = 0
init: n_embd = 4096
init: n_mult = 256
init: n_head = 32
init: n_head_kv = 32
init: n_layer = 32
init: n_rot = 128
init: n_ff = 11008
init: n_parts = 1
load: ctx size = 4427.43 MB
load: scratch0 = 4096.00 MB
load: scratch1 = 2048.00 MB
load: scratch2 = 4096.00 MB
load: mem required = 14667.43 MB (+ memory per state)
...................................................................................................
model_init_from_file: support_bestla_kv = 1
model_init_from_file: kv self size = 552.00 MB
ARCH_REQ_XCOMP_PERM XTILE_DATA successful.
What's your favorite animal?
Unterscheidung between different types of animals is difficult, as different people may have different preferences and cultural backgrounds can also play a role in shaping one's preferences. However, some animals are generally considered to be popular or iconic, and these are often the ones that people mention as their favorites.
Some of the most popular animals that people tend to mention as their favorites include:
=======FOR LOOP BEAM SEARCH GENERATION RESULTS WITH MHA==========
Will start to reinit model from bin due to different max request num.
beam_size: 4, do_sample: 0, top_k: 40, top_p: 0.950, continuous_batching: 1, max_request_num: 1, early_stopping: 1, scratch_size_ratio: 1.000
model.cpp: loading model from runtime_outs/ne_llama_q_int4_bestla_cint8_g32.bin
Loading the bin file with NE format...
load_ne_hparams 0.hparams.n_vocab = 32000
load_ne_hparams 1.hparams.n_embd = 4096
load_ne_hparams 2.hparams.n_mult = 256
load_ne_hparams 3.hparams.n_head = 32
load_ne_hparams 4.hparams.n_head_kv = 32
load_ne_hparams 5.hparams.n_layer = 32
load_ne_hparams 6.hparams.n_rot = 128
load_ne_hparams 7.hparams.ftype = 0
load_ne_hparams 8.hparams.max_seq_len = 0
load_ne_hparams 9.hparams.alibi_bias_max = 0.000
load_ne_hparams 10.hparams.clip_qkv = 0.000
load_ne_hparams 11.hparams.par_res = 0
load_ne_hparams 12.hparams.word_embed_proj_dim = 0
load_ne_hparams 13.hparams.do_layer_norm_before = 0
load_ne_hparams 14.hparams.multi_query_group_num = 0
load_ne_hparams 15.hparams.ffn_hidden_size = 11008
load_ne_hparams 16.hparams.inner_hidden_size = 0
load_ne_hparams 17.hparams.n_experts = 0
load_ne_hparams 18.hparams.n_experts_used = 0
load_ne_hparams 19.hparams.n_embd_head_k = 0
load_ne_hparams 20.hparams.norm_eps = 0.000010
load_ne_hparams 21.hparams.freq_base = 10000.000
load_ne_hparams 22.hparams.freq_scale = 1.000
load_ne_hparams 23.hparams.rope_scaling_factor = 0.000
load_ne_hparams 24.hparams.original_max_position_embeddings = 0
load_ne_hparams 25.hparams.use_yarn = 0
load_ne_vocab 26.vocab.bos_token_id = 1
load_ne_vocab 27.vocab.eos_token_id = 2
load_ne_vocab 28.vocab.pad_token_id = 2
load_ne_vocab 29.vocab.sep_token_id = -1
init: n_vocab = 32000
init: n_ctx = 0
init: n_embd = 4096
init: n_mult = 256
init: n_head = 32
init: n_head_kv = 32
init: n_layer = 32
init: n_rot = 128
init: n_ff = 11008
init: n_parts = 1
load: ctx size = 4427.43 MB
load: scratch0 = 16384.00 MB
load: scratch1 = 8192.00 MB
load: scratch2 = 16384.00 MB
load: mem required = 45387.43 MB (+ memory per state)
...................................................................................................
model_init_from_file: support_bestla_kv = 1
model_init_from_file: kv self size = 2208.00 MB
What's your favorite animal? �������������������������������������������������������������������������������������������������������������������������������
Hi, @jiafuzha, sorry for the late response.
The garbled � output in your test_model_server.py script is not related to cont-batching or ModelServer. It just uses a different num_beams, which is 4 compared to your first "single generate call", and in fact it is still a "single generate call".
What does � mean?
I reproduced your issue with num_beams=4, do_sample=False, max_new_tokens=10. The generated tokens (with the prompt) are [[1, 1724, 29915, 29879, 596, 25448, 13019, 29973, 29871, 243, 162, 147, 179, 243, 162, 147, 185, 243]]. Let's pick the last token, 243, and look at what it maps to in the llama2 tokenizer.json.
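A quick way to check the mapping (a sketch on my side; it assumes local access to the HF meta-llama/Llama-2-7b-chat-hf tokenizer):

from transformers import AutoTokenizer

# Look up the trailing byte-level ids from the generated sequence above.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
tail_ids = [243, 162, 147, 179]
print(tokenizer.convert_ids_to_tokens(tail_ids))  # byte-fallback pieces such as '<0xF0>'
print(tokenizer.decode(tail_ids))                 # a complete 4-byte UTF-8 group decodes to one character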
It seems to be a hexadecimal (byte-level) representation, and I don't know why these hexadecimal representations show up here. Is it caused by our C++ beam search, by model_eval, or by the model itself?
Yes, our C++ beam_search is not the same as the transformers one, but the results should not differ much since we follow their Python implementation. For example, you can compare the beam search results between PyTorch FP32 and NS FP32.
Env: INTEL(R) XEON(R) PLATINUM 8580, latest NS and ITREX (both built from source). Remember to clean up the runtime_outs folder when you change quant-related args.
PyTorch:
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokens = tokenizer("What's your favorite animal?", return_tensors="pt").input_ids
model = AutoModelForCausalLM.from_pretrained(model_name, use_neural_speed=False, trust_remote_code=True).eval()
generate_ids = model.generate(tokens, num_beams=4, do_sample=False, max_new_tokens=10)[0]
print(generate_ids)
print(tokenizer.decode(generate_ids, skip_special_tokens=True))
And the output is:
tensor([ 1, 1724, 29915, 29879, 596, 25448, 13019, 29973, 29871, 243, 162, 147, 185, 243, 162, 147, 180, 243])
What's your favorite animal? ���������
NS:
model.init(model_name, use_quant=False)
(... same generation code as above)
And the output is:
[[1, 1724, 29915, 29879, 596, 25448, 13019, 29973, 29871, 243, 162, 147, 185, 243, 162, 147, 180, 243]]
What's your favorite animal? ���������
They are the same! And the FP32 model outputs � (maybe llama2 hallucinates when it meets your prompt...).
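For reference, the NS FP32 side ("same code as above") amounts to roughly the following. This is a sketch based on the neural_speed Python API as it is used elsewhere in this thread, not the exact script:

from transformers import AutoTokenizer
from neural_speed import Model

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokens = tokenizer("What's your favorite animal?", return_tensors="pt").input_ids

model = Model()
model.init(model_name, use_quant=False)  # FP32, no quantization
generate_ids = model.generate(tokens, num_beams=4, do_sample=False, max_new_tokens=10)[0]
print(generate_ids)
print(tokenizer.decode(generate_ids, skip_special_tokens=True))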
Use the ITREX RTN algo instead of NS to quant the model, and generate with transformers. You can refer to this example for how to quant and save a low-bits model from ITREX. The quant cmd is: python run_generation.py --model xxx --woq --woq_algo Rtn --bits 4 --weight_dtype int4_clip --compute_dtype int8 --group_size 32 --benchmark. Once it finishes, you will see the low-bits model in the saved_results folder.
After running:
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_name, use_neural_speed=False, trust_remote_code=True).eval()
generate_ids = model.generate(tokens, num_beams=4, do_sample=False, max_new_tokens=10)[0]
print(tokenizer.decode(generate_ids, skip_special_tokens=True))
You will see:
What's your favorite animal? ���������
Change the RTN quant args. Let's use per-channel quantization this time; the Python cmd is: model.init(model_name, use_quant=True, weight_dtype="int4", compute_dtype="int8", group_size=-1). And the output is: What's your favorite animal? Why? (Submitted 10:
The result seems a bit more reasonable. So I think this issue is more of a model-related problem (RTN quantization, hallucination, etc.). If you still meet this generation problem after trying more models or more quant algorithms (GPTQ, AWQ, AutoRound), please let me know. Thanks.
@zhentaoyu thanks for the detailed response. I just got some new things to share with you.
"What's your favorite animal? 🐰🐶🐱🐷
My favorite animal is the penguin! 🐧 I think they're so cute and funny, and they're great"
tokens: tensor([ 1, 1724, 29915, 29879, 596, 25448, 13019, 29973, 29871, 243, 162, 147, 179, 243, 162, 147, 185, 243, 162, 147, 180, 243, 162, 147, 186, 13, 13, 3421, 25448, 13019, 338, 278, 282, 19636, 262, 29991, 29871, 243, 162, 147, 170, 306, 1348, 896, 29915, 276, 577, 274, 1082, 322, 2090, 1460, 29892, 322, 896, 29915, 276, 2107])
[1, 1724, 29915, 29879, 596, 25448, 13019, 29973, 29871, 243, 162, 147, 179, 243, 162, 147, 185, 243, 162, 147, 180, 243, 162, 147, 186, 243, 162, 147, 183, 243, 162, 147, 184, 243, 162, 147, 185, 243, 162, 147, 180, 243, 162, 147, 186, 243, 162, 147, 183, 243, 162, 147, 184, 243, 162, 147, 185, 243]
By the way, another case of garbled character is with prompt, 'what's your favorite food?'. ns: [1, 1724, 29915, 29879, 596, 25448, 9687, 29973, 29871, 243, 162, 144, 151, 243, 162, 144, 162, 243, 162, 168, 167, 243, 162, 143, 177, 243, 162, 144, 152, 243, 162, 168, 171, 243, 162, 143, 177, 243, 162, 144, 151, 243, 162, 144, 162, 243, 162, 168, 167, 243, 162, 143, 177, 243, 162, 144, 152, 243] What's your favorite food? �������������������������������������������������
vanilla transformers: tensor([ 1, 1724, 29915, 29879, 596, 25448, 9687, 29973, 13, 13, 3421, 25448, 9687, 338, 282, 24990, 29889, 306, 5360, 278, 10296, 310, 278, 2181, 275, 2272, 2181, 504, 29892, 18806, 29891, 6454, 1219, 12507, 346, 29892, 322, 286, 2152, 287, 286, 2112, 29920, 598, 13520, 923, 968, 29889, 739, 29915, 29879, 278, 4922, 13016, 9687, 29889, 13, 13]) What's your favorite food?
My favorite food is pizza. I love the combination of the crispy crust, tangy tomato sauce, and melted mozzarella cheese. It's the perfect comfort food.
- beam search in NS has no repetition_penalty; it only has length_penalty (to prefer longer or shorter sequence results). A small usage sketch follows this list.
- Are the NS results from RTN quant or FP32? An RTN-quantized model may have bad chat quality.
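For example (reusing model and tokens from the FP32 sketch earlier, and assuming length_penalty is exposed through the Python generate kwargs, which I have not verified):

# length_penalty tunes whether beam search prefers longer or shorter sequences.
outputs = model.generate(tokens, num_beams=4, do_sample=False, max_new_tokens=50, length_penalty=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))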
The NS result is from model.init(model_name, use_quant=True, weight_dtype="int4", compute_dtype="int8").
I see. You can use model.init(model_name, use_quant=False) to compare against your vanilla transformers results.
Yes, with FP32 I can get the correct result from NS.
I also tried the code below from https://huggingface.co/docs/transformers/main/en/quantization. It looks like it is also weight-only quant, and it gives me the correct result.
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
quantization_config = QuantoConfig(weights="int8")
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0", quantization_config=quantization_config)
Hi, @jiafuzha, that's a different model_id and weight dtype.
@a32543254 Does NS have some difference in its RTN quant compared to ITREX? I found that the pipeline ITREX RTN QUANT -> NS LOAD -> NS BEAM SEARCH gives more reasonable results.
The ITREX RTN quant follows this example, and the result (with max_new_tokens=50) is like: What's your favorite animal? 🐰🐶🐱🐷 everybody loves animals, and there are so many amazing creatures to choose from! 😍 whether you're a cat person, a
Sorry, I copied the wrong code. I was actually using:
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
p = "What's your favorite food?"
quantization_config = QuantoConfig(weights="int4")
....
...
I got "tensor([ 1, 1724, 29915, 29879, 596, 25448, 9687, 29973, 26833, 338, 282, 24990, 29991, 29871, 243, 162, 144, 152, 243, 162, 148, 143, 396, 1181, 397, 347, 396, 29886, 24990, 396, 29891, 398, 2]) What's your favorite food? Mine is pizza! 🍕👌 #foodie #pizza #yum"
@zhentaoyu @a32543254 any more comments?
Hi, @jiafuzha, our NS RTN quant has some regressions which need to be fixed and aligned (for example, we quant lm_head and token_embedding for llama). We will let you know when we fix it. Thanks.
any update on this?
Hi, @jiafuzha, sorry for the late response. We are tied up with other things at the moment. We will dig into it and let you know if we have any findings. Thanks a lot.
no worries, looking forward to your fix.
from transformers import AutoTokenizer
from neural_speed import Model

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = Model()
model.init(model_name, use_quant=True, weight_dtype="int4", compute_dtype="int8")
tokens = tokenizer("What's your favorite animal?", return_tensors='pt').input_ids
outputs = model.generate(tokens, num_beams=2, do_sample=False, max_new_tokens=10)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)

With the above code, I get the garbled characters below.
"What's your favorite animal? ���������"
If I generate without beam search, I get the expected result:
outputs = model.generate(tokens)
"What's your favorite animal? everybody has a favorite animal, and it's a"