intel / neural-speed

An innovative library for efficient LLM inference via low-bit quantization
https://github.com/intel/neural-speed
Apache License 2.0

Garbled characters with beam search #215

Open jiafuzha opened 7 months ago

jiafuzha commented 7 months ago

from transformers import AutoTokenizer
from neural_speed import Model

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = Model()
model.init(model_name, use_quant=True, weight_dtype="int4", compute_dtype="int8")

tokens = tokenizer("What's your favorite animal?", return_tensors='pt').input_ids

outputs = model.generate(tokens, num_beams=2, do_sample=False, max_new_tokens=10)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)

With the above code, I get the garbled characters below.

"What's your favorite animal? ���������"

If I generate without beam search, I get the expected result:

outputs = model.generate(tokens)

"What's your favorite animal? everybody has a favorite animal, and it's a"
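
For debugging, the raw token ids behind the '�' output can be printed before decoding; a minimal sketch, reusing the model, tokenizer, and tokens objects defined above:

outputs = model.generate(tokens, num_beams=2, do_sample=False, max_new_tokens=10)
# Raw beam-search ids; the trailing ids here are the ones that decode to '�',
# and they can be compared directly against the ids that transformers produces.
print(outputs[0])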

a32543254 commented 7 months ago

We have fixed it in this PR: https://github.com/intel/neural-speed/pull/202. Please try the newest branch.

jiafuzha commented 7 months ago

@a32543254 It is indeed fixed for a single generate call. But with continuous batching in ModelServer, the issue still exists. Here is the log after running test_model_server.py.

=======REFERENCE RESULTS FOR COMPARISON=========
=======FOR LOOP GREEDY SEARCH GENERATION RESULTS WITH MHA==========
ARCH_REQ_XCOMP_PERM XTILE_DATA successful.
AVX:1 AVX2:1 AVX512F:1 AVX_VNNI:1 AVX512_VNNI:1 AMX_INT8:1 AMX_BF16:1 AVX512_BF16:1 AVX512_FP16:1
beam_size: 1, do_sample: 0, top_k: 40, top_p: 0.950, continuous_batching: 1, max_request_num: 1, early_stopping: 0, scratch_size_ratio: 1.000
model.cpp: loading model from runtime_outs/ne_llama_q_int4_bestla_cint8_g32.bin
Loading the bin file with NE format...
load_ne_hparams 0.hparams.n_vocab = 32000
load_ne_hparams 1.hparams.n_embd = 4096
load_ne_hparams 2.hparams.n_mult = 256
load_ne_hparams 3.hparams.n_head = 32
load_ne_hparams 4.hparams.n_head_kv = 32
load_ne_hparams 5.hparams.n_layer = 32
load_ne_hparams 6.hparams.n_rot = 128
load_ne_hparams 7.hparams.ftype = 0
load_ne_hparams 8.hparams.max_seq_len = 0
load_ne_hparams 9.hparams.alibi_bias_max = 0.000
load_ne_hparams 10.hparams.clip_qkv = 0.000
load_ne_hparams 11.hparams.par_res = 0
load_ne_hparams 12.hparams.word_embed_proj_dim = 0
load_ne_hparams 13.hparams.do_layer_norm_before = 0
load_ne_hparams 14.hparams.multi_query_group_num = 0
load_ne_hparams 15.hparams.ffn_hidden_size = 11008
load_ne_hparams 16.hparams.inner_hidden_size = 0
load_ne_hparams 17.hparams.n_experts = 0
load_ne_hparams 18.hparams.n_experts_used = 0
load_ne_hparams 19.hparams.n_embd_head_k = 0
load_ne_hparams 20.hparams.norm_eps = 0.000010
load_ne_hparams 21.hparams.freq_base = 10000.000
load_ne_hparams 22.hparams.freq_scale = 1.000
load_ne_hparams 23.hparams.rope_scaling_factor = 0.000
load_ne_hparams 24.hparams.original_max_position_embeddings = 0
load_ne_hparams 25.hparams.use_yarn = 0
load_ne_vocab 26.vocab.bos_token_id = 1
load_ne_vocab 27.vocab.eos_token_id = 2
load_ne_vocab 28.vocab.pad_token_id = 2
load_ne_vocab 29.vocab.sep_token_id = -1
init: n_vocab = 32000
init: n_ctx = 0
init: n_embd = 4096
init: n_mult = 256
init: n_head = 32
init: n_head_kv = 32
init: n_layer = 32
init: n_rot = 128
init: n_ff = 11008
init: n_parts = 1
load: ctx size = 4427.43 MB
load: scratch0 = 4096.00 MB
load: scratch1 = 2048.00 MB
load: scratch2 = 4096.00 MB
load: mem required = 14667.43 MB (+ memory per state)
...................................................................................................
model_init_from_file: support_bestla_kv = 1
model_init_from_file: kv self size = 552.00 MB
ARCH_REQ_XCOMP_PERM XTILE_DATA successful.

What's your favorite animal? Unterscheidung between different types of animals is difficult, as different people may have different preferences and cultural backgrounds can also play a role in shaping one's preferences. However, some animals are generally considered to be popular or iconic, and these are often the ones that people mention as their favorites.

Some of the most popular animals that people tend to mention as their favorites include:

  1. Dogs: Many people consider dogs to be their favorite animals, and it's not hard to see why. Dogs are known for their loyalty, affection, and playful nature, making them

    =======FOR LOOP BEAM SEARCH GENERATION RESULTS WITH MHA==========
    Will start to reinit model from bin due to different max request num.
    beam_size: 4, do_sample: 0, top_k: 40, top_p: 0.950, continuous_batching: 1, max_request_num: 1, early_stopping: 1, scratch_size_ratio: 1.000
    model.cpp: loading model from runtime_outs/ne_llama_q_int4_bestla_cint8_g32.bin
    Loading the bin file with NE format...
    load_ne_hparams 0.hparams.n_vocab = 32000
    load_ne_hparams 1.hparams.n_embd = 4096
    load_ne_hparams 2.hparams.n_mult = 256
    load_ne_hparams 3.hparams.n_head = 32
    load_ne_hparams 4.hparams.n_head_kv = 32
    load_ne_hparams 5.hparams.n_layer = 32
    load_ne_hparams 6.hparams.n_rot = 128
    load_ne_hparams 7.hparams.ftype = 0
    load_ne_hparams 8.hparams.max_seq_len = 0
    load_ne_hparams 9.hparams.alibi_bias_max = 0.000
    load_ne_hparams 10.hparams.clip_qkv = 0.000
    load_ne_hparams 11.hparams.par_res = 0
    load_ne_hparams 12.hparams.word_embed_proj_dim = 0
    load_ne_hparams 13.hparams.do_layer_norm_before = 0
    load_ne_hparams 14.hparams.multi_query_group_num = 0
    load_ne_hparams 15.hparams.ffn_hidden_size = 11008
    load_ne_hparams 16.hparams.inner_hidden_size = 0
    load_ne_hparams 17.hparams.n_experts = 0
    load_ne_hparams 18.hparams.n_experts_used = 0
    load_ne_hparams 19.hparams.n_embd_head_k = 0
    load_ne_hparams 20.hparams.norm_eps = 0.000010
    load_ne_hparams 21.hparams.freq_base = 10000.000
    load_ne_hparams 22.hparams.freq_scale = 1.000
    load_ne_hparams 23.hparams.rope_scaling_factor = 0.000
    load_ne_hparams 24.hparams.original_max_position_embeddings = 0
    load_ne_hparams 25.hparams.use_yarn = 0
    load_ne_vocab 26.vocab.bos_token_id = 1
    load_ne_vocab 27.vocab.eos_token_id = 2
    load_ne_vocab 28.vocab.pad_token_id = 2
    load_ne_vocab 29.vocab.sep_token_id = -1
    init: n_vocab = 32000
    init: n_ctx = 0
    init: n_embd = 4096
    init: n_mult = 256
    init: n_head = 32
    init: n_head_kv = 32
    init: n_layer = 32
    init: n_rot = 128
    init: n_ff = 11008
    init: n_parts = 1
    load: ctx size = 4427.43 MB
    load: scratch0 = 16384.00 MB
    load: scratch1 = 8192.00 MB
    load: scratch2 = 16384.00 MB
    load: mem required = 45387.43 MB (+ memory per state)
    ...................................................................................................
    model_init_from_file: support_bestla_kv = 1
    model_init_from_file: kv self size = 2208.00 MB

    What's your favorite animal? �������������������������������������������������������������������������������������������������������������������������������

zhentaoyu commented 7 months ago

Hi, @jiafuzha, sorry for the late response.

  1. The garbled output in your test_model_server.py run is not related to continuous batching or ModelServer. It just uses a different num_beams (4) compared with your first "single generate call"; in fact, it is still a single generate call.
  2. What do the � characters mean? I reproduced your issue with num_beams=4, do_sample=False, max_new_tokens=10. The generated tokens (with the prompt) are [[1, 1724, 29915, 29879, 596, 25448, 13019, 29973, 29871, 243, 162, 147, 179, 243, 162, 147, 185, 243, 162, 147, 185, 243]]. Let's pick the last token, 243: in the llama2 tokenizer.json it maps to a hexadecimal byte entry (screenshot omitted). It seems to be a hexadecimal (byte-level) representation, but I'm not very familiar with it, so I don't know why these hexadecimal tokens are generated (see the tokenizer-inspection sketch at the end of this comment).
  3. Is it caused by our C++ beam search, by model_eval, or by the model itself?

    • Yes, our C++ beam_search is not exactly the same as transformers', but the results should not differ much since we follow their Python implementation. For example, you can compare the beam search results between PyTorch FP32 and NS FP32 (env: INTEL(R) XEON(R) PLATINUM 8580, latest NS and ITREX, both built from source; remember to clean up the runtime_outs folder when you change quant-related args).

      PyTorch:

      from intel_extension_for_transformers.transformers import AutoModelForCausalLM

      # model_name, tokenizer and tokens are the same as in the first snippet of this issue
      model = AutoModelForCausalLM.from_pretrained(model_name, use_neural_speed=False, trust_remote_code=True).eval()
      generate_ids = model.generate(tokens, num_beams=4, do_sample=False, max_new_tokens=10)[0]
      print(generate_ids)
      print(tokenizer.decode(generate_ids, skip_special_tokens=True))

      And it outputs:

      tensor([ 1, 1724, 29915, 29879, 596, 25448, 13019, 29973, 29871, 243, 162, 147, 185, 243, 162, 147, 180, 243])
      What's your favorite animal? ���������

      NS:

      model.init(model_name, use_quant=False)
      ....same code as above

      And it outputs:

      [[1, 1724, 29915, 29879, 596, 25448, 13019, 29973, 29871, 243, 162, 147, 185, 243, 162, 147, 180, 243]]
      What's your favorite animal? ���������

      They are the same! So that is what the FP32 model outputs (maybe llama2 hallucinates when it meets your prompt...).

    • Use the ITREX RTN algorithm instead of NS to quantize the model and generate with transformers. You can refer to this example for how to quantize and save a low-bit model with ITREX. The quant cmd is: python run_generation.py --model xxx --woq --woq_algo Rtn --bits 4 --weight_dtype int4_clip --compute_dtype int8 --group_size 32 --benchmark. Once it finishes, you will see the low-bit model in the saved_results folder.
      After running:

      from intel_extension_for_transformers.transformers import AutoModelForCausalLM

      model = AutoModelForCausalLM.from_pretrained(model_name, use_neural_speed=False, trust_remote_code=True).eval()
      generate_ids = model.generate(tokens, num_beams=4, do_sample=False, max_new_tokens=10)[0]
      print(tokenizer.decode(generate_ids, skip_special_tokens=True))

    You will see: What's your favorite animal? ���������

    • Change the RTN quant args. Let's use per-channel quantization this time. The Python call is: model.init(model_name, use_quant=True, weight_dtype="int4", compute_dtype="int8", group_size=-1). And the output is: "What's your favorite animal? Why? (Submitted 10:". The result seems a bit more reasonable.

So I think this issue is more of a model-related problem (RTN quantization, hallucination, etc.). If you still hit this generation problem after trying more models or more quant algorithms (gptq, awq, auto-round), please let me know. Thanks.
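
As a follow-up to item 2 above, those ids look like SentencePiece byte-fallback tokens. A small inspection sketch, assuming the standard Hugging Face Llama-2 tokenizer (the exact token strings printed are illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# The "hexadecimal" ids map to byte-fallback pieces such as '<0xF0>', i.e. single
# UTF-8 bytes rather than whole characters.
print(tokenizer.convert_ids_to_tokens([243, 162, 147, 179]))

# A complete group of bytes decodes to a readable character (an emoji here), while a
# group cut short by max_new_tokens decodes to the '�' replacement character.
print(tokenizer.decode([243, 162, 147, 179]))   # complete UTF-8 sequence
print(tokenizer.decode([243, 162, 147]))        # truncated sequence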

jiafuzha commented 7 months ago

@zhentaoyu thanks for the detailed response. I just got some new things to share with you.

  1. I am able to get the correct result after changing max_new_tokens from 10 to 50, with both vanilla transformers and ITREX.

"What's your favorite animal? 🐰🐶🐱🐷

My favorite animal is the penguin! 🐧 I think they're so cute and funny, and they're great"

tokens: tensor([ 1, 1724, 29915, 29879, 596, 25448, 13019, 29973, 29871, 243, 162, 147, 179, 243, 162, 147, 185, 243, 162, 147, 180, 243, 162, 147, 186, 13, 13, 3421, 25448, 13019, 338, 278, 282, 19636, 262, 29991, 29871, 243, 162, 147, 170, 306, 1348, 896, 29915, 276, 577, 274, 1082, 322, 2090, 1460, 29892, 322, 896, 29915, 276, 2107])

  2. With Neural Speed, however, I still get garbled characters. After checking the token IDs, I found most of the tokens are just repeating themselves (see the decoding sketch after the ID list below). Do you think it's related to the lack of a repetition penalty in NS?

[1, 1724, 29915, 29879, 596, 25448, 13019, 29973, 29871, 243, 162, 147, 179, 243, 162, 147, 185, 243, 162, 147, 180, 243, 162, 147, 186, 243, 162, 147, 183, 243, 162, 147, 184, 243, 162, 147, 185, 243, 162, 147, 180, 243, 162, 147, 186, 243, 162, 147, 183, 243, 162, 147, 184, 243, 162, 147, 185, 243]
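
To make the repetition visible, the ids after the prompt can be grouped and decoded; a sketch assuming the same Llama-2 tokenizer (the grouping into chunks of four byte tokens is an assumption based on the ids above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# NS beam-search ids after the prompt, copied from the list above.
generated = [243, 162, 147, 179, 243, 162, 147, 185, 243, 162, 147, 180, 243, 162, 147, 186,
             243, 162, 147, 183, 243, 162, 147, 184, 243, 162, 147, 185, 243, 162, 147, 180,
             243, 162, 147, 186, 243, 162, 147, 183, 243, 162, 147, 184, 243, 162, 147, 185,
             243]

# Each group of four byte-fallback tokens is one UTF-8 character; printing the groups
# shows the same few characters cycling, with one incomplete group at the very end.
for start in range(0, len(generated), 4):
    print(tokenizer.decode(generated[start:start + 4]), end=" ")
print()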

jiafuzha commented 7 months ago

By the way, another case of garbled characters is with the prompt "What's your favorite food?".

NS: [1, 1724, 29915, 29879, 596, 25448, 9687, 29973, 29871, 243, 162, 144, 151, 243, 162, 144, 162, 243, 162, 168, 167, 243, 162, 143, 177, 243, 162, 144, 152, 243, 162, 168, 171, 243, 162, 143, 177, 243, 162, 144, 151, 243, 162, 144, 162, 243, 162, 168, 167, 243, 162, 143, 177, 243, 162, 144, 152, 243]

What's your favorite food? �������������������������������������������������

Vanilla transformers: tensor([ 1, 1724, 29915, 29879, 596, 25448, 9687, 29973, 13, 13, 3421, 25448, 9687, 338, 282, 24990, 29889, 306, 5360, 278, 10296, 310, 278, 2181, 275, 2272, 2181, 504, 29892, 18806, 29891, 6454, 1219, 12507, 346, 29892, 322, 286, 2152, 287, 286, 2112, 29920, 598, 13520, 923, 968, 29889, 739, 29915, 29879, 278, 4922, 13016, 9687, 29889, 13, 13])

What's your favorite food?

My favorite food is pizza. I love the combination of the crispy crust, tangy tomato sauce, and melted mozzarella cheese. It's the perfect comfort food.

zhentaoyu commented 7 months ago

  1. Are the NS results from the RTN-quantized model or from FP32? The RTN-quantized model may have poor chat quality.
  2. Beam search in NS has no repetition penalty; it only has length_penalty (to prefer longer or shorter sequences). For comparison, the sketch below shows the repetition controls that the plain transformers API exposes.
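
This is Hugging Face transformers code, not the NS API; `model`, `tokenizer`, and `tokens` are assumed to be the vanilla transformers objects from the earlier comparison snippet:

# Transformers beam search with repetition controls, for comparison with NS beam
# search (which currently only exposes length_penalty).
generate_ids = model.generate(
    tokens,
    num_beams=4,
    do_sample=False,
    max_new_tokens=50,
    length_penalty=1.0,       # beam-score length normalization (the knob NS does expose)
    repetition_penalty=1.2,   # down-weights tokens that have already been generated
    no_repeat_ngram_size=3,   # forbids repeating any 3-gram
)[0]
print(tokenizer.decode(generate_ids, skip_special_tokens=True))
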
jiafuzha commented 7 months ago

  1. Are the NS results from the RTN-quantized model or from FP32? The RTN-quantized model may have poor chat quality.
  2. Beam search in NS has no repetition penalty; it only has length_penalty (to prefer longer or shorter sequences).

NS result is from "model.init(model_name, use_quant=True, weight_dtype="int4", compute_dtype="int8")".

zhentaoyu commented 7 months ago

I see. You can use model.init(model_name, use_quant=False) to compare with your vanilla transformers results.

jiafuzha commented 7 months ago

Yes, with FP32 I can get the correct result from NS.

I also tried the code below from https://huggingface.co/docs/transformers/main/en/quantization. It is also weight-only quantization and it gives me the correct result.

from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
quantization_config = QuantoConfig(weights="int8")
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0", quantization_config=quantization_config)

zhentaoyu commented 7 months ago

Yes, with FP32 I can get the correct result from NS.

I also tried the code below from https://huggingface.co/docs/transformers/main/en/quantization. It is also weight-only quantization and it gives me the correct result.

from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
quantization_config = QuantoConfig(weights="int8")
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0", quantization_config=quantization_config)

Hi @jiafuzha, it's a different model_id and weight dtype.

@a32543254 Does NS have any differences in its RTN quant compared to ITREX? I found that the pipeline ITREX RTN QUANT -> NS LOAD -> NS BEAM SEARCH gives more reasonable results. The ITREX RTN QUANT follows this example. With max_new_tokens=50, the result is like: What's your favorite animal? 🐰🐶🐱🐷 everybody loves animals, and there are so many amazing creatures to choose from! 😍 whether you're a cat person, a

jiafuzha commented 7 months ago

Hi @jiafuzha, it's a different model_id and weight dtype.

@a32543254 Does NS have any differences in its RTN quant compared to ITREX? I found that the pipeline ITREX RTN QUANT -> NS LOAD -> NS BEAM SEARCH gives more reasonable results. The ITREX RTN QUANT follows this example. With max_new_tokens=50, the result is like: What's your favorite animal? 🐰🐶🐱🐷 everybody loves animals, and there are so many amazing creatures to choose from! 😍 whether you're a cat person, a

Sorry, I copied the wrong code. I was actually using:

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
p = "What's your favorite food?"
quantization_config = QuantoConfig(weights="int4")
....
...

I got "tensor([ 1, 1724, 29915, 29879, 596, 25448, 9687, 29973, 26833, 338, 282, 24990, 29991, 29871, 243, 162, 144, 152, 243, 162, 148, 143, 396, 1181, 397, 347, 396, 29886, 24990, 396, 29891, 398, 2]) What's your favorite food? Mine is pizza! 🍕👌 #foodie #pizza #yum"

jiafuzha commented 7 months ago

@zhentaoyu @a32543254 any more comments?

zhentaoyu commented 7 months ago

Hi @jiafuzha, our NS RTN quant has some regressions that need to be fixed and aligned (for example, we quantize lm_head and token_embedding for llama). We will let you know when we fix it. Thanks.

jiafuzha commented 6 months ago

any update on this?

zhentaoyu commented 6 months ago

Hi @jiafuzha, sorry for the late response. We have been tied up with other things recently. We will dig into it and let you know if we have any findings. Thanks a lot.

jiafuzha commented 6 months ago

Hi @jiafuzha, sorry for the late response. We have been tied up with other things recently. We will dig into it and let you know if we have any findings. Thanks a lot.

no worries, looking forward to your fix.