Repetition is fairly normal with LLMs. Here are some possible solutions:
- Try do_sample=True in the generate API (see the sketch below).
- Change the woq_config args: compute dtype from int8 to bf16.
- Increase the repetition_penalty value.
- Increase the top_k value.
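For the first option, a minimal sketch, assuming the ITREX transformers-style API from this repo's examples (the model name, prompt, and load_in_4bit flag are illustrative, adjust them to your setup):

```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Qwen/Qwen-14B-Chat"  # assumption: the model discussed in this issue
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True,
                                             trust_remote_code=True)

inputs = tokenizer("What is weight-only quantization?", return_tensors="pt").input_ids
# do_sample=True replaces greedy decoding with sampling, which often breaks
# the deterministic loops that produce repeated output
outputs = model.generate(inputs, max_new_tokens=128, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```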
Can you help me to solve this problem? I don't think the duplicated output is caused by the Qwen prompt template format I added. By the way, how should Baichuan's prompt template be written? I tried BAICHUAN_PROMPT_FORMAT = "{prompt} ", but it failed.
@fengenbao Hi, Baichuan does not need an extra prompt template.
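In other words, the raw prompt can go straight to the tokenizer. A minimal sketch, assuming the model and tokenizer were loaded as in the earlier example but from a Baichuan checkpoint (the prompt is illustrative):

```python
# No wrapper template: tokenize the user prompt as-is for Baichuan
prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(inputs, max_new_tokens=64, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```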
> Try to use do_sample=True in generate api? Change woq_config args: compute dtype from int8 to bf16. Increase repetition_penalty value. Increase top_k value. Can you help me to solve this problem?
These are all input args that you can modify.
`do_sample=True` is an arg of the generate API. For example:

```python
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=30, do_sample=True)
```

Please check this README.md: https://github.com/intel/neural-speed/tree/main
For `woq_config`, please check this: https://github.com/intel/intel-extension-for-transformers/blob/main/docs/weightonlyquant.md#llm-runtime-example-code
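A hedged sketch of that dtype change, following the linked weight-only-quantization doc (exact field names may differ across ITREX versions; the model name is an assumption):

```python
from intel_extension_for_transformers.transformers import (
    AutoModelForCausalLM, WeightOnlyQuantConfig,
)

# compute_dtype="bf16" instead of "int8": weights stay quantized, but the
# matmul computation runs in bf16, which usually improves output quality
woq_config = WeightOnlyQuantConfig(weight_dtype="int4", compute_dtype="bf16")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-14B-Chat",            # assumption: the model from this issue
    quantization_config=woq_config,
    trust_remote_code=True,
)
```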
For the `repetition_penalty` and `top_k` values, please check https://github.com/intel/neural-speed/blob/main/docs/advanced_usage.md
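Combined in one call, a sketch with illustrative values (tune them for your workload):

```python
outputs = model.generate(
    inputs,
    max_new_tokens=128,
    do_sample=True,
    top_k=50,                # larger top_k widens the sampling pool
    repetition_penalty=1.2,  # values > 1.0 penalize already-generated tokens
)
```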
Thanks for your attention! I have described this question in detail in another issue, #1148; please help check whether the parameters are set correctly.
Already tracked in issue https://github.com/intel/intel-extension-for-transformers/issues/1148
When I use the python_api_example or streaming_llm Python scripts to run inference with Qwen-14B-Chat, the first two questions are answered normally, but from the third question onward the output keeps repeating itself. I find this strange and can stably reproduce the error; it looks as though the prompts have been repeated all along.
My RAG prompt length = 654