Repetition is fairly normal with LLMs. Here are some possible solutions:
- Try do_sample=True in the generate API (see the sketch below).
- Change the woq_config args: compute dtype from int8 to bf16.
- Increase the repetition_penalty value.
- Increase the top_k value.
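For the first option, a minimal sketch, assuming the ITREX transformers-style API from this repo's examples (the model name, prompt, and load_in_4bit flag are illustrative, adjust them to your setup):

```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Qwen/Qwen-14B-Chat"  # assumption: the model discussed in this issue
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True,
                                             trust_remote_code=True)

inputs = tokenizer("What is weight-only quantization?", return_tensors="pt").input_ids
# do_sample=True replaces greedy decoding with sampling, which often breaks
# the deterministic loops that produce repeated output
outputs = model.generate(inputs, max_new_tokens=128, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```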
Can you help me to solve this problem? I don't think the duplicated output is caused by the Qwen prompt template format I added. By the way, how should Baichuan's prompt template be written? I tried BAICHUAN_PROMPT_FORMAT = "{prompt} ", but it failed.
@fengenbao Hi, Baichuan does not need an extra prompt template.
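In other words, the raw prompt can go straight to the tokenizer. A minimal sketch, assuming the model and tokenizer were loaded as in the earlier example but from a Baichuan checkpoint (the prompt is illustrative):

```python
# No wrapper template: tokenize the user prompt as-is for Baichuan
prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(inputs, max_new_tokens=64, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```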
> Try to use do_sample=True in generate api? Change woq_config args: compute dtype from int8 to bf16. Increase repetition_penalty value. Increase top_k value. Can you help me to solve this problem?
These are all input args that you can modify.
`do_sample=True` is an arg of the generate API. For example:

```python
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=30, do_sample=True)
```

Please check this README.md: https://github.com/intel/neural-speed/tree/main
For `woq_config`, please check this: https://github.com/intel/intel-extension-for-transformers/blob/main/docs/weightonlyquant.md#llm-runtime-example-code
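A hedged sketch of that dtype change, following the linked weight-only-quantization doc (exact field names may differ across ITREX versions; the model name is an assumption):

```python
from intel_extension_for_transformers.transformers import (
    AutoModelForCausalLM, WeightOnlyQuantConfig,
)

# compute_dtype="bf16" instead of "int8": weights stay quantized, but the
# matmul computation runs in bf16, which usually improves output quality
woq_config = WeightOnlyQuantConfig(weight_dtype="int4", compute_dtype="bf16")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-14B-Chat",            # assumption: the model from this issue
    quantization_config=woq_config,
    trust_remote_code=True,
)
```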
For the `repetition_penalty` and `top_k` values, please check https://github.com/intel/neural-speed/blob/main/docs/advanced_usage.md
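Combined in one call, a sketch with illustrative values (tune them for your workload):

```python
outputs = model.generate(
    inputs,
    max_new_tokens=128,
    do_sample=True,
    top_k=50,                # larger top_k widens the sampling pool
    repetition_penalty=1.2,  # values > 1.0 penalize already-generated tokens
)
```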
Thanks for your attention! I have described this question in detail in another issue, #1148; please help check whether the parameters are set correctly.
Already tracked in issue https://github.com/intel/intel-extension-for-transformers/issues/1148
When I use the python_api_example or streaming_llm Python scripts to run inference with Qwen-14B-Chat, the first two questions are answered normally, but from the third question onward the output keeps repeating itself. I find this strange and can stably reproduce the error; it looks as though the prompts have been repeated all along.
My RAG prompt length = 654