Is there a way to ignore EOS when using onnxruntime-genai? It looks like once one of the requests in a batch reaches the EOS token, it keeps generating the same token in the batch.
Hi @Tabrizian, yes, the batch is currently padded with the EOS token once one sequence reaches EOS while the others keep generating. Can you describe your scenario in a little more detail?
Padding with pad_token_id/eos_token_id is fine. For benchmarking purposes, I was looking for an ignore_eos option so that the model keeps generating new tokens even after EOS is seen. Similar functionality is supported in vLLM (https://github.com/vllm-project/vllm/blob/606625329648e6eff1883e23040adfad82f219cf/vllm/sampling_params.py#L81-L82) and in the Hugging Face APIs.
This is mainly used to create a predictable output sequence length (OSL) when benchmarking different models. Without this parameter it is difficult to know when the model will reach EOS, which makes performance benchmarking complicated.
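For reference, this is roughly how the option is used in vLLM (a sketch only; the exact constructor arguments depend on the vLLM version, and the model name is just an example):

```python
from vllm import LLM, SamplingParams

# ignore_eos=True keeps sampling past the EOS token, so every request
# produces exactly max_tokens output tokens - handy for benchmarking.
params = SamplingParams(max_tokens=512, ignore_eos=True)

llm = LLM(model="facebook/opt-125m")  # example model, not from this thread
outputs = llm.generate(["Hello!"], params)
print(len(outputs[0].outputs[0].token_ids))  # 512
```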
Thank you for sharing more details @Tabrizian. Can you share how to configure this in HuggingFace?
For Hugging Face it is generally handled in the layer above the generate function:

```python
new_tokens = model.generate(input_ids, max_new_tokens=1020)
for token in new_tokens[0]:  # first (and only) sequence; includes the prompt tokens
    if token == tokenizer.eos_token_id:
        ...
```

For each new token we would check whether it is EOS or not, though I think generate can also be configured to produce exactly 1020 new tokens.
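For completeness, one way to force an exact output length on the Hugging Face side is to pair min_new_tokens with max_new_tokens, which suppresses EOS until the minimum is reached (a sketch; the model name is just an example):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example model
model = AutoModelForCausalLM.from_pretrained("gpt2")
input_ids = tokenizer("Hello!", return_tensors="pt").input_ids

# min_new_tokens == max_new_tokens masks out EOS until exactly 1020 new
# tokens have been produced, giving a fixed output length for benchmarking.
outputs = model.generate(input_ids, min_new_tokens=1020, max_new_tokens=1020)
print(outputs.shape[-1] - input_ids.shape[-1])  # 1020
```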
Hi @Tabrizian, using the generate() API you can achieve this functionality by setting min_length and max_length to the same value. For example, using the phi3-qa.py script, you can see that the output length is fixed to the value that you specify.
```
python phi3-qa.py -i 512 -l 512 -g -m ..\..\models\microsoft\phi3-mini-4k\cpu-int4-rtn-block-32-acc-level-4
```
Input: Hello!
Output: Hello! How can I assist you today? Whether you have questions or need help with something, I'm here to support you. Have a great day!
Here are some examples of how I can assist you:
1. Answering general knowledge questions
2. Providing explanations on various topics
3. Assisting with problem-solving
4. Offering guidance on using technology and devices
5. Sharing helpful tips and tricks
6. Giving advice on everyday life situations
7. Assisting with language-related inquiries
I'm ready to help with any of these or other requests you may have!
Note: As an AI, I'm here to provide information and support, but I'm not able to have personal experiences or emotions. However, I'll do my best to make our interaction engaging and helpful! If you have any specific questions or need assistance, feel free to ask.
Remember, I'm here to help you make the most out of your day! Let me know how I can assist you further. Have a wonderful day ahead!
If you're looking for a friendly conversation, feel free to share your thoughts or ask questions about various topics. I'm here to listen and provide information to the best of my abilities.
If you need help with something specific, please let me know! Whether it's a technical issue, a general inquiry, or just a casual chat, I'm here to help. Have a great day!
If you're interested in learning something new or expanding your knowledge, I can provide information on a wide range of subjects. Just let me know what you're curious about!
If you're experiencing any difficulties or challenges, I'm here to offer guidance and support. Whether it's a problem you're facing or a goal you're working towards, I'll do my best to assist you.
Remember, I'm here to help you make the most out of your day! Feel free to ask any questions or share your thoughts. Have a fantastic day ahead!
If you're looking for a friendly conversation, feel free to share your thoughts or ask questions about various topics. I'm here to listen and provide information to the best of my abilities.
If you
Prompt length: 10, New tokens: 502, Time to first: 1.41s, Prompt tokens per second: 7.09 tps, New tokens per second: 10.91 tps
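If you are not using the example script, the same effect can be achieved directly with the onnxruntime-genai Python API. This is only a sketch; the exact method names (for example append_tokens versus setting input ids on the params) vary between releases, and the model path is copied from the command above:

```python
import onnxruntime_genai as og

model = og.Model(r"..\..\models\microsoft\phi3-mini-4k\cpu-int4-rtn-block-32-acc-level-4")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
# min_length == max_length keeps the generator producing tokens past EOS,
# so the output length is fixed at 512 tokens for benchmarking.
params.set_search_options(min_length=512, max_length=512)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("Hello!"))
while not generator.is_done():
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))
```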
This works for me. Thanks @natke!