jackylu0124 opened 5 months ago
I have the same issue. I get similar nonsensical output when using either the 4k model or the 128k model via ONNX Runtime with a long user prompt and the search option max_length set to 3000. With a shorter user prompt the output is as expected. Using the same user prompt on https://ai.azure.com/explore/models/Phi-3-mini-128k-instruct/version/7/registry/azureml delivers correct results, so the model itself should not be the cause of the problem.
Tried with versions 0.2.0 and 0.3.0-rc2 of the library Microsoft.ML.OnnxRuntimeGenAI.DirectML.
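For anyone who wants to poke at this from Python, a minimal sketch along the lines of the library's phi3-qa.py example should reproduce the same behavior (this uses the 0.3.0-era onnxruntime_genai Python API; the long conversation text is a placeholder you would fill in, and the model path is illustrative):

```python
import onnxruntime_genai as og

model = og.Model("Phi-3-mini-4k-instruct-onnx/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4")
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Placeholder: a long multi-turn conversation, well over 2000 tokens.
long_conversation = "..."
prompt = f"<|user|>\n{long_conversation} <|end|>\n<|assistant|>"

params = og.GeneratorParams(model)
params.set_search_options(max_length=3000)  # above the 2048 default used in the example
params.input_ids = tokenizer.encode(prompt)

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    # With long prompts the decoded text degrades into gibberish past roughly 2k tokens.
    print(tokenizer_stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
```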
P.S.: Thank you for this nice library; it is awesome to be able to run an SLM locally so easily.
A quick follow-up on this issue. I would really appreciate any help or insights!
I am seeing a similar issue when running on CPU
Hi @jackylu0124, @bkaruman, @AMehlem, I have reproduced your issue on CPU. We will investigate.
Hi @natke, thank you for the update, I appreciate it.
I get the same issue if the conversation history goes above 2k tokens (approximately) using the .NET NuGet package Microsoft.ML.OnnxRuntimeGenAI version 0.3.0. I am using Phi-3-mini-4k-instruct-onnx. As a workaround I truncate the conversation history (to a MaxTokenLength of 4096 - 2500) and the issue goes away. I don't cut in the middle of tokens; rather, I drop whole conversation turns until the history falls under the limit (a rough sketch is at the end of this comment).
EDIT: I can reproduce this with Ollama version 0.2.1 and its Phi-3 model phi3:mini. I get responses that do not fully reflect the conversation.
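A minimal sketch of that truncation approach (illustrative names; the token budget below is just an example, the idea is to keep the rendered prompt well under the point where output degrades):

```python
# Drop the oldest turns until the encoded prompt fits under a token budget.
# `turns` is a list of (role, text) pairs, with role "user" or "assistant".
def build_prompt(turns):
    chat = "".join(f"<|{role}|>\n{text} <|end|>\n" for role, text in turns)
    return chat + "<|assistant|>"

def truncate_history(turns, tokenizer, budget=1500):
    turns = list(turns)
    while turns and len(tokenizer.encode(build_prompt(turns))) > budget:
        turns.pop(0)  # drop whole turns from the start, never partial tokens
    return turns
```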
This should be resolved with the PR: https://github.com/microsoft/onnxruntime-genai/pull/802
Hello @baijumeswani, thank you for the update. Does this mean that we have to regenerate our ONNX models? I have also used the model builder for the Phi-3.5 model.
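For reference, a model builder invocation of that sort looks roughly like the following (a sketch; the output path is illustrative, and the exact precision and execution-provider flags are documented in the model builder README):

```
python -m onnxruntime_genai.models.builder \
    -m microsoft/Phi-3.5-mini-instruct \
    -o ./phi-3.5-mini-instruct-onnx \
    -p int4 \
    -e cpu
```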
I was running experiments with the HotpotQA dataset. I also observe that the model's performance drops considerably if the number of tokens exceeds ~2500.
(Figure: Y-axis: log-probability of generating the correct question; X-axis: number of tokens in the question.)
Hello @MaxAkbar, I tested the latest onnxruntime-genai package with the Phi-3.5 model, and it works fine when the input prompt is longer than 2k tokens. Can you please try that?
Thank you @apsonawane, I also created an ONNX model but will run some tests this weekend.
I've been experiencing a similar issue when fine-tuning Phi-3 and Phi-3.5 models as well. They produce a lot of gibberish tokens at the end, even after fine-tuning.
Looks like this has been solved with the latest ONNX release, but fine-tuning these ONNX models by converting them back to PyTorch is really tricky. Any solution for that?
I am running the `Phi-3-mini-4k-instruct-onnx` model on a desktop CPU, and one behavior I have noticed is that, after the back-and-forth conversation grows longer than half of the context window (in other words, longer than 2048 tokens), the model starts generating nonsensical and unreadable results. This is strange because I would expect the model to keep generating readable results at least until the conversation gets close to the end of the context window. I would really appreciate any insights into this behavior as well as any ways I can fix it. Thanks a lot in advance!

To reproduce this issue, you can run the example `phi3-qa.py` script (https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi3-qa.py) with a command like `python phi3-qa.py -m .\Phi-3-mini-4k-instruct-onnx\cpu_and_mobile\cpu-int4-rtn-block-32-acc-level-4`, with the if-block that sets `search_options['max_length'] = 2048` commented out to allow longer input, and with the `prompt` in the line `input_tokens = tokenizer.encode(prompt)` (https://github.com/microsoft/onnxruntime-genai/blob/6be88357c924de71da67f96a70b3c0f45803249f/examples/python/phi3-qa.py#L38) replaced with the following conversation string (a back-and-forth between the user and the assistant that corresponds to 2271 tokens):

For greater readability, the conversation string above corresponds to the following dialog:

and then the model will generate the following nonsensical/unreadable output: