microsoft / onnxruntime-genai

Generative AI extensions for onnxruntime
MIT License

generator.get_next_token always returns zero #776

[Open] SreenilaRajesh opened this issue 1 month ago

SreenilaRajesh commented 1 month ago

generator.get_next_token always returns zero

generator.get_output("logits").squeeze() also returns an array of zeros.

This results in a blank output every time.
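For context, these calls sit inside a generation loop along the following lines (a minimal sketch based on the phi3-qa.py example; method names per ONNX Runtime GenAI v0.3.x, and the model path here is hypothetical):

import onnxruntime_genai as og

model = og.Model("./phi3_mini_128k_cpu_int4_rtn_block_32")  # hypothetical local model folder
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.input_ids = tokenizer.encode("<|user|>\ntell me a joke <|end|>\n<|assistant|>")

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]  # reported to always be 0
    print(tokenizer_stream.decode(new_token), end="", flush=True)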

model used : phi3-mini-128k-instruct-cpu-int4-rtn-block-32.onnx

cc @kunal-vaishnavi

yufenglee commented 1 month ago

@SreenilaRajesh , could you please share more information to repro, like the script?

kunal-vaishnavi commented 1 month ago

In addition to a repro script, what was the prompt that you tried and what version of ONNX Runtime GenAI are you using?

SreenilaRajesh commented 1 month ago

@yufenglee @kunal-vaishnavi

> @SreenilaRajesh , could you please share more information to repro, like the script?

I'm trying the below script with an offline model, passing the model folder on the command line (-m).

Script: https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi3-qa.py
Model: phi3-mini-128k-instruct-cpu-int4-rtn-block-32.onnx from https://huggingface.co/microsoft/Phi-3-mini-128k-instruct-onnx/tree/main/cpu_and_mobile/cpu-int4-rtn-block-32

The weight matrices seem to be empty when visualizing the model with https://netron.app/

> In addition to a repro script, what was the prompt that you tried and what version of ONNX Runtime GenAI are you using?

Testing with simple prompts like "tell me a joke" and "tell a short story". ONNX Runtime GenAI version: 0.3.0.

kunal-vaishnavi commented 3 weeks ago

> I'm trying the below script with an offline model, passing the model folder on the command line (-m). Script: https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi3-qa.py Model: phi3-mini-128k-instruct-cpu-int4-rtn-block-32.onnx https://huggingface.co/microsoft/Phi-3-mini-128k-instruct-onnx/tree/main/cpu_and_mobile/cpu-int4-rtn-block-32
>
> testing with simple prompts like "tell me a joke", "tell a short story"

Using the above links, I am able to run the model and get outputs for your prompts with ONNX Runtime GenAI v0.3.0.

$ python3 phi3-qa.py -m ./phi3_mini_128k_cpu_int4_rtn_block_32
Input: tell me a joke

Output:  Why don't scientists trust atoms?

Because they make up everything!

Input: tell a short story

Output: In a quaint village bordered by a lush forest, there lived a young girl named Lily who had the unique ability to communicate with animals. Her gift was a secret, known only to her and the wise old owl, Oliver, who watched over her. One day, a mysterious illness befell the animals, causing them to fall into a deep slumber.

Lily, determined to save her friends, embarked on a quest to find the rare Moonflower, a magical plant said to cure any ailment. With Oliver' funny yet wise guidance, she journeyed through the enchanted forest, facing challenges and meeting creatures who tested her courage and kindness.

After overcoming trials and earning the trust of the forest's guardian, a majestic stag named Thunder, Lily finally found the Moonflower hidden in a clearing bathed in silver moonlight. With the flower in hand, she returned to her village, where the wise healer, an old woman named Elara, prepared a potion using the Moonflower's essence.

The potion was successful, and as the first rays of dawn kissed the sky, the animals awoke from their slumber. Grateful and relieved, they thanked Lily for her bravery and selflessness. From that day on, Lily's secret gift was no longer a secret, and she became the village's cherished protector, always ready to listen and help her animal friends.

There could be multiple reasons why the outputs are empty.

  1. Can you share your machine's details? It is possible that your setup is not supported. Here's a simple Python script to provide the necessary info.
    import platform
    print(platform.machine())
    print(platform.version())
    print(platform.platform())
    print(platform.system())
    print(platform.processor())
  2. What steps did you perform to download the model? There are multiple methods to download (e.g. clicking the download button on Hugging Face for each file, using git clone, using huggingface-cli, etc.) and each method can have different problems.
  3. What packages are in your environment? It is possible that the wrong ONNX Runtime version is being referenced because ONNX Runtime is currently bundled within the ONNX Runtime GenAI package. (A quick check is shown below.)
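One quick way to spot conflicting packages is to list everything ONNX-related in the environment (standard pip usage; findstr on Windows, grep on Linux/macOS):

$ pip list | findstr /i onnx     # Windows
$ pip list | grep -i onnx        # Linux/macOS

If more than one ONNX Runtime package shows up (e.g. onnxruntime and onnxruntime-gpu together), that is a likely source of conflicts.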

> The weight matrices seem to be empty when visualizing the model with https://netron.app/

This is expected because all of the weights are stored in the external data file called phi3-mini-128k-instruct-cpu-int4-rtn-block-32.onnx.data. The ONNX model (phi3-mini-128k-instruct-cpu-int4-rtn-block-32.onnx) references the external data file when loading. Netron does not show weights that are stored in an external data file.
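As a sanity check that the external data file is present and loads correctly, something along these lines should work (a sketch using the onnx Python package, run from the model folder; not part of the original thread):

import onnx

# onnx.load resolves the neighboring .onnx.data file by default, so a successful
# load with non-empty initializers means the downloaded weights are intact.
model = onnx.load("phi3-mini-128k-instruct-cpu-int4-rtn-block-32.onnx")
print("initializers:", len(model.graph.initializer))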

SreenilaRajesh commented 3 weeks ago

> Can you share your machine's details?

print(platform.machine())   -- AMD64
print(platform.version())   -- 10.0.17763
print(platform.platform())  -- Windows-10-10.0.17763-SP0
print(platform.system())    -- Windows
print(platform.processor()) -- Intel64 Family 6 Model 85 Stepping 7, GenuineIntel

> What steps did you perform to download the model?

git clone

> What packages are in your environment?

onnx - 1.16.2
onnxruntime - 1.18.1
onnxruntime-genai - 0.3.0

I tried again after removing onnx and onnxruntime, but still got the same error.

@kunal-vaishnavi

kunal-vaishnavi commented 2 weeks ago

Here are a few debugging steps to try.

  1. Can you verify that you downloaded the model correctly via git clone and that you see the same file sizes that are shown on Hugging Face? You will need git lfs installed for the weights to download correctly.

Without git lfs installed before cloning, some users have reported abnormal behavior when trying to run inference. If you ran git clone without git lfs installed, please delete your cloned repo, install git lfs, and then try cloning again (see the commands after this list).

  2. Can you insert the below line here, insert a break after this line, and run the script again?
og.set_log_options(enabled=True, model_input_values=True, model_output_values=True)

This will help narrow down whether the error is coming from ONNX Runtime or ONNX Runtime GenAI. If the model output values look abnormal when printed (e.g. a bunch of NaN values), then the issue is with ONNX Runtime.

  3. After trying the above steps with ONNX Runtime GenAI v0.3.0, can you upgrade to ONNX Runtime GenAI v0.4.0 and try again?
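For reference, the git lfs recovery steps from item 1 look roughly like this (standard git lfs usage; the clone URL is the model repo linked earlier in the thread):

$ git lfs install
$ git clone https://huggingface.co/microsoft/Phi-3-mini-128k-instruct-onnx
# Or, inside a repo that was already cloned without LFS:
$ git lfs pull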
SreenilaRajesh commented 2 weeks ago

Hi @kunal-vaishnavi

  1. Used a re-cloned repo after installing git lfs - didn't work

  2. It prints additional characters along with model_input_values, input_ids, etc. [screenshot attached]

  3. Upgraded to v0.4.0 - didn't work

kunal-vaishnavi commented 2 weeks ago

> It prints additional characters along with model_input_values, input_ids, etc.

You can add ansi_tags=False to the og.set_log_options call to suppress the additional characters. They are ANSI escape codes used for pretty printing.
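For example, the full call would then be (the same options as before, plus the extra flag):

og.set_log_options(enabled=True, model_input_values=True, model_output_values=True, ansi_tags=False)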

Can you share what is in the model_output_values? What are the logits returned?

SreenilaRajesh commented 2 weeks ago

Hi @kunal-vaishnavi, please find the model output values and logits below. [screenshot attached]

kunal-vaishnavi commented 1 week ago

From looking at this image, it appears that the present KV caches are not returning any values after the first two layers. The initial values in present.2.key and present.2.value are all zeros instead of floating-point values, and the pattern continues for the subsequent layers shown. The logits are also returning only zeros, which explains why generator.get_next_token() is always returning zero. This behavior seems to indicate that there is an issue with ONNX Runtime and not ONNX Runtime GenAI.
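One way to pinpoint where the zeros begin is to dump each layer's present outputs by name, mirroring the generator.get_output("logits") call from earlier (a hypothetical sketch; the present.{i}.key naming follows the outputs visible in the screenshot, and the 32-layer count assumes the Phi-3 mini architecture):

import numpy as np

# Run after generator.compute_logits() so the outputs are populated.
for layer in range(32):  # Phi-3 mini has 32 decoder layers (assumption)
    key = generator.get_output(f"present.{layer}.key")
    print(f"layer {layer}: max |key| = {np.abs(key).max()}")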

To reproduce the bug, I created a new D16ds_v4 VM in Azure. I verified that the VM has the same hardware as your machine. [env details attached]

In the VM, I installed Miniconda and created a new Conda environment called genuineintel. Then I installed protobuf, onnxruntime, and onnxruntime-genai with pip. To avoid the Windows conda import error, I also installed vs2015_runtime from conda-forge. Here are the commands I used.

(base) $ conda create --name genuineintel python=3.9
(base) $ conda activate genuineintel
(genuineintel) $ pip install protobuf==3.20.2
(genuineintel) $ pip install onnxruntime
(genuineintel) $ pip install onnxruntime-genai
(genuineintel) $ conda install conda-forge::vs2015_runtime
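To double-check which ONNX Runtime the environment resolved to, a quick smoke test like this can help (standard onnxruntime API; not part of the original steps):

(genuineintel) $ python -c "import onnxruntime; print(onnxruntime.__version__, onnxruntime.get_device())"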

Here's the full list of packages installed in my environment after these steps.

[package list attached]

Then I created a folder called dev and another folder called phi3_mini_128k inside the dev folder. I manually downloaded all of the Phi-3 mini 128K files for INT4 CPU from this folder into the phi3_mini_128k folder. I also downloaded the phi3-qa.py example into the dev folder.

Finally, I ran the phi3-qa.py example and saw no issues with the output.

[output screenshot attached]

Given that I see no issues with the output, it does not appear to be an issue with ONNX Runtime either. There may be an issue with the environment you're testing in, or you may be running out of memory while the ONNX model runs. Could you try creating a fresh environment and following these same steps? If that does not fix the issue, can you share your machine's specs? How much memory does your machine have? Do you have another machine you can test with?

kunal-vaishnavi commented 6 days ago

As an update on this issue, an internal user hit the same problem on a Linux machine in a Conda environment. It was resolved by uninstalling every ONNX Runtime and ONNX Runtime GenAI package present and then installing only ONNX Runtime GenAI (v0.4.0 installs ONNX Runtime for you).

# Uninstall all ONNX Runtime packages:
# 1. `onnxruntime` (stable CPU package)
# 2. `onnxruntime-gpu` (stable GPU package)
# 3. `ort-nightly` (nightly CPU package)
# 4. `ort-nightly-gpu` (nightly GPU package)
# Only one of these should be installed at any time.
# Please ensure all are uninstalled and not visible via `pip list` after this command.
$ pip uninstall -y onnxruntime onnxruntime-gpu ort-nightly ort-nightly-gpu

# Uninstall all ONNX Runtime GenAI packages:
# 1. `onnxruntime-genai` (stable CPU package)
# 2. `onnxruntime-genai-cuda` (stable CUDA package)
# 3. `onnxruntime-genai-directml` (stable DirectML package)
# Only one of these should be installed at any time.
# Please ensure all are uninstalled and not visible via `pip list` after this command.
$ pip uninstall -y onnxruntime-genai onnxruntime-genai-cuda onnxruntime-genai-directml

# Install only ONNX Runtime GenAI manually
# This command should also install ONNX Runtime as a dependency.
# For CPU, please ensure only `onnxruntime-genai` and `onnxruntime` are installed and visible via `pip list` after this command.
$ pip install onnxruntime-genai
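After these steps, a final check along these lines confirms the result (standard pip/Python usage, not from the original thread; versions will vary):

# Expect exactly two packages: onnxruntime and onnxruntime-genai
$ pip list | grep -i onnx        # findstr /i onnx on Windows

# Confirm the ONNX Runtime dependency is importable
$ python -c "import onnxruntime; print(onnxruntime.__version__)"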