microsoft / onnxruntime-inference-examples

Examples for using ONNX Runtime for machine learning inferencing.
MIT License

simple phi3 chat example #424

Closed: guschmue closed this 2 months ago

bekatan commented 3 weeks ago

Error: previous buffer is not registered

The example chatbot can retain context from previous messages in the chat if the new message is sent with "Ctrl+Enter". As I understand it, the LLM then receives a larger input_ids that contains the tokens of the previous messages as well as the new message. When I send the second message with "Ctrl+Enter", the first response token is computed and shown, and then I get Error: previous buffer is not registered. During inference I also noticed that the 3D graph in Task Manager > Performance > GPU rises steeply to 100%, at which point the error is thrown. Dedicated GPU memory usage stays at around 50-60% in the meantime.

[Image: Task Manager GPU usage graph]

I am guessing this is related to GPU buffer management. Are there any tricks to make it more memory efficient?

What is peculiar is that the failure does not correlate with the size of input_ids. In the image above, the first bump is caused by a 500+ token input without continuation, and it ran fine. The third bump is a 90-token input with continuation, and it throws Error: previous buffer is not registered.

What is causing this? How can it be fixed?

OS: Windows 11
GPU: RTX 4060 (8 GB VRAM)
Browser: Chrome 126.0.6478.63 (Official Build) (64-bit)

guschmue commented 3 weeks ago

Continuation of the dialog is not really handled yet. In theory we can use the kv_cache to avoid processing the full prompt again, but I ran into some issues with the model (at least I think the issue is with the model itself). I need to find some time to look into that.
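To illustrate what using the kv_cache would save, here is a minimal sketch in TypeScript. It is not code from the example app or from onnxruntime-web: `nextStepInputs`, `StepInputs`, and the token ids are hypothetical, and it assumes the cache already holds keys and values for the first `cachedLength` tokens of the conversation.

```typescript
// Hypothetical helper for illustration only; not part of the example app.
interface StepInputs {
  inputIds: number[];          // tokens the model still has to process
  positionIds: number[];       // absolute positions of those tokens
  attentionMaskLength: number; // cached tokens plus new tokens
}

// Assumption: the kv cache covers allTokens[0 .. cachedLength - 1].
function nextStepInputs(allTokens: number[], cachedLength: number): StepInputs {
  if (cachedLength < 0 || cachedLength > allTokens.length) {
    throw new RangeError("cachedLength must be between 0 and allTokens.length");
  }
  const inputIds = allTokens.slice(cachedLength);
  const positionIds = inputIds.map((_, i) => cachedLength + i);
  return { inputIds, positionIds, attentionMaskLength: allTokens.length };
}

// Made-up token ids: an earlier exchange followed by a new message.
const cachedTokens = [11, 12, 13, 14, 15];
const newMessage = [21, 22, 23];
const conversation = cachedTokens.concat(newMessage);

// Without a cache, the whole conversation is processed again.
const full = nextStepInputs(conversation, 0);

// With the earlier exchange cached, only the new message is processed.
const incremental = nextStepInputs(conversation, cachedTokens.length);

console.log(full.inputIds.length, incremental.inputIds.length);
```

In this sketch the attention mask still spans the whole conversation in both cases; only the number of tokens fed to the model changes.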

bekatan commented 3 weeks ago

Are you referring to the Error: [WebGPU] Kernel "[Expand] /model/attn_mask_reformat/input_ids_subgraph/Expand" failed. Error: Expand requires shape to be broadcastable to input, with the shapes in the feed like

input_ids dims: [1, seq_length]
position_ids dims: [1, seq_length]
attention_mask dims: [1, seq_length + past_sequence_length]
past_key_values.i.key dims: [1, 32, past_sequence_length, 96]
past_key_values.i.value dims: [1, 32, past_sequence_length, 96]

mentioned here?
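The shape relationships listed above can be checked before running the model. The following TypeScript sketch only does arithmetic on dimension arrays and does not call onnxruntime-web. `expectedFeedDims` and `findMismatches` are hypothetical names, the defaults of 32 heads and head size 96 are taken from the dims quoted above, and the number of layers is left as a parameter because the thread does not state it.

```typescript
type Dims = number[];

// Build the dims listed above for a given step.
// numLayers is an assumption supplied by the caller.
function expectedFeedDims(
  seqLength: number,
  pastLength: number,
  numLayers: number,
  numHeads: number = 32,
  headDim: number = 96
): Record<string, Dims> {
  const dims: Record<string, Dims> = {
    input_ids: [1, seqLength],
    position_ids: [1, seqLength],
    attention_mask: [1, seqLength + pastLength],
  };
  for (let i = 0; i < numLayers; i++) {
    dims[`past_key_values.${i}.key`] = [1, numHeads, pastLength, headDim];
    dims[`past_key_values.${i}.value`] = [1, numHeads, pastLength, headDim];
  }
  return dims;
}

// Report every input whose dims are missing or differ from the expected ones.
function findMismatches(
  actual: Record<string, Dims>,
  expected: Record<string, Dims>
): string[] {
  const problems: string[] = [];
  for (const name of Object.keys(expected)) {
    const want = expected[name];
    const got = actual[name];
    if (want === undefined) continue;
    if (got === undefined) {
      problems.push(`${name}: missing`);
    } else if (got.length !== want.length || got.some((d, i) => d !== want[i])) {
      problems.push(`${name}: got [${got.join(", ")}], expected [${want.join(", ")}]`);
    }
  }
  return problems;
}

// Example: 3 new tokens, 5 cached tokens, 2 layers (all values made up).
const expected = expectedFeedDims(3, 5, 2);
const feed = expectedFeedDims(3, 5, 2);
feed["attention_mask"] = [1, 3]; // mask that covers only the new tokens
const problems = findMismatches(feed, expected);
console.log(problems);
```

A mask that covers only the new tokens is just one kind of mismatch this check would flag. Whether it is what triggers the Expand error is not established in this thread.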

guschmue commented 3 weeks ago

yes, that is the one