neilsun2009 opened this issue 1 month ago (status: Open)
Hi @neilsun2009,
Could you please confirm if you are testing this on a physical Xiaomi 14 Pro device or an emulator? This information will help us better understand the issue.
Thank you!!
Hi @kuaashish, it's tested on a physical device.
Have I written custom code (as opposed to using a stock example script provided in MediaPipe)
None
OS Platform and Distribution
Android 14
Mobile device if the issue happens on mobile device
Xiaomi 14 Pro
Browser and version if the issue happens on browser
No response
Programming Language and version
Kotlin
MediaPipe version
0.10.14
Bazel version
No response
Solution
LLM Inference
Android Studio, NDK, SDK versions (if issue is related to building in Android environment)
SDK 34
Xcode & Tulsi version (if issue is related to building for iOS)
No response
Describe the actual behavior
The model generates repeated tokens when the prompt is too long
Describe the expected behaviour
The model generates a correct, non-repeating response regardless of prompt length (within the context window)
Standalone code/steps you may have used to try to get what you need
Other info / Complete Logs
I'm building a RAG-based mobile app using the MediaPipe LLM Inference API, with the Gemma 1.1 2B int8 GPU checkpoint downloaded directly from Kaggle.
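The setup follows the standard LLM Inference initialization pattern; here is a minimal sketch (the on-device model path and the maxTokens value are illustrative placeholders, not taken from my actual code):

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Standard LLM Inference setup; path and maxTokens are placeholders.
fun buildLlm(context: Context): LlmInference {
    val options = LlmInference.LlmInferenceOptions.builder()
        // GPU int8 checkpoint downloaded from Kaggle
        .setModelPath("/data/local/tmp/llm/gemma-1.1-2b-it-gpu-int8.bin")
        .setMaxTokens(1024) // placeholder; actual value not stated here
        .build()
    return LlmInference.createFromOptions(context, options)
}
```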
Based on my experiments, inference works fine with a short prompt, e.g. when I retrieve only one document chunk and clip it to 100 characters, as in the following prompt:
The output is fine:
But if I keep the whole retrieved text chunk, as in this prompt:
The model output is something like this, and it doesn't stop until the max token limit is reached:
Since Gemma 1.1 2B supports a context window of 8k tokens, the second prompt should not be a problem. Also, the CPU version of the same model works fine even with 5 retrieved chunks, so I assume the problem lies somewhere in the GPU execution path of MediaPipe LLM Inference.
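To make the two cases concrete, a minimal repro would look roughly like this (the prompt template, `retrievedChunk` variable, and 100-character clip are hypothetical stand-ins for my RAG pipeline, not the exact prompts above):

```kotlin
// Hypothetical repro contrasting the two cases described above.
// `retrievedChunk` stands in for one RAG retrieval result.
fun repro(llm: LlmInference, retrievedChunk: String, question: String) {
    val shortPrompt = "Context: ${retrievedChunk.take(100)}\nQuestion: $question"
    val longPrompt = "Context: $retrievedChunk\nQuestion: $question"

    // Short context: produces a normal answer on the GPU checkpoint.
    println(llm.generateResponse(shortPrompt))

    // Full chunk: on the GPU checkpoint the output degenerates into
    // repeated tokens and only stops when maxTokens is exhausted;
    // the CPU checkpoint handles the same prompt fine.
    println(llm.generateResponse(longPrompt))
}
```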