Describe the feature and the current behaviour/state
At the moment, for Gen AI use cases in the browser (e.g. Gemma 2B with the MediaPipe LLM Inference API), there's no way for a developer to know ahead of time whether a model can actually run on the device within a reasonable time. This is an issue because:
For Gen AI, the model download is very large (almost 1.3 GB for Gemma 2B, many times the recommended web app size).
Running an inference on a low-spec device, or on a device already under heavy load, may be very slow, or may even crash the device (on mobile).
This leads to a subpar UX: a user may wait through a large model download only to find that inferences don't run within a reasonable time on their device, or that they even crash it.
What if we ran a mini-benchmark ahead of the model download? This is an idea beaufortfrancois@ suggested for Transformers.js: https://github.com/xenova/transformers.js/pull/545#issuecomment-2147465443.
This would involve running the model code with zeroed-out weights.
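As a rough illustration of the idea (this is not MediaPipe code; the function, sizes, and workload are all hypothetical), a mini-benchmark could time a representative compute kernel over zero-filled buffers shaped like real weights, so device speed is measured without downloading anything:

```javascript
// Hypothetical mini-benchmark sketch: time a matrix-vector multiply over
// zeroed-out "weights" as a cheap proxy for inference speed on this device.
// The dimensions and iteration count here are illustrative only.
function miniBenchmark(dim = 256, iterations = 4) {
  const weights = new Float32Array(dim * dim); // zero-filled, like the real weights' shape
  const input = new Float32Array(dim).fill(1);
  const output = new Float32Array(dim);

  const start = performance.now();
  for (let it = 0; it < iterations; it++) {
    for (let row = 0; row < dim; row++) {
      let acc = 0;
      for (let col = 0; col < dim; col++) {
        acc += weights[row * dim + col] * input[col];
      }
      output[row] = acc;
    }
  }
  return performance.now() - start; // elapsed milliseconds
}
```

In practice the benchmark would run the actual model graph (e.g. on WebGPU) with zeroed weights rather than a hand-rolled loop, but the principle is the same: only compute speed is exercised, not the download.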
Will this change the current API? How?
Yes: we'd want to expose the output of the mini-benchmark to developers. This output could be abstracted behind a few dev-friendly performance buckets, e.g. high, medium, low. Developers could then overlay their own logic on top of that output.
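A sketch of what that developer-facing surface could look like, assuming the benchmark reports some throughput metric (the function names, thresholds, and the tokens-per-second metric below are all hypothetical, not part of any existing API):

```javascript
// Hypothetical mapping from a raw benchmark result to dev-friendly
// performance buckets; the thresholds are illustrative only.
function performanceBucket(tokensPerSecond) {
  if (tokensPerSecond >= 10) return "high";
  if (tokensPerSecond >= 3) return "medium";
  return "low";
}

// A developer could overlay their own logic, e.g. only offering the
// large on-device model download when the device looks capable:
function shouldDownloadModel(tokensPerSecond) {
  return performanceBucket(tokensPerSecond) !== "low";
}
```

On a low bucket, an app might fall back to a smaller model or a server-side endpoint instead of downloading 1.3 GB it can't use.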
MediaPipe Solution (you are using)
MediaPipe LLM Inference API
Programming language
TBD
Are you willing to contribute it
Yes
Who will benefit with this feature?
All developers for on-device/in-browser use cases
Please specify the use cases for this feature
All on-device/in-browser use cases
Any Other info
No response