google-ai-edge / mediapipe

Cross-platform, customizable ML solutions for live and streaming media.
https://mediapipe.dev
Apache License 2.0

Pre-download mini-benchmark for in-browser (LLM) inference performance #5468

Open maudnals opened 3 weeks ago

maudnals commented 3 weeks ago

MediaPipe Solution (you are using)

MediaPipe LLM Inference API

Programming language

TBD

Are you willing to contribute it

Yes

Describe the feature and the current behaviour/state

At the moment, for Gen AI use cases in the browser (e.g. running Gemma 2B with the MediaPipe LLM Inference API), there's no way for a developer to know ahead of time whether the model can actually run on the user's device within a reasonable time. This is an issue because:

  1. For Gen AI, the model download is very large (almost 1.3 GB for Gemma 2B, which is many times the recommended web app size).
  2. Running inference on a low-spec device, or on a device that is already busy with other work, may be very slow or may even crash the device (on mobile).

This leads to a subpar UX: a user may wait through a large model download only to find that inference is too slow on their device, or that it even crashes their device. What if we ran a mini-benchmark ahead of the model download? This is an idea beaufortfrancois@ suggested for Transformers.js: https://github.com/xenova/transformers.js/pull/545#issuecomment-2147465443. It would involve running the model code with zeroed-out weights, so the device's compute performance can be measured without downloading the real weights.
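A minimal sketch of what the developer-facing flow could look like, assuming a hypothetical `runDecodeSteps` helper that executes a few decode steps against a zero-weight model; none of these names are part of the current MediaPipe LLM Inference API, and the bucket thresholds are made up for illustration:

```ts
// Hypothetical sketch only: runZeroWeightBenchmark, PerfBucket, and
// runDecodeSteps are illustrative names, not existing MediaPipe APIs.

type PerfBucket = 'high' | 'medium' | 'low';

interface MiniBenchmarkResult {
  bucket: PerfBucket;
  tokensPerSecond: number;
}

/**
 * Times a short run of decode steps against a model whose weights are
 * zeroed out, so only a tiny "shape-only" artifact needs to be fetched.
 * The timing approximates how the real model would perform on this device.
 */
async function runZeroWeightBenchmark(
  // Hypothetical helper: runs `numSteps` decode steps with zeroed weights
  // and resolves with the number of tokens generated.
  runDecodeSteps: (numSteps: number) => Promise<number>,
): Promise<MiniBenchmarkResult> {
  const numSteps = 16; // keep the probe short
  const start = performance.now();
  const tokens = await runDecodeSteps(numSteps);
  const elapsedSeconds = (performance.now() - start) / 1000;
  const tokensPerSecond = tokens / elapsedSeconds;

  // Thresholds are invented for illustration; real buckets would need
  // calibration against known devices.
  const bucket: PerfBucket =
    tokensPerSecond > 10 ? 'high' : tokensPerSecond > 3 ? 'medium' : 'low';
  return { bucket, tokensPerSecond };
}
```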

Will this change the current API? How?

Yes: we'd want to expose the output of the mini-benchmark to developers. That output could be abstracted into a few developer-friendly performance buckets, e.g. high, medium, low, and developers could then layer their own logic on top of it.
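For example, a developer could gate the large model download on the bucket, as in the sketch below (reusing the hypothetical `runZeroWeightBenchmark` from the previous snippet; the decision thresholds and return values are illustrative only):

```ts
// Hypothetical usage of the bucketed benchmark output sketched above.
async function chooseInferencePath(
  runDecodeSteps: (numSteps: number) => Promise<number>,
): Promise<'on-device' | 'ask-user' | 'server-fallback'> {
  const { bucket } = await runZeroWeightBenchmark(runDecodeSteps);
  switch (bucket) {
    case 'high':
      // Device looks fast enough: go ahead with the ~1.3 GB download.
      return 'on-device';
    case 'medium':
      // Borderline: let the user decide whether to wait for the download.
      return 'ask-user';
    default:
      // Too slow: fall back to a server-hosted or smaller model.
      return 'server-fallback';
  }
}
```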

Who will benefit with this feature?

All developers building on-device/in-browser Gen AI use cases

Please specify the use cases for this feature

All on-device/in-browser use cases

Any Other info

No response