google-ai-edge / mediapipe-samples


Is Gemma on-device really this slow? #379

Open MJ1998 opened 3 months ago

MJ1998 commented 3 months ago

I used llm_inference sample with gemma-2b-it-cpu-int4.bin on Pixel 8 Pro emulator.

Prefill seems to take minutes.

Pixel 8 Pro (emulator) configuration: RAM 22 GB, VM heap 512 MB

Reference video https://github.com/googlesamples/mediapipe/assets/22965002/c7730dba-48e8-4eec-ae68-fe847d2778f2
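For anyone reproducing this: the llm_inference sample loads the model from local device storage, so the model has to be pushed to the device (or emulator) before launching the app. A minimal sketch, assuming the target path used by the sample's README — check the sample's model-loading code for the exact path your version expects:

```shell
# Download gemma-2b-it-cpu-int4.bin first (Kaggle, license acceptance required).
# Push it to the path the llm_inference sample reads from
# (the /data/local/tmp/llm/ path is an assumption from the sample docs;
# verify against your sample version).
adb push gemma-2b-it-cpu-int4.bin /data/local/tmp/llm/gemma-2b-it-cpu-int4.bin

# Confirm the file landed and its size looks sane
adb shell ls -lh /data/local/tmp/llm/
```

Note that pushing a multi-gigabyte model into an emulator's virtual disk can itself be slow, and the emulator's storage I/O is another place where performance diverges from a physical device.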

PaulTR commented 3 months ago

Oh boy, no, definitely not. It's not really intended to be run on the emulator, so your results are going to vary wildly. Here's a presentation I gave last week with a slide showing Gemma running on a physical device in real time (not sped up or altered, just recorded and turned into a GIF): https://docs.google.com/presentation/d/1uetAcmkNWDXHEJaCt6WoBflDM1iMUU1N1ahzQof6PLM/edit#slide=id.g26cd5c56ad9_1_30

MJ1998 commented 3 months ago

I saw a post suggesting that an emulator with increased RAM performs similarly. Here it is - link - search for "Creating an Android Emulator with Increased RAM".

What's the difference that makes a physical device so much faster? Is it specifically optimized for Gemma?

Thanks for the prompt response!

PaulTR commented 3 months ago

No idea at that level of detail. My general experience over 10+ years of Android development, though, has always been: "Eh, emulators are OK, but never as good as a real device."

MJ1998 commented 3 months ago

Time to first token is still pretty slow compared to the video you shared: around 15 seconds for both the 4-bit and 8-bit CPU versions of Gemma 2B. The physical device I'm using is a Pixel 7 Pro.