mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Question] Android app issue #3010

Open j0h0k0i0m opened 2 weeks ago

j0h0k0i0m commented 2 weeks ago

❓ General Questions

Hello, I have some questions regarding the Android app.

  1. Currently I am using q4f16_0 quantization, but there is a significant difference in prefill tokens per second compared to q4f16_1. I am using the phi-3.5-mini model as the base, and when testing q4f16_1 the device (Galaxy S24 Ultra) even shuts down entirely. I understand that q4f16_1 generally offers better performance, so I'd like to ask whether there are any ways to improve this.

  2. Is the repetition penalty working correctly? I couldn't find a parameter for it in the ChatCompletionRequest within the app, so I'm unsure whether it is actually applied. Looking at the generated output, the model keeps producing repetitive sequences in a similar style, which suggests the penalty may not be taking effect (a sketch of how I issue requests is below).
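
For reference, here is a minimal sketch of how a request could be issued through the Android MLCEngine and where an OpenAI-style penalty would be passed. The frequency_penalty / presence_penalty names are assumptions borrowed from the Python/iOS OpenAIProtocol (which has no dedicated repetition_penalty field); whether the Kotlin create(...) overload accepts them would need to be checked against OpenAIProtocol.kt in the Android SDK.

```kotlin
import ai.mlc.mlcllm.MLCEngine
import ai.mlc.mlcllm.OpenAIProtocol
import ai.mlc.mlcllm.OpenAIProtocol.ChatCompletionMessage
import kotlinx.coroutines.runBlocking

// Standalone sketch, not Android lifecycle code.
fun main() = runBlocking {
    val engine = MLCEngine()
    // Model directory and library name are placeholders; in the app they
    // come from the bundled package config.
    engine.reload(
        "/data/local/tmp/Phi-3.5-mini-instruct-q4f16_0-MLC",
        "phi3_q4f16_0_lib"
    )

    // Assumption: create(...) accepts OpenAI-style penalty fields
    // (frequency_penalty / presence_penalty), as the Python and iOS
    // protocol implementations do. These are the closest equivalents
    // to a repetition penalty in the OpenAI schema.
    val responses = engine.chat.completions.create(
        messages = listOf(
            ChatCompletionMessage(
                role = OpenAIProtocol.ChatCompletionRole.user,
                content = "Summarize the plot of Hamlet in three sentences."
            )
        ),
        frequency_penalty = 0.5f,  // penalize tokens by how often they already appeared
        presence_penalty = 0.3f    // penalize any token that has appeared at all
    )

    // Stream the deltas; content?.asText() mirrors the iOS API and the
    // MLCChat sample, but verify the accessor name in the Kotlin sources.
    for (response in responses) {
        response.choices.firstOrNull()?.delta?.content?.let { print(it.asText()) }
    }
    engine.unload()
}
```

If these penalty fields turn out not to be exposed in the Kotlin ChatCompletionRequest, that would explain why the output stays repetitive regardless of the app settings.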

Thanks.

Hzfengsy commented 2 weeks ago

On mobile phones, I don't think q4f16_1 offers better performance. For the prefill stage, q4f16_0 provides much better performance than q4f16_1.

j0h0k0i0m commented 2 weeks ago

@Hzfengsy

Thank you for replying to the issue.

It's understandable to use q4f16_0 because of the prefill stage, but I recall an earlier issue stating that its decoding performance is lower. Currently, prefill tokens per second for the phi-3.5-mini model with q4f16_1 is below 1, and I would like to get a better quality response than q4f16_0 while keeping a reasonable prefill speed. Is there any way to do this?