j0h0k0i0m opened this issue 2 weeks ago
❓ General Questions

Hello, I have some questions regarding the Android app.

1. Currently, I am using q4f16_0 quantization, but there is a significant difference in prefill tokens per second compared to q4f16_1. I am using the phi-3.5-mini model as a base, but when testing q4f16_1, the device (a Galaxy S24 Ultra) even shuts down entirely. I understand that q4f16_1 generally offers better performance, so I would like to ask whether there are any ways to improve this.

2. Is the repetition penalty working correctly? I couldn't find a parameter for it in the ChatCompletionRequest within the app, so I'm unsure whether it takes effect. The generated sentences keep repeating a continuous sequence in a similar style, which suggests the penalty may not be applied properly (see the sketch below).

Thanks.
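For context on question 2, here is a minimal sketch of how repetition-related penalties are passed through MLC's OpenAI-style chat completion API, shown with the Python MLCEngine purely for illustration. The app's Kotlin ChatCompletionRequest mirrors this protocol but may not surface every field, which is exactly the question above; the sketch therefore sticks to the standard frequency_penalty / presence_penalty knobs, and the model id is an assumed stand-in.

```python
# Minimal sketch, assuming the Python MLCEngine; the Android app's Kotlin
# ChatCompletionRequest mirrors this OpenAI-style protocol but may not
# expose every field.
from mlc_llm import MLCEngine

# Assumed model id; substitute your own MLC build of phi-3.5-mini.
model = "HF://mlc-ai/Phi-3.5-mini-instruct-q4f16_1-MLC"
engine = MLCEngine(model)

for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "Tell me a short story."}],
    model=model,
    stream=True,
    frequency_penalty=1.0,  # penalize tokens by how often they have appeared
    presence_penalty=0.5,   # penalize tokens that have appeared at all
):
    for choice in response.choices:
        print(choice.delta.content or "", end="", flush=True)

engine.terminate()
```

If the output still loops with these penalties set, that would support the suspicion that the penalty fields are not being wired through on the app side.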
On mobile phones, I don't think q4f16_1 offers better performance. For the prefill stage, q4f16_0 provides much better performance than q4f16_1.

@Hzfengsy Thank you for replying to the issue. Using q4f16_0 for the sake of the prefill stage is understandable, but I recall an earlier issue reporting that its decoding performance is lower. Currently, the prefill tokens per second for the phi-3.5-mini model (q4f16_1) are below 1, and I would like to get better response quality than q4f16_0 while keeping a usable prefill speed. Is there any way to do this?
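One way to sanity-check the prefill/decode trade-off outside the phone is to time both quantization builds with the Python MLCEngine. This is a rough sketch under assumptions: the two Hugging Face repo ids are hypothetical stand-ins for your own q4f16_0 and q4f16_1 builds of phi-3.5-mini, and time-to-first-chunk is only a proxy for prefill latency.

```python
# Rough timing sketch: compares time-to-first-chunk (a proxy for prefill
# latency) and total streaming time across two quantization builds.
import time
from mlc_llm import MLCEngine

PROMPT = "Summarize the trade-offs of 4-bit quantization in three sentences."

def measure(model: str) -> None:
    engine = MLCEngine(model)
    start = time.perf_counter()
    first_chunk_at = None
    n_chunks = 0
    for response in engine.chat.completions.create(
        messages=[{"role": "user", "content": PROMPT}],
        model=model,
        stream=True,
    ):
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter()  # prefill roughly done here
        n_chunks += 1
    total = time.perf_counter() - start
    ttfc = (first_chunk_at - start) if first_chunk_at else total
    print(model)
    print(f"  time to first chunk: {ttfc:.2f}s")
    print(f"  {n_chunks} chunks streamed in {total:.2f}s")
    engine.terminate()

# Hypothetical repo ids; point these at your actual q4f16_0 / q4f16_1 builds.
for m in (
    "HF://mlc-ai/Phi-3.5-mini-instruct-q4f16_0-MLC",
    "HF://mlc-ai/Phi-3.5-mini-instruct-q4f16_1-MLC",
):
    measure(m)
```

Comparing the two printouts on the same prompt gives a concrete view of the prefill-versus-decode trade-off being discussed here, before committing either build to the Android package.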