Thank you for your interest in our work!
The `llama-1.1b-w4a8-sym` model has an average latency of 40 ms per token over 100 runs on a Samsung Galaxy S24, which works out to roughly 1000 / 40 ≈ 25 tok/s. The video we show in the README.md achieves 23 tok/s, which is in the same ballpark. However, these numbers may change across devices and runs.
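If it helps, here is a minimal sketch of how such a 100-run average can be measured and converted to tok/s. The `decode_one_token` callable is a hypothetical stand-in for the actual on-device decode step; the real benchmarking flow is in device/README.md.

```python
import time

NUM_RUNS = 100

def benchmark_decode(decode_one_token):
    """Average per-token decode latency over NUM_RUNS runs (hypothetical harness)."""
    latencies_ms = []
    for _ in range(NUM_RUNS):
        start = time.perf_counter()
        decode_one_token()  # stand-in for one on-device decode step
        latencies_ms.append((time.perf_counter() - start) * 1000)
    avg_ms = sum(latencies_ms) / len(latencies_ms)
    # 40 ms/token corresponds to 1000 / 40 = 25 tok/s
    print(f"avg latency: {avg_ms:.1f} ms/token -> {1000 / avg_ms:.1f} tok/s")
    return avg_ms
```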
We're working on deploying `Gemma-2-2b-it`, which is well-tuned for chatting. Preliminary experiments show it has a latency similar to `TinyLlama-1.1B-Chat`, i.e., around 40 ms per token, but there are still technical issues to be resolved, so this number may also change.
While the Gemma paper reports an MMLU score of 42.3 with 5 shots for the 2B model, the results we report here are all zero-shot. In our experiments, Gemma 2B achieves a zero-shot MMLU score of 28, as shown in our paper. These numbers can be reproduced using the commands in eval/. Please note that to reproduce the full-precision (FP) result of the original model, the command is:
```bash
# evaluate the original full-precision Gemma 2B checkpoint
CKPT=ORIGINAL_GEMMA_2B_PATH
CUDA_VISIBLE_DEVICES=0 python eval/harness_eval.py \
    --tasks "wikitext,arc_challenge,hellaswag,hendrycksTest*" \
    --mode hf --hf_path ${CKPT} --output_dir ${OUTPUT_DIR}
```
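For reference, roughly the same zero-shot evaluation can be run directly through the lm-eval Python API. This is only a sketch, not the repo's own script: it assumes lm-eval >= 0.4, where the MMLU tasks are grouped under `mmlu` rather than the older `hendrycksTest*` names used by the command above, and `ORIGINAL_GEMMA_2B_PATH` is again a placeholder for the checkpoint path.

```python
import lm_eval

# Sketch: zero-shot evaluation with the lm-eval Python API (assumes lm-eval >= 0.4).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ORIGINAL_GEMMA_2B_PATH",  # path to the original checkpoint
    tasks=["wikitext", "arc_challenge", "hellaswag", "mmlu"],
    num_fewshot=0,  # the numbers reported in the paper are zero-shot
)
print(results["results"])
```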
Correct. To benchmark the on-device performance (e.g., average latency) and to export the on-device model, please check out device/README.md.
Feel free to reopen the issue if there are any further questions.
This is a very valuable project for research. I tried it out: the C++ demo of `llama-1.1b-w4a8` runs at 18 tok/s. Although the model's output is not what I asked for, the overall syntax seems fine. I would like to ask,