Thank you for your interest in our work!
The `llama-1.1b-w4a8-sym` model has an average latency of 40 ms per token over 100 runs on a Samsung Galaxy S24, which works out to roughly 1000 / 40 ≈ 25 tok/s. The video we show in the README.md achieves 23 tok/s, which is in the same ballpark. However, these numbers may change across devices and runs.
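If it helps, here is a minimal sketch of how such a 100-run average can be measured and converted to tok/s. The `decode_one_token` callable is a hypothetical stand-in for the actual on-device decode step; the real benchmarking flow is in device/README.md.

```python
import time

NUM_RUNS = 100

def benchmark_decode(decode_one_token):
    """Average per-token decode latency over NUM_RUNS runs (hypothetical harness)."""
    latencies_ms = []
    for _ in range(NUM_RUNS):
        start = time.perf_counter()
        decode_one_token()  # stand-in for one on-device decode step
        latencies_ms.append((time.perf_counter() - start) * 1000)
    avg_ms = sum(latencies_ms) / len(latencies_ms)
    # 40 ms/token corresponds to 1000 / 40 = 25 tok/s
    print(f"avg latency: {avg_ms:.1f} ms/token -> {1000 / avg_ms:.1f} tok/s")
    return avg_ms
```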
We're working on deploying `Gemma-2-2b-it`, which is well-tuned for chatting. Preliminary experiments show it has a latency similar to `TinyLlama-1.1B-Chat`, i.e., around 40 ms per token, but there are still technical issues to be resolved, so this number may also change.
While the Gemma paper reports an MMLU score of 42.3 with 5 shots for the 2B model, the results we report here are all zero-shot. In our experiments, Gemma 2B achieves a zero-shot MMLU score of 28, as shown in our paper. These numbers can be reproduced using the commands in eval/. Please note that to reproduce the full-precision (FP) result of the original model, the command is:
```bash
# evaluate the original full-precision Gemma 2B checkpoint
CKPT=ORIGINAL_GEMMA_2B_PATH
CUDA_VISIBLE_DEVICES=0 python eval/harness_eval.py \
    --tasks "wikitext,arc_challenge,hellaswag,hendrycksTest*" \
    --mode hf --hf_path ${CKPT} --output_dir ${OUTPUT_DIR}
```
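For reference, roughly the same zero-shot evaluation can be run directly through the lm-eval Python API. This is only a sketch, not the repo's own script: it assumes lm-eval >= 0.4, where the MMLU tasks are grouped under `mmlu` rather than the older `hendrycksTest*` names used by the command above, and `ORIGINAL_GEMMA_2B_PATH` is again a placeholder for the checkpoint path.

```python
import lm_eval

# Sketch: zero-shot evaluation with the lm-eval Python API (assumes lm-eval >= 0.4).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ORIGINAL_GEMMA_2B_PATH",  # path to the original checkpoint
    tasks=["wikitext", "arc_challenge", "hellaswag", "mmlu"],
    num_fewshot=0,  # the numbers reported in the paper are zero-shot
)
print(results["results"])
```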
Correct. To benchmark the on-device performance (e.g., average latency) and to export the on-device model, please check out device/README.md.
Feel free to reopen the issue if there are any further questions.
This is a very valuable project for research. I tried it out: the C++ demo of `llama-1.1b-w4a8` runs at 18 tok/s. Although the model's output is not what I asked for, the overall syntax seems fine. I would like to ask,