Closed · ascendpoet closed 2 weeks ago
Thank you for your interest in our work! The instructions are given in the README. If you are interested in reproducing the numbers reported in our paper:
For LongBench, you can obtain the results with:

```bash
# In our paper, K_BITS == V_BITS == 2, GROUP_LENGTH == 32, RESIDUAL_LENGTH == 128
bash scripts/long_test.sh {GPU_ID} {K_BITS} {V_BITS} {GROUP_LENGTH} {RESIDUAL_LENGTH} {MODEL_NAME}
python eval_long_bench.py --model {MODEL}  # MODEL is the dir name under pred/
```

Currently, it supports Llama-family models and the Mistral model.
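To make the knobs above concrete, here is a minimal NumPy sketch of what `K_BITS`, `GROUP_LENGTH`, and `RESIDUAL_LENGTH` control. This is a simplified 1-D illustration, not the repo's actual kernel (KIVI quantizes keys per-channel and values per-token): older cache entries are quantized group-wise at low bit-width, while the most recent `RESIDUAL_LENGTH` entries stay in full precision.

```python
# Hypothetical illustration of KIVI's quantization knobs (not the repo's code).
import numpy as np

K_BITS = 2             # paper setting: 2-bit keys
GROUP_LENGTH = 32      # values per quantization group
RESIDUAL_LENGTH = 128  # most recent entries kept in full precision

def quantize_group(x, bits):
    """Asymmetric uniform quantization of one group to `bits` bits."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize_group(q, scale, lo):
    return q.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
cache = rng.standard_normal(512).astype(np.float32)  # toy 1-D "key cache"

quantized_part = cache[:-RESIDUAL_LENGTH]  # older entries: quantized
residual_part = cache[-RESIDUAL_LENGTH:]   # recent entries: full precision

recon = []
for i in range(0, len(quantized_part), GROUP_LENGTH):
    g = quantized_part[i:i + GROUP_LENGTH]
    q, scale, lo = quantize_group(g, K_BITS)
    recon.append(dequantize_group(q, scale, lo))
recon = np.concatenate(recon + [residual_part])

err = np.abs(recon - cache).max()
print(f"max reconstruction error: {err:.3f}")
```

The residual window is reconstructed exactly; only the older, quantized entries incur error, which shrinks as `K_BITS` grows or `GROUP_LENGTH` shrinks.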
For tasks like GSM8K, CoQA, and TruthfulQA, you can obtain the results with:

```bash
git checkout lmeval  # switch to the existing lmeval branch (not `-b`, which would create a new one)
git pull
cd lm-evaluation-harness
pip install -e .
cd ..
# We report TASK in {coqa, truthfulqa_gen, gsm8k} in our paper.
# If using the KIVI implementation, set K_BITS and V_BITS to 2 or 4.
# If using the baseline, set K_BITS and V_BITS to 16.
bash scripts/lmeval_test.sh {GPU_ID} {K_BITS} {V_BITS} {GROUP_LENGTH} {RESIDUAL_LENGTH} {TASK} {MODEL_NAME}
```
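As an illustrative invocation, filling the placeholders with the paper's settings (the GPU id and model name below are example choices, not prescribed values):

```bash
# Hypothetical example: GPU 0, 2-bit K/V, group 32, residual 128, gsm8k task.
# The model name is only an example; substitute the one you are evaluating.
bash scripts/lmeval_test.sh 0 2 2 32 128 gsm8k meta-llama/Llama-2-7b-hf
```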
Let me know if you cannot reproduce our results or have further questions.
This project is excellent. Could you provide an accuracy-testing interface? It would save a lot of time in accuracy testing.