opengear-project / GEAR

GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
MIT License

Can't reproduce the benchmarks #8

Closed cyLi-Tiger closed 3 months ago

cyLi-Tiger commented 3 months ago

Thanks for the work! I'm trying to reproduce the experiments in your paper.

I used llama2-7b-chat and the default command in /GEAR/GenerationBench/GenerationTest/run.sh to run evaluation_gsm8k.py, and got the accuracy shown in the first attached screenshot.

I then changed compress_method to None and got the result in the second screenshot.

The accuracy degradation seems much larger than the results in Table 2, where GSM8K accuracy drops by only ~0.4. I'm running on a single A100 80G; please correct me if I've misunderstood anything!

Besides, I have some questions:

  1. It takes longer to complete gsm8k with the quantized model. Is this expected, or is it just that there is no kernel for 4-bit computation, so the extra quantization operations slow down inference?
  2. How do I actually measure the compression ratio during inference? The compression rate in the paper seems to come from the formula you defined (see the sketch after this list for what I mean by measuring it directly).
  3. What's the difference between TrueCompression and Simulated?
  4. In Appendix C you mention a 2.8x throughput improvement; how can I reproduce that?
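
For reference, here is a rough sketch of what I mean by measuring the ratio directly: count the bytes of the fp16 cache versus the bytes a compressed layout would need. The shapes and the 4-bit + low-rank layout below are my own illustrative guesses at GEAR's ingredients, not the paper's exact formula:

```python
import torch

def tensor_bytes(t: torch.Tensor) -> int:
    """Storage size of a tensor in bytes."""
    return t.numel() * t.element_size()

# Toy KV cache: 32 layers, batch 1, 32 heads, 1024 cached tokens,
# head_dim 128, fp16.
num_layers = 32
k = torch.randn(1, 32, 1024, 128, dtype=torch.float16)
fp16_bytes = num_layers * 2 * tensor_bytes(k)  # keys + values

# Hypothetical compressed layout: 4-bit values packed two per byte,
# an fp16 scale + zero-point per group of 128 values, plus a rank-4
# fp16 low-rank term per head. Sizes are illustrative, not GEAR's
# actual accounting.
group_size, rank, num_heads, seq_len, head_dim = 128, 4, 32, 1024, 128
quant_bytes = k.numel() // 2 + (k.numel() // group_size) * 2 * 2
lowrank_bytes = rank * (seq_len + head_dim) * num_heads * 2
compressed_bytes = num_layers * 2 * (quant_bytes + lowrank_bytes)

print(f"measured compression ratio ≈ {fp16_bytes / compressed_bytes:.2f}x")
```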
HaoKang-Timmy commented 3 months ago

Your fp16 baseline is already wrong, so you definitely cannot reproduce the result. Since you do not provide any details about your experiment setup, I assume you just replaced the model name in the run.sh command. run.sh runs gsm8k-cot, while Table 2 reports gsm8k-zeroshot results, which are quite different.
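
To illustrate the difference (these are not the exact prompts our script builds, just the general shape of the two settings):

```python
question = "Natalia sold clips to 48 of her friends in April ..."

# Zero-shot: the model must answer directly from the bare question.
zeroshot_prompt = f"Question: {question}\nAnswer:"

# CoT: few-shot exemplars with worked reasoning steer the model to
# reason step by step before answering, which changes GSM8K accuracy
# substantially. Exemplar content here is a placeholder.
cot_prompt = (
    "Question: <worked example>\n"
    "Answer: Let's think step by step. <reasoning> The answer is 72.\n\n"
    f"Question: {question}\nAnswer: Let's think step by step."
)
```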

HaoKang-Timmy commented 3 months ago

Also, many of your questions are already answered in the README file and the paper. I would suggest reading those first.