Louis-y-nlp opened this issue 7 months ago
I think a single 32GB GPU can hold an fp16 version of the llama-2-13b model.
You can set `lade.config_lade(LEVEL=5, WINDOW_SIZE=10, GUESS_SET_SIZE=10, DEBUG=1)`,
or something smaller, for a 13b model. The default setting is too costly for a 13b model; a lighter setup is sketched below.
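For reference, here is a minimal sketch of how that lighter configuration plugs into a generation script, modeled on the repo's minimal.py. The checkpoint path is a placeholder, I'm assuming the usual `lade.augment_all()` entry point, and the values are a starting point rather than tuned numbers:

```python
import torch
import lade
from transformers import AutoModelForCausalLM, AutoTokenizer

lade.augment_all()  # patch transformers' generation before loading the model

# Lighter lookahead setting for a 13b model on a single 32GB GPU.
lade.config_lade(LEVEL=5, WINDOW_SIZE=10, GUESS_SET_SIZE=10, DEBUG=1)

model_path = "path/to/llama-2-13b"  # placeholder checkpoint path
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="cuda"
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```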
@Viol2000 What key factors would drive a better configuration for models of different sizes, like you've mentioned above?
Hi @yhyu13. You can check Table 1 in our blog. We require a large amount of extra FLOPs to predict tokens. When the GPU is weak or the model is larger, we need to reduce this cost (and, correspondingly, predict fewer tokens), or it will cause a slowdown.
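To make "large extra FLOPs" concrete, here is a rough back-of-envelope sketch. The cost expression is my own approximation of the lookahead and verification branch sizes, not an exact count from the paper or the code:

```python
# Rough approximation (an assumption, not the exact cost model):
# the lookahead branch processes ~WINDOW_SIZE * (LEVEL - 1) extra tokens and
# the verification branch ~GUESS_SET_SIZE * (LEVEL - 1) extra tokens per step.
def extra_tokens_per_step(level: int, window_size: int, guess_set_size: int) -> int:
    return (window_size + guess_set_size) * (level - 1)

# A heavy (default-like) setting vs. the lighter one suggested above:
print(extra_tokens_per_step(7, 20, 20))   # 240 extra tokens per step
print(extra_tokens_per_step(5, 10, 10))   # 80 extra tokens per step
```

Under this estimate, the lighter setting cuts the per-step overhead by roughly 3x, which is why it matters on a GPU without spare compute.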
@Viol2000 Thank you for your assistance. After adjusting the parameters, I observed a slight improvement in inference speed, from 18 tokens/s to approximately 21 tokens/s.
The speedup still seems quite low. Adjusting the hyperparameters may help further, but I think the main reason is that the V100 does not have the spare FLOPs to run a heavy lookahead and verification branch for a 13b model.
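One way to check whether a given setting actually helps on a specific GPU is to measure raw throughput with and without lookahead enabled. A self-contained sketch (the helper name is mine, not part of the library):

```python
import time
import torch

def tokens_per_second(model, tokenizer, prompt, max_new_tokens=256):
    """Crude throughput: generated tokens divided by wall-clock seconds."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
    return new_tokens / elapsed
```

Run it once with lookahead disabled and once per candidate configuration; whichever configuration gives the highest tokens/s on your V100 wins.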
Thanks for your work. I used your demo code, but I did not observe any speed improvement; instead, I noticed a decrease in speed. I used a V100-32G GPU and ran `minimal.py` on a fine-tuned llama2-13b model.
Here is my requirements information: