Maybe a similar reason as #8758.
starcoder's attention layer has two more memory copies than llama's, and the overhead of these copies grows rapidly as the context length increases.
As a result, starcoder's 2nd+ token latency is about 5x at context length = 1800, while llama's 2nd+ token latency is only about 1.5x at the same context length.
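For illustration only (a hedged sketch, not the actual starcoder or BigDL code): this kind of overhead typically comes from growing the KV cache by concatenation, which reallocates and copies the whole cache on every decoding step, so the per-token copy cost scales with the current context length:

```python
import torch

# Naive KV-cache append: torch.cat allocates a new tensor and copies
# all seq_len existing entries plus the new one, so each decoding step
# pays a copy proportional to the current context length.
def append_kv_naive(past_k: torch.Tensor, new_k: torch.Tensor) -> torch.Tensor:
    # past_k: (batch, heads, seq_len, head_dim)
    # new_k:  (batch, heads, 1, head_dim)
    return torch.cat([past_k, new_k], dim=2)
```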
I have made a small improvement to starcoder's memory copies; with it, the 2nd+ token latency drops to about 3x at context length = 1800.
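A minimal sketch of that kind of change, assuming a preallocated cache (the class name and shapes below are hypothetical, not the patch itself):

```python
import torch

# Preallocate the cache once up to max_seq_len and write each new
# token's K/V in place, so a step copies one slot instead of the
# whole cache.
class PreallocatedKVCache:
    def __init__(self, batch, heads, max_seq_len, head_dim, dtype=torch.float32):
        self.k = torch.empty(batch, heads, max_seq_len, head_dim, dtype=dtype)
        self.v = torch.empty(batch, heads, max_seq_len, head_dim, dtype=dtype)
        self.len = 0  # number of tokens currently cached

    def append(self, new_k, new_v):
        # new_k / new_v: (batch, heads, 1, head_dim) -- one-slot copy only.
        self.k[:, :, self.len : self.len + 1, :] = new_k
        self.v[:, :, self.len : self.len + 1, :] = new_v
        self.len += 1

    def view(self):
        # Valid prefix to feed into attention.
        return self.k[:, :, : self.len, :], self.v[:, :, : self.len, :]
```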
Here is a brief performance table with this improvement:

| history length | 2nd+ latency (ms/token) |
|---|---|
| 142 | 183 |
| 661 | 241 |
| 1254 | 364 |
| 1885 | 532 |
Thanks
OS: Windows 11; CPU: Intel Core i9-13900H; BigDL 0815
Long 2nd+ token average latency starting from the 2nd chat round.
Attached code: