A high-performance inference system for large language models, designed for production environments.
fix: fix weight load issue for fused qkv and added more unittests for weight loading #213
Closed
guocuimi closed 1 month ago
Replace `repeat` with `repeat_interleave` to repeat kv weights along the kv_head dim for scenarios where n_kv_heads < world_size.
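
To illustrate why the choice of repeat function matters here, the following is a minimal dependency-free sketch, not the code from this PR. It models kv heads as string labels instead of weight tensors. The helpers `tile`, `interleave`, and `shard` are hypothetical names: `tile` mimics how `Tensor.repeat` copies the whole sequence of heads back to back, while `interleave` mimics how `repeat_interleave` repeats each head consecutively. The sketch also assumes the usual grouped-query layout in which contiguous query heads share one kv head, and contiguous sharding across ranks.

```python
# Hypothetical sketch: kv heads are modeled as labels, not real weight tensors.

def tile(heads, n):
    """Mimics Tensor.repeat along the head dim: [a, b] -> [a, b, a, b]."""
    return heads * n

def interleave(heads, n):
    """Mimics repeat_interleave along the head dim: [a, b] -> [a, a, b, b]."""
    return [h for h in heads for _ in range(n)]

def shard(heads, world_size):
    """Split the head list into equal contiguous chunks, one per rank."""
    per_rank = len(heads) // world_size
    return [heads[r * per_rank:(r + 1) * per_rank] for r in range(world_size)]

# Assumed example sizes: fewer kv heads than ranks.
n_q_heads, n_kv_heads, world_size = 8, 2, 4
kv_heads = [f"kv{i}" for i in range(n_kv_heads)]
n_repeats = world_size // n_kv_heads  # copies needed so every rank gets a kv head

tiled_shards = shard(tile(kv_heads, n_repeats), world_size)
interleaved_shards = shard(interleave(kv_heads, n_repeats), world_size)

# The kv head each rank needs, assuming contiguous query heads share a kv head
# and each rank owns a contiguous block of query heads.
q_per_rank = n_q_heads // world_size
q_per_kv = n_q_heads // n_kv_heads
expected = [[f"kv{(r * q_per_rank) // q_per_kv}"] for r in range(world_size)]

print("tiled:      ", tiled_shards)
print("interleaved:", interleaved_shards)
print("expected:   ", expected)
```

Under these assumptions, only the interleaved layout gives each rank the kv head that its local query heads attend to; the tiled layout hands ranks 1 and 2 the wrong head.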