huggingface / candle

Minimalist ML framework for Rust
Apache License 2.0

Use the faster rms-norm kernel for llama. #2107

Closed by LaurentMazare 3 weeks ago

LaurentMazare commented 3 weeks ago

The llama example was still using the very slow rms-norm variant; this switches it (and all the other models) to the faster kernel. On an H100 with flash-attn enabled, generation speed for llama-v3 8b goes from 74.4 token/s to 88.4 token/s.
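For context, a minimal sketch of the math that an rms-norm kernel computes, written as plain Rust over a single row. The slow path builds this out of several separate tensor ops (square, mean, rsqrt, two multiplies), each a round trip through memory; a fused kernel does it in one pass, which is where the speedup comes from. The function name, `eps`, and the weight vector `w` here are illustrative assumptions, not candle's actual API:

```rust
// Reference (non-fused) RMSNorm over one row:
//   y_i = x_i / sqrt(mean(x^2) + eps) * w_i
// A fused kernel computes the same thing in a single pass over x.
fn rms_norm(x: &[f32], w: &[f32], eps: f32) -> Vec<f32> {
    // Mean of squares across the row.
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    // Single scalar scale shared by every element of the row.
    let scale = (mean_sq + eps).sqrt().recip();
    // Scale and apply the learned per-channel weight.
    x.iter().zip(w).map(|(v, g)| v * scale * g).collect()
}

fn main() {
    let x = [1.0_f32, 2.0, 3.0, 4.0];
    let w = [1.0_f32; 4]; // identity weights for the example
    let y = rms_norm(&x, &w, 1e-5);
    println!("{:?}", y);
}
```

After normalization with identity weights, the sum of squares of the output is approximately the row length, which is a quick sanity check on any rms-norm implementation.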