ml-explore / mlx

MLX: An array framework for Apple silicon
https://ml-explore.github.io/mlx/
MIT License

Use fewer barriers #1561

Closed: awni closed this 2 weeks ago

awni commented 2 weeks ago

Everything after a barrier waits on everything before it, so we don't need to synchronize on anything that happened before a barrier. This change implements that.
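To make the idea concrete, here is a minimal sketch in Python, not the actual MLX code. It assumes a toy model in which each kernel is just a pair of input and output buffer names, and a barrier is needed whenever a kernel reads a buffer written since tracking began. The function `count_barriers` and the flag `clear_on_barrier` are hypothetical names for illustration. The non-clearing variant is only a contrast case and is not necessarily how MLX behaved before this change.

```python
def count_barriers(ops, clear_on_barrier):
    """Count barriers needed for a sequence of (inputs, outputs) kernels.

    Toy model (assumption): only read-after-write hazards are considered.
    """
    pending = set()  # buffers written and still tracked as hazards
    barriers = 0
    for inputs, outputs in ops:
        # A barrier is needed if this kernel reads a tracked write.
        if pending.intersection(inputs):
            barriers += 1
            if clear_on_barrier:
                # Everything dispatched after the barrier waits on everything
                # before it, so earlier writes no longer need tracking.
                pending.clear()
        pending.update(outputs)
    return barriers


# Hypothetical kernel sequence: (inputs, outputs)
ops = [
    (["x"], ["a"]),
    (["a"], ["b"]),       # reads a: barrier in both variants
    (["a"], ["c"]),       # reads a again: already covered by the barrier above
    (["b", "c"], ["d"]),  # reads b and c: barrier in both variants
]

print(count_barriers(ops, clear_on_barrier=False))  # 3
print(count_barriers(ops, clear_on_barrier=True))   # 2
```

In this sketch the third kernel reads `a`, which was written before the first barrier, so forgetting pre-barrier writes saves one barrier without losing any ordering guarantee.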

Some benchmarks on M2 Ultra:

| Bench | Pre | Post |
| --- | --- | --- |
| 4-bit Mistral 7B generating 512 toks | 122.0 | 123.8 |
| 4-bit Llama 1B generating 512 toks | 420.2 | 431.3 |
| Transformer training | 6.237 | 6.265 |

No change was observed on LeNet and MNIST.
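For a sense of scale, the relative change for each benchmark can be computed from the Pre and Post numbers above. This is a small sketch: the helper `relative_change` is a hypothetical name, and it assumes all three rows are throughput-style figures where higher is better (the units for the training row are not stated).

```python
# Pre/Post numbers copied from the M2 Ultra benchmarks above.
results = {
    "4-bit Mistral 7B generating 512 toks": (122.0, 123.8),
    "4-bit Llama 1B generating 512 toks": (420.2, 431.3),
    "Transformer training": (6.237, 6.265),
}


def relative_change(pre, post):
    """Percentage change from pre to post."""
    return (post - pre) / pre * 100.0


for name, (pre, post) in results.items():
    print(f"{name}: {relative_change(pre, post):+.2f}%")
```

Under that assumption the gains are roughly 1.5%, 2.6%, and 0.4% respectively.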

With this change we can hit > 130 toks/sec for 4-bit Mistral 7B by increasing ops per buffer:

MLX_MAX_OPS_PER_BUFFER=80 mlx_lm.generate --model mlx-community/Mistral-7B-Instruct-v0.3-4bit --prompt "Write a story about Einstein" --temp 0.0 --max-tokens 512

Pre: 125.939 tokens-per-sec
Post: 130.900 tokens-per-sec