Everything after a barrier waits on everything before it, so we don't need to synchronize on anything that happened before a barrier. This change implements that.
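To illustrate the idea, here is a minimal Python sketch of the bookkeeping. It is not the actual MLX code: `BarrierTracker`, `dispatch`, and `unsynced_outputs` are hypothetical names, and buffers are modeled as plain strings. The point it shows is that once a barrier is inserted, the set of writes that still need synchronization can be cleared.

```python
class BarrierTracker:
    """Hypothetical model of an encoder that inserts barriers lazily."""

    def __init__(self):
        # Buffers written since the last barrier (illustrative name).
        self.unsynced_outputs = set()
        self.barriers = 0

    def dispatch(self, inputs, outputs):
        # A barrier is only needed if this op reads something that was
        # written after the most recent barrier.
        if self.unsynced_outputs & set(inputs):
            self.barriers += 1
            # Everything dispatched from here on waits on everything
            # before the barrier, so earlier writes need no more tracking.
            self.unsynced_outputs.clear()
        self.unsynced_outputs.update(outputs)


tracker = BarrierTracker()
tracker.dispatch(inputs=["x"], outputs=["a"])
tracker.dispatch(inputs=["x"], outputs=["b"])
tracker.dispatch(inputs=["a"], outputs=["c"])  # reads "a": barrier inserted
tracker.dispatch(inputs=["b"], outputs=["d"])  # "b" predates the barrier
print(tracker.barriers)  # 1
```

If the set were not cleared, the last dispatch would trigger a second, redundant barrier because "b" would still be treated as unsynchronized.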
Some benchmarks on M2 Ultra:
| Bench | Pre | Post |
| --- | --- | --- |
| 4-bit Mistral 7B generating 512 toks | 122.0 | 123.8 |
| 4-bit Llama 1B generating 512 toks | 420.2 | 431.3 |
| Transformer training | 6.237 | 6.265 |
On LeNet and MNIST no change observed.
With this change we can hit > 130 toks/sec for 4-bit Mistral 7B by increasing ops per buffer:
MLX_MAX_OPS_PER_BUFFER=80 mlx_lm.generate --model mlx-community/Mistral-7B-Instruct-v0.3-4bit --prompt "Write a story about Einstein" --temp 0.0 --max-tokens 512
Pre: 125.939 tokens-per-sec
Post: 130.900 tokens-per-sec
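For reference, the relative changes implied by the reported numbers can be computed with a short script. This is only arithmetic on the figures in this description. The row labels are shorthand, and it assumes higher is better for every row, which the description does not state for the transformer training benchmark.

```python
# (pre, post) pairs copied from the numbers reported in this description.
results = {
    "4-bit Mistral 7B": (122.0, 123.8),
    "4-bit Llama 1B": (420.2, 431.3),
    "Transformer training": (6.237, 6.265),  # unit not stated in the source
    "4-bit Mistral 7B, MLX_MAX_OPS_PER_BUFFER=80": (125.939, 130.900),
}

# Relative change in percent, rounded to one decimal place.
changes = {
    name: round(100.0 * (post - pre) / pre, 1)
    for name, (pre, post) in results.items()
}

for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```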