Use fewer barriers

Everything after a barrier waits on everything before it, so we don't need to synchronize again on anything that happened before a barrier. This change implements that.
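To illustrate the idea, here is a toy model in Python. It is only a sketch, not the actual Metal backend code: the function `count_barriers`, the `forget_on_barrier` flag, and the per-input baseline are all hypothetical and exist just to show why forgetting everything written before a barrier can reduce the barrier count.

```python
# Toy model of barrier insertion (hypothetical; not MLX's implementation).
# Each op is (inputs, output). A barrier is needed when an op reads a buffer
# that was written since the last barrier known to cover it.

def count_barriers(ops, forget_on_barrier):
    pending = set()  # outputs written but not yet covered by a barrier
    barriers = 0
    for inputs, output in ops:
        if any(name in pending for name in inputs):
            barriers += 1
            if forget_on_barrier:
                # A barrier orders everything before it, so nothing
                # written earlier needs tracking any more.
                pending.clear()
            else:
                # Assumed baseline: only forget the inputs we waited on.
                pending.difference_update(inputs)
        pending.add(output)
    return barriers


# a and b are produced independently, then each is consumed once.
ops = [((), "a"), ((), "b"), (("a",), "c"), (("b",), "d")]

baseline = count_barriers(ops, forget_on_barrier=False)
optimized = count_barriers(ops, forget_on_barrier=True)
print(baseline, optimized)
```

In this example the barrier placed before the op that reads `a` also covers the earlier write to `b`, so the op that reads `b` does not need a second barrier.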

Some benchmarks on an M2 Ultra (the generation rows are in tokens-per-sec):

Bench	Pre	Post
4-bit Mistral 7B generating 512 toks	122.0	123.8
4-bit Llama 1B generating 512 toks	420.2	431.3
Transformer training	6.237	6.265

On LeNet and MNIST, no change was observed.

With this change we can hit > 130 toks/sec for 4-bit Mistral 7B by increasing ops per buffer:

MLX_MAX_OPS_PER_BUFFER=80 mlx_lm.generate --model mlx-community/Mistral-7B-Instruct-v0.3-4bit --prompt "Write a story about Einstein" --temp 0.0 --max-tokens 512

Pre: 125.939 tokens-per-sec
Post: 130.900 tokens-per-sec
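For a quick sense of scale, the relative changes can be computed from the numbers reported above. This is just arithmetic on those figures; the helper name `relative_change_percent` is made up for illustration, and the unit of the transformer training row is not stated here, so whether a larger value is better for that row is an assumption left to the reader.

```python
# Relative change between the Pre and Post numbers reported above.
# For the tokens-per-sec rows, a positive change means faster generation.

def relative_change_percent(pre, post):
    return (post - pre) / pre * 100.0


results = {
    "4-bit Mistral 7B, 512 tokens": (122.0, 123.8),
    "4-bit Llama 1B, 512 tokens": (420.2, 431.3),
    "Transformer training": (6.237, 6.265),
    "4-bit Mistral 7B, MLX_MAX_OPS_PER_BUFFER=80": (125.939, 130.900),
}

changes = {
    name: relative_change_percent(pre, post)
    for name, (pre, post) in results.items()
}

for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```

The changes are a few percent at most, with the largest coming from the run with more ops per buffer.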

ml-explore / mlx

Use fewer barriers #1561