ml-explore / mlx

MLX: An array framework for Apple silicon
https://ml-explore.github.io/mlx/
MIT License

Split encoders in non-concurrent context with a max ops per encoder #1085

Closed · awni closed this 1 week ago

awni commented 1 week ago

Speeds up generation, and training slightly. Some benchmarks:

Benchmarks on an M2 Ultra

LLM generation

python -m mlx_lm.generate --model mlx-community/NeuralBeagle14-7B-4bit-mlx --prompt "Write a story about Albert Einstein" --temp 0.0 --max-tokens 256

Pre: Prompt: 222.239 tokens-per-sec Generation: 107.239 tokens-per-sec

Post: Prompt: 223.145 tokens-per-sec Generation: 108.463 tokens-per-sec

MNIST

Pre: Test accuracy 0.936, Time 0.632 (s)
Post: Test accuracy 0.927, Time 0.625 (s)

LeNet

Pre: Test accuracy 0.981, Time 2.797 (s)
Post: Test accuracy 0.975, Time 2.757 (s)

ResNet

Pre: Throughput: 6462.77 samples/second, Peak memory 1.663 (GB)
Post: Throughput: 6474.81 samples/second, Peak memory 1.663 (GB)

Transformer training:

python main.py --gpu

Pre: Iter 40: Train loss 7.864, It/sec 5.881, Peak memory 5.534 (GB)
Post: Iter 40: Train loss 7.814, It/sec 5.902, Peak memory 5.534 (GB)

awni commented 1 week ago

The generation speed up is pretty nice. Running with a bigger command buffer (100 ops) gives even more speedup:

Generation: 114.676 tokens-per-sec

The main downside is introducing the additional dispatch* methods so that we can track dispatches on the command encoder. I'm not wedded to it.
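
For readers following along, here is a rough, self-contained sketch of the idea being discussed: route kernel dispatches through wrapper dispatch* methods that count ops, and once a configurable maximum is reached, end the current encoder and start a fresh one on the same command buffer. This is not MLX's actual implementation; the `SplittingEncoder`, `FakeEncoder`, and `FakeCommandBuffer` types below are illustrative stand-ins for the real Metal objects, and the limit value is just an example.

```cpp
// Hedged sketch, not MLX code: split encoders after a max number of dispatches.
#include <cstdio>
#include <memory>

// Stand-in for MTL::ComputeCommandEncoder.
struct FakeEncoder {
  void dispatchThreadgroups() { /* would encode one kernel dispatch */ }
  void endEncoding() { std::printf("encoder ended\n"); }
};

// Stand-in for MTL::CommandBuffer.
struct FakeCommandBuffer {
  std::unique_ptr<FakeEncoder> computeCommandEncoder() {
    std::printf("new encoder started\n");
    return std::make_unique<FakeEncoder>();
  }
};

// Hypothetical wrapper: counts dispatches and splits encoders at a max.
class SplittingEncoder {
 public:
  SplittingEncoder(FakeCommandBuffer& cb, int max_ops)
      : cb_(cb), max_ops_(max_ops), enc_(cb.computeCommandEncoder()) {}

  // A "dispatch*" style method as mentioned above: forwards to the
  // underlying encoder and bumps the op count.
  void dispatch_threadgroups() {
    maybe_split();
    enc_->dispatchThreadgroups();
    ++num_ops_;
  }

  void end() { enc_->endEncoding(); }

 private:
  void maybe_split() {
    if (num_ops_ >= max_ops_) {
      // End the current encoder and open a new one on the same buffer.
      enc_->endEncoding();
      enc_ = cb_.computeCommandEncoder();
      num_ops_ = 0;
    }
  }

  FakeCommandBuffer& cb_;
  int max_ops_;
  int num_ops_ = 0;
  std::unique_ptr<FakeEncoder> enc_;
};

int main() {
  FakeCommandBuffer cb;
  SplittingEncoder enc(cb, /*max_ops=*/10);  // e.g. 100 as tried above
  for (int i = 0; i < 25; ++i) {
    enc.dispatch_threadgroups();
  }
  enc.end();
}
```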