This is really an issue in MLX core with setting streams on compiled functions. But it is not trivial to fix there and will require a new release, so patching it here as well:
```
mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --prompt - -m 256 < prompt20.txt
```
Pre:

```
Prompt: 1929 tokens, 767.674 tokens-per-sec
Generation: 256 tokens, 41.312 tokens-per-sec
Peak memory: 5.082 GB
```
Post:

```
Prompt: 1929 tokens, 775.695 tokens-per-sec
Generation: 256 tokens, 72.733 tokens-per-sec
Peak memory: 5.197 GB
```
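For context, the failure mode looks roughly like the sketch below: a stream set via `mx.stream(...)` around a call to a compiled function may not be respected by the compiled graph. This is a minimal illustration of the pattern, not the patch in this PR; `generation_stream`, `step_outer`, `step_inner`, and the idea of entering the stream context inside the traced function are assumptions made for the example.

```python
import mlx.core as mx

# Hypothetical dedicated stream for generation work (assumed for this sketch).
generation_stream = mx.new_stream(mx.default_device())

@mx.compile
def step_outer(x):
    return x * 2 + 1

def run_outer(x):
    # The MLX core bug: a stream set around the call to a compiled
    # function may be ignored by the compiled graph.
    with mx.stream(generation_stream):
        return step_outer(x)

# Workaround sketch (assumption): enter the stream context inside the
# function body so the stream is captured when the function is traced.
@mx.compile
def step_inner(x):
    with mx.stream(generation_stream):
        return x * 2 + 1

x = mx.ones((4,))
mx.eval(run_outer(x), step_inner(x))
```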