ml-explore / mlx-examples


Put prompt processing in same stream #1122

Closed · awni closed this 14 hours ago

awni commented 16 hours ago

This is really an issue in MLX core with how streams are set for compiled functions. It is not trivial to fix there and will require a new release, so patching it here as well:
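A minimal sketch of the idea (illustrative names, not the actual mlx-lm patch): create one stream and run prompt processing under it, so the prefill and the compiled generation step cannot land on different streams:

```python
import mlx.core as mx

# Illustrative sketch: keep prompt processing ("prefill") on the same
# stream that token generation uses, so a compiled step function does
# not end up running on a different stream.
generation_stream = mx.new_stream(mx.default_device())

def prefill(model, prompt):
    # Run the prompt forward pass under the generation stream.
    with mx.stream(generation_stream):
        logits = model(prompt[None])
        mx.eval(logits)  # force evaluation on this stream
    return logits
```

Decoding would then run under the same `mx.stream(generation_stream)` context, which is what avoids the cross-stream interaction with compiled functions.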

mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --prompt - -m 256 < prompt20.txt

Pre:

Prompt: 1929 tokens, 767.674 tokens-per-sec
Generation: 256 tokens, 41.312 tokens-per-sec
Peak memory: 5.082 GB

Post:

Prompt: 1929 tokens, 775.695 tokens-per-sec
Generation: 256 tokens, 72.733 tokens-per-sec
Peak memory: 5.197 GB