ml-explore / mlx

MLX: An array framework for Apple silicon
https://ml-explore.github.io/mlx/
MIT License

Fix OOB access in qmv #1577

Closed. barronalex closed this 2 weeks ago

barronalex commented 2 weeks ago

An out-of-bounds access in qmv was breaking quantized generation with SmolLM2-135M-Instruct, specifically with group size 32.

No clear difference in performance before and after (tested with group size 64, which still worked on main).

Before:

Prompt: 38 tokens, 629.657 tokens-per-sec
Generation: 574 tokens, 261.939 tokens-per-sec
Peak memory: 0.164 GB

After:

Prompt: 38 tokens, 630.536 tokens-per-sec
Generation: 574 tokens, 262.050 tokens-per-sec
Peak memory: 0.164 GB
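
To put "no clear difference" in numbers, here is a minimal sketch that computes the relative change in throughput from the two runs above. The dictionary keys and the helper name `relative_change` are made up for illustration; only the figures come from the benchmark output.

```python
# Throughput figures copied from the "Before" and "After" runs above.
before = {"prompt_tps": 629.657, "generation_tps": 261.939}
after = {"prompt_tps": 630.536, "generation_tps": 262.050}


def relative_change(old: float, new: float) -> float:
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100.0


# Percentage change per metric (positive means the "After" run was faster).
changes = {name: relative_change(before[name], after[name]) for name in before}

for name, pct in changes.items():
    print(f"{name}: {pct:+.3f}%")
```

Both changes come out well under 1%, which is consistent with run-to-run noise rather than a real speedup or regression.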

barronalex commented 2 weeks ago

No worries! Totally agree with the comment -- I updated it.

The benchmark uses group size 64 since that worked on main. The qmv size is (192, 576), so it should never go to qmv_fast (also, I don't think group size affects the routing at the moment).
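
A rough sketch of the routing argument, under loudly stated assumptions: the thread does not spell out the dispatch rule, so the function `would_use_fast_path` and its thresholds (output dimension a multiple of 8, input dimension a multiple of 512) are hypothetical stand-ins and not the actual MLX dispatch code. Reading (192, 576) as (output, input) dimensions is also an assumption. The only point illustrated is that a shape-based check like this does not look at group size.

```python
def would_use_fast_path(out_dim: int, in_dim: int,
                        out_multiple: int = 8, in_multiple: int = 512) -> bool:
    """Hypothetical shape-only routing check.

    ASSUMPTION: the thresholds are illustrative, not taken from the MLX source.
    Group size is deliberately not a parameter, mirroring the comment above
    that it does not affect routing.
    """
    return out_dim % out_multiple == 0 and in_dim % in_multiple == 0


# Shape quoted in the comment above, read here as (output dim, input dim).
out_dim, in_dim = 192, 576

uses_fast = would_use_fast_path(out_dim, in_dim)

# Number of quantization groups along one row for each group size discussed.
groups_per_row = {group_size: in_dim // group_size for group_size in (32, 64)}

print("fast path:", uses_fast)
print("groups per row:", groups_per_row)
```

Under these assumed thresholds neither 576 nor 192 is a multiple of 512, so the check fails whichever way the shape is read, and the result is the same for group size 32 and 64.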