ml-explore / mlx

MLX: An array framework for Apple silicon
https://ml-explore.github.io/mlx/
MIT License
15.01k stars 856 forks

Block sparse qmm #1124

Closed: angeloskath closed 2 weeks ago

angeloskath commented 2 weeks ago

The block_sparse_mm equivalent for quantized matmuls. A big diff but mostly boilerplate code and refactoring.

angeloskath commented 2 weeks ago

@awni and @jagrit06 I made two changes that are not directly related to block_sparse_qmm but to quantization in general, so I'd appreciate a review. Also, if you think they should go in a different PR (or maybe just the 2nd change), let me know and I can merge this without these commits and open another PR for them.

The changes are the following:

  1. quantize and dequantize now support arbitrary shapes instead of only 2D matrices
  2. I added a to_quantized method to Linear and Embedding and changed nn.quantize to check for modules that implement to_quantized. This allows arbitrary modules to support quantization through nn.quantize without MLX knowing anything about them.
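
To illustrate the second change, here is a minimal pure-Python sketch of the duck-typed dispatch it describes. It is not the MLX implementation: `Module`, `Linear`, `QuantizedLinear`, `CustomProjection`, `Model`, and this `quantize` function are hypothetical stand-ins, and the default `group_size` and `bits` values are assumptions. The only idea taken from the comment above is that the quantization pass looks for a `to_quantized` method instead of checking for specific classes.

```python
# Hypothetical stand-ins for illustration only; these are NOT the MLX classes.
class Module:
    def children(self):
        """Return (name, child) pairs for attributes that are Modules."""
        return [(k, v) for k, v in vars(self).items() if isinstance(v, Module)]


class QuantizedLinear(Module):
    def __init__(self, in_dims, out_dims, group_size, bits):
        self.in_dims, self.out_dims = in_dims, out_dims
        self.group_size, self.bits = group_size, bits


class Linear(Module):
    def __init__(self, in_dims, out_dims):
        self.in_dims, self.out_dims = in_dims, out_dims

    def to_quantized(self, group_size=64, bits=4):
        # The layer itself decides what its quantized counterpart is.
        return QuantizedLinear(self.in_dims, self.out_dims, group_size, bits)


class ReLU(Module):
    """No to_quantized method, so the pass leaves it untouched."""


class QuantizedCustomProjection(Module):
    def __init__(self, group_size, bits):
        self.group_size, self.bits = group_size, bits


class CustomProjection(Module):
    """A user-defined layer the framework knows nothing about."""

    def to_quantized(self, group_size=64, bits=4):
        return QuantizedCustomProjection(group_size, bits)


class Model(Module):
    def __init__(self):
        self.proj = Linear(128, 256)
        self.act = ReLU()
        self.custom = CustomProjection()


def quantize(module, group_size=64, bits=4):
    """Replace, in place, every submodule that implements to_quantized."""
    for name, child in module.children():
        if hasattr(child, "to_quantized"):
            setattr(module, name, child.to_quantized(group_size, bits))
        else:
            # Recurse into containers that are not quantizable themselves.
            quantize(child, group_size, bits)
    return module
```

With this shape, calling `quantize(Model())` swaps `proj` and `custom` for their quantized counterparts and leaves `act` alone, without the pass naming `Linear` or `CustomProjection` anywhere.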