liuliu / s4nnc

Swift for NNC
https://libnnc.org
BSD 3-Clause "New" or "Revised" License

Quantizing q8p #24

Closed: ghost closed this issue 1 month ago

ghost commented 2 months ago

I was quantizing weights using:

graph.openStore(
  full_f16_path, flags: .truncateWhenClose
) { store in
  // Copy every tensor from the full-precision store into a new store, quantized with q8p.
  let keys = store.keys
  graph.openStore(
    f8_path,
    flags: .truncateWhenClose
  ) {
    for key in keys {
      guard let tensor = store.read(key) else { continue }

      print("quantizing \(key) \(tensor)")

      // Write into the destination store with the q8p (8-bit palette) codec.
      $0.write(key, tensor: tensor, codec: [.q8p])
    }
  }
}

but it looks like some params with fewer elements are not being quantized, like layers with 320 params. Did you add a check or something? Where is it?

liuliu commented 2 months ago

That's correct. We check whether the quantized size will be smaller than the original size, and only quantize if it is. q8p uses a look-up table of 256 elements (in FP16), so anything with fewer than 256 + 128 elements will not benefit from quantization: https://github.com/liuliu/s4nnc/blob/main/nnc/Store.swift#L1451
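
For illustration, a minimal sketch of that rule (not the actual Store.swift check), assuming an FP16 source tensor:

// A q8p payload is a 256-entry FP16 look-up table plus one index byte per element.
// Quantization only pays off when that is smaller than the raw FP16 payload.
func q8pIsSmaller(elementCount: Int) -> Bool {
  let originalBytes = elementCount * 2        // raw FP16 payload: 2 bytes per element
  let quantizedBytes = 256 * 2 + elementCount // 256-entry FP16 LUT plus 1-byte indices
  return quantizedBytes < originalBytes
}

For a 320-element tensor that gives 832 vs. 640 bytes, so it stays un-quantized.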

Also, there are limitations in the kernels (for example, our layer norm / rms norm kernels on CUDA don't support quantized tensors for scale / bias, and our GEMM / convolution kernels on both CUDA and Metal don't support quantized bias), so it is best to quantize only the weight matrices; see the code here: https://github.com/drawthingsai/draw-things-community/blob/main/Apps/ModelQuantizer/Quantizer.swift#L34
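
A sketch of that policy applied to the loop from the original post, assuming keys ending in "-bias" mark bias tensors and keys containing "norm" mark layer norm / rms norm parameters (naming conventions may differ for your model):

for key in keys {
  guard let tensor = store.read(key) else { continue }
  if key.hasSuffix("-bias") || key.contains("norm") {
    // Leave biases and norm scale / bias un-quantized for kernel compatibility.
    $0.write(key, tensor: tensor, codec: [])
  } else {
    // Only weight matrices get q8p.
    $0.write(key, tensor: tensor, codec: [.q8p])
  }
}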

ghost commented 2 months ago

Thanks a lot

ghost commented 2 months ago

Actually, it only happens for CUDA. For MPS, it is still quantizing the layers with 320 params.

ghost commented 2 months ago

@liuliu I am trying to disable it so that even small tensors get quantized. But after removing the guard, it gives an illegal memory access on CUDA.

liuliu commented 2 months ago

> Actually, it only happens for CUDA. For MPS, it is still quantizing the layers with 320 params.

Not certain. One quirk of our quantization is that, depending on the original tensor, the LUT can be either FP32 or FP16. That's why in production code we first convert to Float16 and then save the quantization result: write("key", tensor: Tensor<Float16>(from: originalTensor), codec: [.q8p, .ezm7]).
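
Applied to the loop from the original post, a sketch of that approach would look like this (assuming every tensor in the source store can be converted to Float16):

for key in keys {
  guard let tensor = store.read(key) else { continue }
  // Convert to Float16 first so the q8p LUT is always FP16, then write with q8p + ezm7.
  $0.write(key, tensor: Tensor<Float16>(from: tensor), codec: [.q8p, .ezm7])
}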

> I am trying to disable it so that even small tensors get quantized. But after removing the guard, it gives an illegal memory access on CUDA.

More would need to change if you deliberately want the quantized tensor to be bigger than the un-quantized one, because we always assume the data size of the payload will be at most the un-quantized tensor size (i.e. number of elements * sizeof(float or float16)). If it is bigger, some allocations are not large enough, and that causes memory overflow errors elsewhere.