That's correct. We check whether the quantized size will be smaller than the original size, and only quantize if it is. q8p uses a look-up table of 256 elements (of FP16), so anything with fewer than 256 + 128 elements will not benefit from quantization: https://github.com/liuliu/s4nnc/blob/main/nnc/Store.swift#L1451
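A minimal sketch of that break-even check, assuming FP16 originals, a 256-entry FP16 LUT, and one byte per quantized element; the exact constants live in Store.swift and may differ:

```swift
// Sketch only: quantize when the LUT plus the 8-bit payload is smaller
// than the raw FP16 tensor. Constants here are assumptions.
func isWorthQuantizingQ8p(elementCount: Int) -> Bool {
  let originalBytes = elementCount * MemoryLayout<Float16>.size  // 2 bytes per element
  let lutBytes = 256 * MemoryLayout<Float16>.size                // 512-byte palette
  let payloadBytes = elementCount                                 // 1-byte index per element
  return lutBytes + payloadBytes < originalBytes
}
```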
There are also limitations in the kernels (for example, our layer norm / RMS norm kernels on CUDA don't support quantized tensors for scale / bias, and our GEMM / convolution kernels on both CUDA and Metal don't support quantized bias), so make sure to quantize only the weight matrices; see the code here: https://github.com/drawthingsai/draw-things-community/blob/main/Apps/ModelQuantizer/Quantizer.swift#L34
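For illustration, a hypothetical name-based filter in the spirit of Quantizer.swift; the suffix strings here are assumptions, not the repository's actual parameter keys:

```swift
// Hypothetical: skip q8p for parameters the kernels must read unquantized
// (norm scale / bias, GEMM / conv bias); only weight matrices pass through.
func shouldQuantize(parameterKey: String) -> Bool {
  let keepUnquantized = ["bias", "norm.weight", "scale"]  // assumed suffixes
  return !keepUnquantized.contains { parameterKey.hasSuffix($0) }
}
```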
Thanks a lot
Actually, it only happens for CUDA; for MPS, it is still quantizing the layers with 320 params.
@liuliu I am trying to disable it so that even small tensors get quantized. But after removing the guard, it gives an illegal memory access on CUDA.
> Actually, it only happens for CUDA; for MPS, it is still quantizing the layers with 320 params.
Not certain. One quirk of our quantization is that, depending on the original tensor, the LUT can be either FP32 or FP16. That's why in production code we first convert to Float16 and then save the quantized result: `write("key", tensor: Tensor<Float16>(from: originalTensor), codec: [.q8p, .ezm7])`
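Spelled out as a minimal sketch, assuming the s4nnc `openStore` API, an on-disk store path, and an FP32 `originalTensor`; only the `write` line is from the comment above:

```swift
import NNC

let graph = DynamicGraph()
// Convert to Float16 first so the q8p LUT is pinned to FP16, then save
// with palette quantization (.q8p) plus the .ezm7 codec.
graph.openStore("model.ckpt") { store in
  store.write("key", tensor: Tensor<Float16>(from: originalTensor),
              codec: [.q8p, .ezm7])
}
```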
> I am trying to disable it so that even small tensors get quantized. But after removing the guard, it gives an illegal memory access on CUDA.
There will be more to change if you deliberately want the quantized tensor to be bigger than the unquantized one, because we always assume the data size of the payload is at most the unquantized tensor size (i.e. number of elements * sizeof(float or float16)). If it is bigger, some allocations are not large enough, which will cause memory-overflow errors elsewhere.
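To make the overflow concrete with the 320-param layers mentioned above (sizes follow the q8p layout assumed earlier, not the exact Store.swift constants):

```swift
// For a 320-element FP16 tensor, the decode buffer is sized for the raw data:
let elements = 320
let rawBytes = elements * 2        // 640 bytes of FP16: the assumed allocation
// A q8p payload needs the 256-entry FP16 LUT plus 1 byte per element:
let q8pBytes = 256 * 2 + elements  // 832 bytes > 640, so copying it in overflows
print(q8pBytes > rawBytes)         // true: quantized payload exceeds the buffer
```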
I was quantizing weights using:

but it looks like some params with a small number of elements are not being quantized, like layers with 320 params. Did you add a check or something? Where is it?