turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs

[Question] Wrapper Linear API and 2 bits #589

Open wenhuach21 opened 1 month ago

wenhuach21 commented 1 month ago

Thanks for your great work.

1. Is there an API for packing the linear layer and running inference in the GPTQv2 format, similar to what's provided here: https://github.com/AutoGPTQ/AutoGPTQ/blob/main/auto_gptq/nn_modules/qlinear/qlinear_exllamav2.py?

2. Is 2-bit quantization supported?

turboderp commented 1 month ago
  1. I'm not sure what you mean; GPTQv2 support was added recently. The only difference was whether the qzeros tensor is offset by one or not, and ExLlama now figures that out from the config.json (see the sketch after this list).

  2. 2-bit quantization is supported in EXL2, but there's no kernel yet for 2-bit GPTQ tensors. It's planned, but I have many other things to get to as well.
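Just to illustrate what that offset means, here's a rough sketch of the idea (not ExLlama's actual kernel code; unpacked integer tensors are assumed for simplicity):

def dequantize(qweight, qzeros, scales, gptq_v2: bool):
    # In the original GPTQ checkpoint format the stored zero-points carry a -1
    # offset that the kernel has to add back; GPTQv2 stores them unmodified.
    zeros = qzeros if gptq_v2 else qzeros + 1
    return (qweight - zeros) * scales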

wenhuach21 commented 1 month ago

Thank you for your quick response.

1. I'm currently working on an INT4 algorithm and need to export the model to another format due to specific requirements. We plan to use your repo as the default CUDA backend. Could you let me know if there is an interface available to replace the original Linear layer with your INT4 layer? I'm not familiar with the kernel part.

This is our repo: https://github.com/intel/auto-round

turboderp commented 4 weeks ago

I have plans to create a torch.nn module for EXL2 linear layers, but I'm so busy with tensor parallel inference at the moment I'm not sure I'll get to it for at least a little while.

In the meantime you could look at this, which is the AutoGPTQ implementation of a GPTQ(v2) Linear module using the ExLlamaV2 kernel.

If you wanted to support the EXL2 format rather than GPTQ, note that it's symmetric only, uses quantized scales, and uses variable bitrates within each tensor (essentially by slicing it into rows and providing a variable number of 8-bit, 6-bit, 5-bit etc. rows, in that order, with the rows sorted by activation order to place the more salient weights on top).
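Roughly, the row grouping works like this (a purely illustrative sketch of the principle, not the actual on-disk layout; the function and its arguments are made up for the example):

import torch

def assign_row_bitrates(saliency, row_counts):
    # saliency: one activation-based score per row of the weight tensor
    # row_counts: how many rows to keep at each bit-width, e.g. {8: 16, 6: 64, 5: 432}
    order = torch.argsort(saliency, descending=True)  # most salient rows first
    bits_per_row = torch.empty_like(order)
    start = 0
    for bits, count in sorted(row_counts.items(), reverse=True):  # 8-bit rows first, then 6, 5, ...
        bits_per_row[order[start:start + count]] = bits
        start += count
    return bits_per_row  # chosen bit-width for each original row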

Both the GPTQ and EXL2 implementations use an unmanaged object (QMatrix) to store shapes and pointers for the weights. This reduces Python/pybind overhead and makes the matrix easily accessible from other C++ portions of ExLlama, but it probably isn't too relevant for a torch.nn implementation, and it would lead to slight memory leaks if the layers aren't explicitly unloaded before being garbage-collected.

Either way the interface for the extension is just:

import torch

def gemm(
    x: torch.Tensor,  # Input tensor, FP16, contiguous
    q_handle: int,  # uintptr_t to QMatrix
    q4_width: int,  # out_features
    force_cuda: bool,  # Optionally disable the reconstruct/cuBLAS path for large inputs
):
    # Final shape of output tensor
    output_shape = x.shape[:-1] + (q4_width,)

    # Flatten input tensor to matrix
    x = x.view(-1, x.shape[-1])

    # Prepare empty tensor for result
    output = torch.empty((x.shape[0], q4_width), dtype=torch.half, device=x.device)

    # Call the extension function
    gemm_half_q_half(x, q_handle, output, force_cuda)

    # Restore output dimensions
    return output.view(output_shape)
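A torch.nn wrapper around that could then be as thin as something like the sketch below (illustrative only; how the QMatrix handle is created from the packed tensors isn't shown, and the class name is made up):

import torch
import torch.nn as nn

class ExlQuantLinear(nn.Module):
    # Assumes a QMatrix has already been created through the extension and
    # that q_handle is its uintptr_t, as in the gemm() helper above.
    def __init__(self, q_handle: int, out_features: int):
        super().__init__()
        self.q_handle = q_handle
        self.out_features = out_features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Input is expected to be FP16 and contiguous, as noted above
        return gemm(x, self.q_handle, self.out_features, force_cuda=False)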

What particular requirements would your format have? Is it the GPTQ tensor format, or does it deviate from it somehow?

wenhuach21 commented 3 weeks ago

Thank you for your detailed reply!

Yes, we need a similar torch.nn module for EXL2 linear layers, which will make integration easier.

AutoGPTQ should already support asymmetric quantization, and symmetric quantization performs poorly at 2 bits.

Our format is built on GPTQ's but removes the ±1 offset on qzeros. We also use different configurations to support mixed precision and a broader range of devices; however, this should not place any additional requirements on the kernel side.