Open kewang-xlnx opened 4 days ago
In general we welcome contributions that convert the Quark format to the standardized format of LLM Compressor (https://github.com/vllm-project/llm-compressor). @robertgshaw2-neuralmagic and @mgoin can help provide pointers.
Quark is a comprehensive cross-platform toolkit designed to simplify and enhance the quantization of deep learning models. Supporting both PyTorch and ONNX models, Quark lets developers optimize their models for deployment on a wide range of hardware backends, achieving significant performance gains without compromising accuracy. Currently, the format of the quantized model exported by Quark differs from the formats supported by vLLM, so we need to contribute code to vLLM to add support for the Quark format.
Quark Format
1) Configuration file config.json of the Quark format
2) Key names and data types of the Quark safetensors
3) KV scale format, if the KV cache is used
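As a rough illustration of how a loader might inspect the first two items, here is a minimal sketch using only the Python standard library. It assumes the Hugging Face convention of a `quantization_config` entry with a `quant_method` field in config.json; whether Quark's exported config uses exactly these keys is an assumption. The tensor names and the `weight_scale` / `k_scale` suffixes in the example are hypothetical and are not taken from the Quark specification.

```python
import json
from collections import defaultdict


def read_quant_method(config_text):
    """Return the quantization method named in a config.json string, or None.

    Assumes the Hugging Face style layout:
    {"quantization_config": {"quant_method": "..."}}.
    """
    config = json.loads(config_text)
    quant_cfg = config.get("quantization_config") or {}
    return quant_cfg.get("quant_method")


def group_keys_by_suffix(tensor_names):
    """Group tensor names by their last dotted component (e.g. "weight")."""
    groups = defaultdict(list)
    for name in tensor_names:
        groups[name.rsplit(".", 1)[-1]].append(name)
    return dict(groups)


# Hypothetical inputs, for illustration only.
example_config = json.dumps({"quantization_config": {"quant_method": "quark"}})
example_names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.q_proj.weight_scale",
    "model.layers.0.self_attn.k_proj.weight",
    "model.layers.0.self_attn.k_proj.k_scale",
]

method = read_quant_method(example_config)
groups = group_keys_by_suffix(example_names)
```

Grouping the key names by suffix is one simple way to see which tensors are raw weights and which are scales before deciding how each one maps onto vLLM parameters.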
Design
Add the Quark format to the ROCm/vllm repo by creating a directory for it in vllm/model_executor/layers/quantization and including the following files.
In the first stage, we will integrate FP8 quantization in the Quark format into vLLM. Other Quark formats, such as INT4/INT8 with per_tensor/per_channel/per_group granularity, will be integrated later as needed.
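To make the three granularities concrete, the sketch below computes symmetric quantization scales for a small 2-D weight in plain Python. It is not Quark's or vLLM's implementation. The function names are hypothetical, and the default `qmax` of 448 is the largest finite value of the FP8 E4M3 (e4m3fn) format, which is an assumption because the proposal does not say which FP8 variant is used. For symmetric INT8 or INT4, `qmax` would be 127 or 7.

```python
FP8_E4M3_MAX = 448.0  # largest finite e4m3fn value; the FP8 variant is an assumption


def per_tensor_scale(weight, qmax=FP8_E4M3_MAX):
    """One scale for the whole tensor: max |w| divided by qmax."""
    amax = max(abs(v) for row in weight for v in row)
    return amax / qmax


def per_channel_scales(weight, qmax=FP8_E4M3_MAX):
    """One scale per output channel (here: per row)."""
    return [max(abs(v) for v in row) / qmax for row in weight]


def per_group_scales(weight, group_size, qmax=FP8_E4M3_MAX):
    """One scale per contiguous group of `group_size` values within each row."""
    return [
        [
            max(abs(v) for v in row[i:i + group_size]) / qmax
            for i in range(0, len(row), group_size)
        ]
        for row in weight
    ]


# Small illustrative weight: 2 output channels, 4 input features.
weight = [
    [1.0, -2.0, 4.0, -8.0],
    [0.5, 0.25, -0.125, 1.0],
]

tensor_scale = per_tensor_scale(weight)
channel_scales = per_channel_scales(weight)
group_scales = per_group_scales(weight, group_size=2)
```

Finer granularity stores more scales (1, then one per row, then one per group) in exchange for a tighter fit to the local value range. This is why the exported checkpoint has to record which scheme was used alongside the scale tensors.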