Thanks for the request @HaiShaw, this is a next step to tackle!
We have an example model checkpoint here that I made with a one-off script https://huggingface.co/nm-testing/Meta-Llama-3-8B-Instruct-FP8-KV
Specifically, you can see that in the checkpoint we store a kv_scale tensor for each attention module.
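If anyone wants to verify this, here is a minimal sketch that lists the kv_scale tensors in that checkpoint. The shard filename below is an assumption for illustration; check the model's safetensors index for the actual file names.

```python
# Sketch: list the per-attention-module kv_scale tensors stored in the checkpoint.
from huggingface_hub import hf_hub_download
from safetensors import safe_open

# NOTE: shard filename is a guess; look it up in model.safetensors.index.json
# for the real checkpoint layout.
path = hf_hub_download(
    "nm-testing/Meta-Llama-3-8B-Instruct-FP8-KV",
    "model-00001-of-00002.safetensors",
)
with safe_open(path, framework="pt") as f:
    for name in f.keys():
        if name.endswith("kv_scale"):
            print(name, tuple(f.get_tensor(name).shape))
```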
@mgoin, great to hear that you are ready for this. Thanks for the details!
@mgoin Thanks! A quick question: is it possible to reproduce neuralmagic/Meta-Llama-3-70B-Instruct-FP8 by applying the offline static quantization method (model.quantize(examples)) to Meta-Llama-3-70B-Instruct? Additionally, can't wait to see more details about kvcache quantization.
@zitgit Yes you can reproduce that 70B model by following the dataset example and replacing the model with whatever you'd like https://github.com/neuralmagic/AutoFP8/blob/147fa4d9e1a90ef8a93f96fc7d9c33056ddc017a/example_dataset.py
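Roughly, the flow in that example looks like the sketch below. The API names (AutoFP8ForCausalLM, BaseQuantizeConfig) are taken from the linked AutoFP8 example and may change between versions; the calibration prompts here are just placeholders for the chat dataset used in example_dataset.py.

```python
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Placeholder calibration prompts; the linked example uses a real chat dataset.
examples = tokenizer(
    ["Tell me about FP8 quantization."] * 16,
    padding=True, truncation=True, return_tensors="pt",
).to("cuda")

quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")
model = AutoFP8ForCausalLM.from_pretrained(model_id, quantize_config=quantize_config)
model.quantize(examples)  # calibrate static activation scales on the examples
model.save_quantized("Meta-Llama-3-70B-Instruct-FP8")
```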
We are working on kvcache quantization in AutoFP8 here https://github.com/neuralmagic/AutoFP8/pull/17
@mgoin I really appreciate your reply! However, I have trouble quantizing Llama3-70B, which requires much more memory to process and save per tensor. Is it possible to quantize part of the model at a time and finally merge the safetensors, using the ignored_layers parameter? Many thanks!
@zitgit How much memory are you seeing used? As of current main, it should only require peak memory equivalent to loading the model in original precision (~140GB) as we immediately quantize the weights and then begin calibration of the activations.
@mgoin It works on current main! I noticed that "del linear.weight" is necessary in the quantize_weights() function. Thank you a lot!! I'd also like to ask another question: I noticed the comment on kvcache quantization says some arguments need to match the representation in vllm. Do both w8a8 and kvcache fp8 inference strongly rely on vllm, or can I use other engines?
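For context on why that del matters, here is a minimal, illustrative version of per-tensor FP8 weight quantization (not the actual AutoFP8 quantize_weights() code): freeing the high-precision weight as soon as its FP8 replacement exists keeps peak memory near the size of the original model.

```python
# Illustrative sketch; requires a PyTorch build with float8 support.
import torch

def quantize_weight_fp8(linear: torch.nn.Linear) -> None:
    finfo = torch.finfo(torch.float8_e4m3fn)
    weight = linear.weight.data
    scale = weight.abs().max().clamp(min=1e-12) / finfo.max
    qweight = (weight / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    del linear.weight                      # release the fp16/bf16 weight immediately
    linear.register_buffer("weight", qweight)
    linear.register_buffer("weight_scale", scale)
```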
@zitgit We are focused on format and performance in vLLM since that is the best open-source inference server with full support for FP8. AFAIK the only other option is TRT-LLM, and it has a custom format that isn't really HF transformers-compatible, which is also what we are going for here format-wise.
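For completeness, loading one of these checkpoints in vLLM looks roughly like the sketch below. The kv_cache_dtype="fp8" argument is only useful when the checkpoint carries KV-cache scales, and exact flag support depends on your vLLM version.

```python
from vllm import LLM, SamplingParams

# FP8 weights are picked up from the checkpoint's quantization config;
# kv_cache_dtype="fp8" additionally enables the FP8 KV cache path.
# For a 70B model, also pass tensor_parallel_size=... to shard across GPUs.
llm = LLM(model="neuralmagic/Meta-Llama-3-70B-Instruct-FP8", kv_cache_dtype="fp8")
out = llm.generate(["What is FP8?"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```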
I am going to close this issue since fp8 kv cache is now supported (which was the original issue). Please open a new issue if you'd like to continue conversation, thanks!
@mgoin, it would be nice if you could update the screenshot above - I think input_scale is used in place of act_scale. If possible, can you also show output_scale in it?
Add quantization support for the KV cache, with the scales stored in the state dict. Static scaling (as with activations) is needed for performance; dynamic scaling can be added for completeness.
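As an illustration of the static vs. dynamic distinction (not vLLM's actual implementation): a static scale is calibrated once offline and stored as kv_scale in the state dict, while a dynamic scale would have to be recomputed from each KV tensor at runtime.

```python
# Illustrative sketch only; not vLLM or AutoFP8 code.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def dynamic_scale(kv: torch.Tensor) -> torch.Tensor:
    # Recomputed every step from the current tensor (extra reduction at runtime).
    return kv.abs().max().clamp(min=1e-12) / FP8_MAX

def static_scale(calibration_kv: list[torch.Tensor]) -> torch.Tensor:
    # Computed once offline over calibration data and stored as kv_scale,
    # so inference avoids the per-step reduction.
    return torch.stack([t.abs().max() for t in calibration_kv]).max().clamp(min=1e-12) / FP8_MAX
```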