neuralmagic / AutoFP8

FP8 KV cache support #10

Closed. HaiShaw closed this issue 4 months ago.

HaiShaw commented 5 months ago

Add quantization support for the KV cache, with the scales stored in the state dict. Static scales (as with activations) are needed for performance; dynamic scales can be added for completeness.
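
A rough illustration of the difference (not AutoFP8 code; the 448.0 constant is the maximum representable magnitude of the FP8 E4M3 format):

```python
import torch

FP8_E4M3_MAX = 448.0  # max representable magnitude in float8_e4m3fn

def dynamic_scale(kv: torch.Tensor) -> torch.Tensor:
    # Dynamic: recompute the scale from the live tensor on every forward pass.
    return kv.abs().max().clamp(min=1e-12) / FP8_E4M3_MAX

def static_scale(calibration_kvs: list[torch.Tensor]) -> torch.Tensor:
    # Static: compute the scale once from calibration data and store it in the
    # state dict (e.g. as a kv_scale tensor), so inference only needs a
    # multiply instead of a reduction over the whole tensor each step.
    amax = max(kv.abs().max() for kv in calibration_kvs)
    return (amax / FP8_E4M3_MAX).clamp(min=1e-12)
```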

mgoin commented 5 months ago

Thanks for the request @HaiShaw, this is a next step to tackle!

We have an example model checkpoint here that I made with a one-off script https://huggingface.co/nm-testing/Meta-Llama-3-8B-Instruct-FP8-KV

Specifically, you can see in the checkpoint we store a kv_scale tensor for each attention module

[Screenshot: checkpoint tensor listing showing a kv_scale entry for each attention module]
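
For reference, here is one way to list those scale tensors in the published checkpoint (a sketch using safetensors and huggingface_hub; the shard file name is an assumption based on the usual HF sharded-checkpoint naming):

```python
from huggingface_hub import hf_hub_download
from safetensors import safe_open

# Download one shard of the FP8-KV checkpoint and print its kv_scale entries.
path = hf_hub_download(
    "nm-testing/Meta-Llama-3-8B-Instruct-FP8-KV",
    "model-00001-of-00002.safetensors",  # assumed shard name
)
with safe_open(path, framework="pt") as f:
    for name in f.keys():
        if "kv_scale" in name:
            tensor = f.get_tensor(name)
            print(name, tuple(tensor.shape), tensor.dtype)
```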

HaiShaw commented 5 months ago

@mgoin, great to see that you are already well prepared for this. Thanks for the details!

zitgit commented 4 months ago

@mgoin Thanks! A quick question: is it possible to reproduce neuralmagic/Meta-Llama-3-70B-Instruct-FP8 by applying the offline static quantization method (model.quantize(examples)) to Meta-Llama-3-70B-Instruct? Additionally, I can't wait to see more details about KV cache quantization.

mgoin commented 4 months ago

@zitgit Yes you can reproduce that 70B model by following the dataset example and replacing the model with whatever you'd like https://github.com/neuralmagic/AutoFP8/blob/147fa4d9e1a90ef8a93f96fc7d9c33056ddc017a/example_dataset.py
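Condensed, the flow from that example looks roughly like this (a sketch; the calibration dataset, sample count, and sequence length are illustrative, so check example_dataset.py for the exact settings):

```python
from datasets import load_dataset
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained = "meta-llama/Meta-Llama-3-70B-Instruct"
quantized = "Meta-Llama-3-70B-Instruct-FP8"

tokenizer = AutoTokenizer.from_pretrained(pretrained, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

# Build a small calibration set from a chat dataset (illustrative choice).
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft").select(range(512))
examples = [tokenizer.apply_chat_template(x["messages"], tokenize=False) for x in ds]
examples = tokenizer(
    examples, padding=True, truncation=True, max_length=2048, return_tensors="pt"
).to("cuda")

# Static per-tensor activation scales, as used for the published checkpoints.
quantize_config = BaseQuantizeConfig(quant_method="fp8", activation_scheme="static")

model = AutoFP8ForCausalLM.from_pretrained(pretrained, quantize_config)
model.quantize(examples)
model.save_quantized(quantized)
```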

We are working on kvcache quantization in AutoFP8 here https://github.com/neuralmagic/AutoFP8/pull/17
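
Once that PR lands, the KV cache scales are requested through the quantize config; the relevant knob should be `kv_cache_quant_targets` (treat the exact argument name as an assumption if you are on a different revision):

```python
from auto_fp8 import BaseQuantizeConfig

# Collect scales for the outputs of the k/v projections so a per-layer
# kv_scale tensor ends up in the exported checkpoint.
quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="static",
    kv_cache_quant_targets=("k_proj", "v_proj"),
)
```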

zitgit commented 4 months ago

@mgoin I really appreciate your reply! However, I'm having trouble quantizing Llama3-70B, which requires much more memory to process and save per tensor. Is it possible to quantize part of the model at a time and then merge the safetensors at the end using the ignored_layers parameter? Many thanks!

mgoin commented 4 months ago

@zitgit How much memory are you seeing used? As of current main, it should only require peak memory equivalent to loading the model in original precision (~140GB) as we immediately quantize the weights and then begin calibration of the activations.
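
The pattern that keeps peak memory near the original-precision footprint looks roughly like this (an illustrative sketch, not the library's actual quantize_weights() code):

```python
import torch

FP8_E4M3_MAX = 448.0

def quantize_linear_inplace(linear: torch.nn.Linear) -> None:
    # Per-tensor FP8 weight quantization: compute a scale, cast, then free the
    # original weight immediately so both copies are never held at once.
    weight = linear.weight.data
    scale = weight.abs().max().clamp(min=1e-12) / FP8_E4M3_MAX
    qweight = (weight / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    del linear.weight                          # drop the high-precision parameter
    linear.register_buffer("weight", qweight)  # keep only the FP8 copy
    linear.register_buffer("weight_scale", scale)
```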

zitgit commented 4 months ago

@mgoin It works on current main! I noticed that "del linear.weight" is necessary in quantize_weights(). Thanks a lot! I'd also like to ask another question: the comment on KV cache quantization says some arguments need to match the representation in vLLM. Do both W8A8 and FP8 KV cache inference strongly rely on vLLM, or can I use other engines?

mgoin commented 4 months ago

@zitgit We are focused on format and performance in vLLM since that is the best open-source inference server with full support for FP8. AFAIK the only other option is TRT-LLM, and it has a custom format that isn't really compatible with HF Transformers, which is the format we are going for here.
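
For the vLLM side, loading such a checkpoint with an FP8 KV cache looks roughly like this (a sketch; double-check the `kv_cache_dtype="fp8"` flag against your vLLM version):

```python
from vllm import LLM, SamplingParams

# Load the FP8 checkpoint and enable the FP8 KV cache so the stored
# kv_scale tensors are actually used at runtime.
llm = LLM(model="nm-testing/Meta-Llama-3-8B-Instruct-FP8-KV", kv_cache_dtype="fp8")
out = llm.generate(["What is FP8 quantization?"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```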

I am going to close this issue since fp8 kv cache is now supported (which was the original issue). Please open a new issue if you'd like to continue conversation, thanks!

HaiShaw commented 4 months ago

@mgoin, it would be nice if you could update the screenshot above - I think input_scale is now used in place of act_scale. If possible, can you also show output_scale in it?