KimmiShi opened this issue 11 months ago
I have the same question.
Same here; I found that int8 and bfloat16 consume the same CUDA memory.
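For anyone who wants to reproduce this, here is a rough sketch of such a comparison. It assumes a single GPU, a Hugging Face causal LM (`facebook/opt-1.3b` is just an example), and that your DeepSpeed build accepts both dtypes in `init_inference`:

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

# Compare peak CUDA memory after initializing inference in each dtype.
for dtype in (torch.bfloat16, torch.int8):
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
    engine = deepspeed.init_inference(model, dtype=dtype)
    print(f"{dtype}: {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")
    del engine, model
```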
I ran into the same issue...
second this.
Hi, I read the docs about `zero_quant`, but it seems to require extra training. And in `deepspeed.init_inference`, the `dtype` can be set to int8, but the code does nothing for int8 (https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/inference/engine.py#L521). Is it possible to quantize an existing LLM directly and run inference?