KimmiShi opened this issue 11 months ago
I have the same question.
Same here; I found that int8 and bfloat16 consume the same CUDA memory.
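For anyone who wants to reproduce this, here is a rough sketch of such a comparison. It assumes a single GPU, a Hugging Face causal LM (`facebook/opt-1.3b` is just an example), and that your DeepSpeed build accepts both dtypes in `init_inference`:

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

# Compare peak CUDA memory after initializing inference in each dtype.
for dtype in (torch.bfloat16, torch.int8):
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
    engine = deepspeed.init_inference(model, dtype=dtype)
    print(f"{dtype}: {torch.cuda.max_memory_allocated() / 2**20:.0f} MiB")
    del engine, model
```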
I ran into the same issue...
second this.
Hi, I read the docs about `zero_quant`, but it seems to require extra training. And in `deepspeed.init_inference`, the `dtype` can be set to int8, but the code does nothing for int8 (https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/inference/engine.py#L521). Is it possible to quantize an existing LLM directly and run inference?