intel / auto-round

Advanced Quantization Algorithm for LLMs. This is the official implementation of "Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs"
https://arxiv.org/abs/2309.05516
Apache License 2.0

1.8X speedup via disable_low_gpu_mem_usage, and reduced memory usage by avoiding torch.cat #106

Closed · wenhuach21 closed this 3 months ago
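
For context on the memory claim in the title: growing a tensor with torch.cat inside a loop keeps two copies of the accumulated data alive on every call, while preallocating the output once and copying chunks into slices avoids that. A minimal sketch of the pattern; the function and variable names are illustrative, not taken from the auto-round source:

```python
import torch

def concat_by_growing(chunks):
    # Anti-pattern: each torch.cat allocates a fresh buffer and copies
    # everything accumulated so far, so peak memory is ~2x the output.
    out = chunks[0]
    for chunk in chunks[1:]:
        out = torch.cat([out, chunk], dim=0)
    return out

def concat_by_preallocating(chunks):
    # Allocate the final buffer once and copy each chunk into its slice;
    # peak memory is the output plus a single chunk.
    total = sum(c.shape[0] for c in chunks)
    out = torch.empty(total, *chunks[0].shape[1:],
                      dtype=chunks[0].dtype, device=chunks[0].device)
    offset = 0
    for chunk in chunks:
        out[offset:offset + chunk.shape[0]].copy_(chunk)
        offset += chunk.shape[0]
    return out
```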

wenhuach21 commented 3 months ago

smoke test done (a reproduction sketch follows the list):

- llama3, lm-head
- baichuan13b, lm-head
- chatglm3 (lm-head named `transformer.output_layer`)
- opt, tied lm-head
- gemma-7b
- phi-2, lm-head
- mixtral
- Qwen1.5-7B-Chat, lm-head
- Baichuan2-7B-Chat, lm-head
- gpt-j-6b, lm-head
- LaMini-GPT-124M, conv1d tied weight
- gpt-neo-125m, lm-head tied weight
- dolly-v2-3b, embed_out
- stablelm-base-alpha-3, tied embed_out
- bloom7b1, tied lm-head
- Phi-3-mini-4k-instruct, lm-head
- solar, lm-head
- llama3_8b_instruct-chat, lm-head
- codegen25-7b-mult 4.33.2, lm-head
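
A minimal sketch of how one of these smoke tests might be reproduced with the AutoRound API. The `low_gpu_mem_usage` keyword is an assumption based on the flag referenced in the title; check the signature in the current revision before use:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # one of the smoke-tested models
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# low_gpu_mem_usage=False corresponds to the disabled path that this PR
# reports as ~1.8X faster (assumed keyword name).
# The lm-head quantization exercised above is configured separately; the
# exact option name varies by revision and is omitted here.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128,
                      low_gpu_mem_usage=False)
autoround.quantize()
autoround.save_quantized("./llama3-8b-instruct-w4g128")
```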