huggingface / optimum

🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy to use hardware optimization tools
https://huggingface.co/docs/optimum/main/
Apache License 2.0

Slow packing times for GPTQ #1943

Open · zankner opened this issue 3 months ago

zankner commented 3 months ago

System Info

optimum=1.20.0, python=3.11.9, torch=2.3.1+cu121, system=ubuntu 20.04

Who can help?

No response

Reproduction (minimal, reproducible, runnable)

I'm quantizing opt-125m with GPTQ. The quantization itself is fast, but packing the layers afterwards is very slow. The code I use to quantize the model is as follows:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

quant_config = GPTQConfig(
    bits=args.bits,  # e.g. 4
    group_size=128,
    tokenizer=tokenizer,
    dataset=["c4-new"],
)

quantized_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # Remove later
    device_map="cuda:0",
)
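
For completeness, once quantization finishes the quantized model can be saved and reloaded like any other transformers checkpoint (the directory name below is arbitrary):

# Save the packed, quantized checkpoint and reload it for inference.
quantized_model.save_pretrained("opt-125m-gptq")
tokenizer.save_pretrained("opt-125m-gptq")

reloaded = AutoModelForCausalLM.from_pretrained("opt-125m-gptq", device_map="cuda:0")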

I added some timings to GPTQ packing (https://github.com/huggingface/optimum/blob/5c803db8cef21b22d0bdbf8a69653b74656e193e/optimum/gptq/quantizer.py#L614):

import time

# Timing added around the existing pack loop in optimum/gptq/quantizer.py;
# quantizers, qlayers, layers and logger are defined by the surrounding code.
for name in qlayers:
    logger.info(name)
    start = time.time()
    quantizers[name], scale, zero, g_idx = quantizers[name]
    # so far can only pack layer on CPU
    layer_device = qlayers[name].device
    qlayers[name].to("cpu")
    layers[name], scale, zero, g_idx = layers[name].to("cpu"), scale.to("cpu"), zero.to("cpu"), g_idx.to("cpu")
    qlayers[name].pack(layers[name], scale, zero, g_idx)
    qlayers[name].to(layer_device)
    print(f"Time for {name}: {time.time() - start}")

This produces the following timings:

Time for transformer.blocks.0.attn.Wqkv: 0.24
Time for transformer.blocks.0.attn.out_proj: 0.08
Time for transformer.blocks.0.ffn.down_proj: 0.25
Time for transformer.blocks.0.ffn.up_proj: 91.95

However, if I run the packing in parallel (which I believe preserves correctness), as follows:

from concurrent.futures import ThreadPoolExecutor
import time

def pack_layer(name):
    logger.info(name)
    start = time.time()
    quantizers[name], scale, zero, g_idx = quantizers[name]
    # so far can only pack layer on CPU
    layer_device = qlayers[name].device
    qlayers[name].to("cpu")
    layers[name], scale, zero, g_idx = layers[name].to("cpu"), scale.to("cpu"), zero.to("cpu"), g_idx.to("cpu")
    qlayers[name].pack(layers[name], scale, zero, g_idx)
    qlayers[name].to(layer_device)
    print(f"Time for {name}: {time.time() - start}")

with ThreadPoolExecutor() as executor:
    # list(...) forces all tasks to complete and surfaces any worker exceptions
    list(executor.map(pack_layer, qlayers.keys()))

The timings are:

Time for transformer.blocks.0.attn.Wqkv: 0.21
Time for transformer.blocks.0.attn.out_proj: 0.10
Time for transformer.blocks.0.ffn.down_proj: 0.22
Time for transformer.blocks.0.ffn.up_proj: 3.64

Expected behavior

What's strange is that packing takes so long at all: ~90 seconds to pack the up projection. What's even stranger is that the individual per-layer packing times are lower when run in parallel (not just the overall wall-clock time). That is, packing the up projection takes ~90 seconds sequentially but drops to ~3.6 seconds when the layers are packed in parallel.
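
One hypothesis (speculation on my part, not confirmed anywhere in this thread) is that the sequential slowdown is tied to CPU thread behavior rather than the packing math itself. A quick diagnostic, assuming the same qlayers/layers objects as above (torch.set_num_threads/get_num_threads are real torch APIs; the helper itself is hypothetical):

import time
import torch

def timed_pack(qlayer, layer, scale, zero, g_idx, num_threads):
    """Time a single pack() call under a given torch intra-op thread count.
    Purely diagnostic; pack() should only be called once per layer."""
    old = torch.get_num_threads()
    torch.set_num_threads(num_threads)
    try:
        start = time.time()
        qlayer.pack(layer, scale, zero, g_idx)
        return time.time() - start
    finally:
        torch.set_num_threads(old)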

Qubitium commented 2 months ago

@zankner I have fixed this in GPTQModel and back-ported the fix to the main branch of AutoGPTQ.
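
For anyone hitting this in the meantime: packing boils down to bit-packing integer weights into int32 words, and the classic slow path is doing that with Python-level loops over rows/columns. A vectorized sketch of that operation (illustrative only, not the exact patch; assumes 4-bit weights whose row count is a multiple of 8):

import numpy as np

def pack_4bit(intweight: np.ndarray) -> np.ndarray:
    """Pack an (in_features, out_features) array of 4-bit integer weights
    into int32 words, 8 values per word, using vectorized shifts instead
    of a Python loop."""
    rows, cols = intweight.shape
    assert rows % 8 == 0, "need 8 nibbles per int32 word"
    w = intweight.astype(np.uint32).reshape(rows // 8, 8, cols)
    shifts = (4 * np.arange(8, dtype=np.uint32)).reshape(1, 8, 1)
    # OR the shifted nibbles together along the group-of-8 axis
    packed = np.bitwise_or.reduce(w << shifts, axis=1)
    return packed.view(np.int32)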

zankner commented 2 months ago

Awesome thanks!

zankner commented 2 months ago

@Qubitium sorry to re-open, but the fix still needs to be ported into Optimum.