intel / neural-compressor

SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
https://intel.github.io/neural-compressor/
Apache License 2.0

LLM SmoothQuant: how to add a custom evaluate function? #1999

Closed · tianylijun closed this issue 10 hours ago

tianylijun commented 1 week ago

Version: tag v3.0 (PyTorch). Example: examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/smooth_quant/run_clm_no_trainer.py

  1. For LLM SmoothQuant, how do I add a custom evaluate function? Currently it seems only the built-in tasks can be used. How can I add a custom evaluation for a new dataset? (One possible approach is sketched after the code below.)

    from intel_extension_for_transformers.transformers.llm.evaluation.lm_eval import evaluate, LMEvalParser
    eval_args = LMEvalParser(
        model="hf",
        user_model=user_model,
        tokenizer=tokenizer,
        batch_size=args.batch_size,
        tasks=args.tasks,  # <-- only a fixed set of task names is accepted here
        device="cpu",
    )
    results = evaluate(eval_args)
  2. The run_fn for calibration, called by autotune, only does one forward pass per prompt, and the generated tokens are not used. Does that mean the generated tokens do not contribute to the feature-map distribution?

    def run_fn(model):
        calib_iter = 0
        for batch in tqdm(calib_dataloader):
            batch = move_input_to_device(batch, device="cpu")
            if isinstance(batch, (tuple, list)):
                model(batch[0])  # <-- only one forward pass per prompt; generated tokens are not used
            elif isinstance(batch, dict):
                model(**batch)
            else:
                model(batch)
            calib_iter += 1
            if calib_iter >= args.calib_iters:
                break
        return
    
    user_model = autotune(
        user_model, 
        tune_config=tune_config,
        eval_fn=eval_func,
        run_fn=run_fn,
        example_inputs=example_inputs,
    )
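
A minimal sketch of one way to plug a custom dataset into this flow, assuming autotune calls eval_fn(model) and compares the returned float (higher is better) against the baseline; custom_eval_dataloader and the perplexity-based metric are hypothetical stand-ins, not part of the original script:

    import math
    import torch

    # Hedged sketch of a custom evaluate function for a new dataset.
    # Assumption: autotune invokes eval_fn(model) and expects a float
    # score where higher means better.
    def eval_func(model):
        model.eval()
        total_loss, total_tokens = 0.0, 0
        with torch.no_grad():
            for batch in custom_eval_dataloader:  # hypothetical DataLoader
                input_ids = batch["input_ids"]
                # HF causal LMs return .loss when labels are provided.
                out = model(input_ids=input_ids, labels=input_ids)
                total_loss += out.loss.item() * input_ids.numel()
                total_tokens += input_ids.numel()
        ppl = math.exp(total_loss / total_tokens)
        return -ppl  # negate so that lower perplexity -> higher score

Passing this function as eval_fn=eval_func in the autotune call above would let tuning score candidate models on the new dataset instead of a fixed lm-eval task.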
xin3he commented 15 hours ago

Hi @tianylijun. For Q1, the tasks supported here are inherited from the popular EleutherAI/lm-evaluation-harness repo; usually these tasks cover what you need. As for Q2, calibration has no need to record the generated tokens: the data distribution is recorded by each layer during the forward pass itself. If you have any other questions, please let me know.
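
To make the calibration point concrete, here is a minimal illustration, not Neural Compressor's internal code, of how per-layer activation statistics can be gathered with plain PyTorch forward hooks; user_model and run_fn are the objects from the snippet above, and restricting to torch.nn.Linear is an arbitrary choice for the sketch:

    import torch

    # Forward hooks record the activation range seen by each layer during
    # the calibration forward passes, which is why generated tokens are
    # never needed: only the forward-pass activations matter.
    stats = {}

    def make_hook(name):
        def hook(module, inputs, output):
            x = inputs[0]
            if isinstance(x, torch.Tensor):
                lo, hi = x.min().item(), x.max().item()
                old_lo, old_hi = stats.get(name, (lo, hi))
                stats[name] = (min(old_lo, lo), max(old_hi, hi))
        return hook

    handles = [
        module.register_forward_hook(make_hook(name))
        for name, module in user_model.named_modules()
        if isinstance(module, torch.nn.Linear)
    ]
    run_fn(user_model)  # the calibration loop from the snippet above
    for h in handles:
        h.remove()
    # stats now maps each Linear layer to its observed input (min, max);
    # nothing about decoding or generated tokens is involved.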