intel / neural-compressor

SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
https://intel.github.io/neural-compressor/
Apache License 2.0

add pad_to_buckets in evaluation for hpu performance #2011

Closed. xin3he closed this 1 month ago

xin3he commented 1 month ago

Type of Change

lm_eval evaluation enhancement

Description

buckets = [64, 128, 256, 512, 1024, 2048, 4096, 8192]

This change pads the input length up to the nearest bucket upper bound. Because the HPU accelerator compiles a separate graph for each input shape, bucketing the lengths avoids compiling a new graph for every distinct sequence length. A sketch of the rounding-up logic follows.
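A minimal sketch of the bucketing logic described above, assuming the inputs are a PyTorch tensor of token ids. `pad_to_bucket` is a hypothetical name, not the function this PR adds, and the real change would also need to keep attention masks consistent with the padded length:

```python
import torch

# Bucket upper bounds taken from the PR description.
BUCKETS = [64, 128, 256, 512, 1024, 2048, 4096, 8192]

def pad_to_bucket(input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    """Pad a batch of token ids up to the nearest bucket upper bound.

    Hypothetical helper for illustration only; the PR's actual code may
    pad on a different side and must also extend the attention mask.
    """
    seq_len = input_ids.shape[-1]
    # Smallest bucket that fits the sequence; sequences longer than the
    # largest bucket are left unpadded here.
    target = next((b for b in BUCKETS if b >= seq_len), seq_len)
    pad_len = target - seq_len
    if pad_len == 0:
        return input_ids
    pad_shape = (*input_ids.shape[:-1], pad_len)
    padding = torch.full(pad_shape, pad_token_id,
                         dtype=input_ids.dtype, device=input_ids.device)
    # Left-pad so the real tokens stay right-aligned for causal decoding.
    return torch.cat([padding, input_ids], dim=-1)
```

With every input padded to one of these eight lengths, the HPU sees a small fixed set of shapes instead of one shape per distinct sequence length, which bounds the number of compiled graphs.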

Expected Behavior & Potential Risk

The lm_eval example test gives the same results as before.

xin3he commented 1 month ago

Accuracy report https://inteltf-jenk.sh.intel.com/job/INC_LLM_accuracy/137/artifact/report.html aligns with the master branch report: https://inteltf-jenk.sh.intel.com/job/INC_LLM_accuracy/107/artifact/report.html