intel / neural-compressor

SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
https://intel.github.io/neural-compressor/
Apache License 2.0

add pad_to_buckets in evaluation for hpu performance #2011

Closed. xin3he closed this 1 month ago

xin3he commented 1 month ago

Type of Change

lm_eval evaluation enhancement

Description

buckets = [64, 128, 256, 512, 1024, 2048, 4096, 8192]

This change pads the input length up to the nearest bucket upper bound. Because the HPU accelerator compiles a separate graph for each input shape, bucketing the lengths avoids compiling a new graph for every distinct sequence length. A sketch of the rounding-up logic follows.
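A minimal sketch of the bucketing logic described above, assuming the inputs are a PyTorch tensor of token ids. `pad_to_bucket` is a hypothetical name, not the function this PR adds, and the real change would also need to keep attention masks consistent with the padded length:

```python
import torch

# Bucket upper bounds taken from the PR description.
BUCKETS = [64, 128, 256, 512, 1024, 2048, 4096, 8192]

def pad_to_bucket(input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    """Pad a batch of token ids up to the nearest bucket upper bound.

    Hypothetical helper for illustration only; the PR's actual code may
    pad on a different side and must also extend the attention mask.
    """
    seq_len = input_ids.shape[-1]
    # Smallest bucket that fits the sequence; sequences longer than the
    # largest bucket are left unpadded here.
    target = next((b for b in BUCKETS if b >= seq_len), seq_len)
    pad_len = target - seq_len
    if pad_len == 0:
        return input_ids
    pad_shape = (*input_ids.shape[:-1], pad_len)
    padding = torch.full(pad_shape, pad_token_id,
                         dtype=input_ids.dtype, device=input_ids.device)
    # Left-pad so the real tokens stay right-aligned for causal decoding.
    return torch.cat([padding, input_ids], dim=-1)
```

With every input padded to one of these eight lengths, the HPU sees a small fixed set of shapes instead of one shape per distinct sequence length, which bounds the number of compiled graphs.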

Expected Behavior & Potential Risk

The lm_eval example test gives the same results as before.

xin3he commented 1 month ago

Accuracy report https://inteltf-jenk.sh.intel.com/job/INC_LLM_accuracy/137/artifact/report.html aligns with the master branch report: https://inteltf-jenk.sh.intel.com/job/INC_LLM_accuracy/107/artifact/report.html