intel / neural-compressor

SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
https://intel.github.io/neural-compressor/
Apache License 2.0

Any example to quantise a text embedding model on Intel Gaudi2? #1919

Open sleepingcat4 opened 1 month ago

sleepingcat4 commented 1 month ago

I was looking for an example or documentation on how to load and quantise a HF embedding model on Intel Gaudi2. Are there any examples available? I don't want to use Docker, btw.

NeoZhangJianyu commented 1 month ago

@sleepingcat4 Please refer to: https://github.com/intel/neural-compressor/tree/bfa27e422dc4760f6a9b1783eee7dae10fe5324f/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/habana_fp8.

sleepingcat4 commented 1 month ago

Thank you! I will experiment with it tomorrow.