stochasticai / xTuring

Build, customize and control your own LLMs. From data pre-processing to fine-tuning, xTuring provides an easy way to personalize open-source LLMs. Join our Discord community: https://discord.gg/TgHXuSJEk6
https://xturing.stochastic.ai
Apache License 2.0
2.61k stars · 207 forks

Integrate ITREX to support popular compression algorithms and highly optimized kernels #263

Closed yiliu30 closed 1 year ago

yiliu30 commented 1 year ago

First PR for https://github.com/stochasticai/xTuring/issues/264

Usage

from xturing.models import BaseModel

# Specify the quantization configuration
from intel_extension_for_transformers.transformers import WeightOnlyQuantConfig
woq_config = WeightOnlyQuantConfig(weight_dtype='int8')
model = BaseModel.create("gpt2", quantization_config=woq_config)

# Run inference with ITREX's highly optimized kernels
output = model.generate(texts=["Why are the LLM models important?"])
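For intuition, here is a minimal, illustrative sketch (not ITREX code) of the symmetric per-tensor int8 weight-only scheme that `WeightOnlyQuantConfig(weight_dtype='int8')` selects: only the weights are stored as int8 plus a float scale, while activations stay in floating point. Function names here are hypothetical.

```python
# Illustrative sketch of weight-only int8 quantization (assumption:
# symmetric per-tensor scaling; ITREX's actual kernels are more elaborate).

def quantize_int8(weights):
    """Map float weights to int8 values plus a single float scale."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid zero scale
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights for the matmul."""
    return [v * scale for v in q]

w = [0.52, -1.27, 0.03, 0.8]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
```

Storing `q` (1 byte per weight) instead of float32 cuts weight memory roughly 4x, at the cost of a small reconstruction error bounded by half the scale.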

TODO

@StochasticRomanAgeev @tushar2407

StochasticRomanAgeev commented 1 year ago

Hi @yiliu30, thanks for the PR! First question: what does this approach offer over the int8 versions of models we already support?

yiliu30 commented 1 year ago

Created https://github.com/stochasticai/xTuring/pull/268 for this integration; closing this one first.