stochasticai / xTuring

Build, customize and control your own LLMs. From data pre-processing to fine-tuning, xTuring provides an easy way to personalize open-source LLMs. Join our Discord community: https://discord.gg/TgHXuSJEk6
https://xturing.stochastic.ai
Apache License 2.0

Integrate ITREX to support int8 model on the CPU-only devices #268

Closed: yiliu30 closed this 1 year ago

yiliu30 commented 1 year ago

First PR for https://github.com/stochasticai/xTuring/issues/264: integrate Intel Extension for Transformers (ITREX) to support int8 models on CPU-only devices.

Usage

from xturing.models import BaseModel

# Auto-detect the device and load the int8 model via ITREX if CPU-only
model = BaseModel.create("llama2_int8")

# Run inference with ITREX's highly optimized kernels
output = model.generate(texts=["Why are the LLM models important?"])
print(output)
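
For context, a minimal sketch of the device check this description implies, assuming it boils down to torch.cuda.is_available(); the actual dispatch inside xTuring's int8 engine may differ:

import torch

def should_use_itrex_int8() -> bool:
    # Hypothetical helper (not xTuring's actual code): fall back to ITREX's
    # int8 CPU kernels only when no CUDA device is available; otherwise the
    # existing GPU int8 path is kept.
    return not torch.cuda.is_available()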

TODO

@StochasticRomanAgeev

yiliu30 commented 1 year ago

Hi @StochasticRomanAgeev, could you please take the time to conduct a preliminary review? Thanks :)

yiliu30 commented 1 year ago

Hi @StochasticRomanAgeev @tushar2407 @MarcosRiveraMartinez, could you please take some time to review the PR? Thanks :)

StochasticRomanAgeev commented 1 year ago

Hi @yiliu30, done merging. Thanks for doing this integration!

StochasticRomanAgeev commented 1 year ago

It would be great if you could also add a brief description of this update to the CPU inference section of the README. This could be provided in a subsequent PR.

yiliu30 commented 1 year ago

> It would be great if you could also add a brief description of this update to the CPU inference section of the README. This could be provided in a subsequent PR.

Sure, I'll add an introduction in a follow-up PR soon :)
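
As a pointer for that follow-up: ITREX is published on PyPI as intel-extension-for-transformers, so the README's CPU inference section would presumably note installing that package (pip install intel-extension-for-transformers) before using the int8 model path shown above.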