Background
Nano currently has a great collection of PyTorch/TensorFlow inference acceleration methods, while our users may need an automatic (intelligently guided) pipeline to find which one works best.
Methodology
Currently we have 3 workflows for users who care about inference performance (a rough usage sketch follows the table below).
| Method | Accuracy Drop | Expected Acceleration Ratio | Retrain | Success Ratio |
| --- | --- | --- | --- | --- |
| Trainer.quantize | True (except bf16) | 1~4X | False | low |
| Trainer.trace | False | 1~2X | False | high |
| Trainer.search | True | 1~20X | True | medium |
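As a rough illustration of the status quo, a user currently has to pick one of these entry points by hand. The snippet below is only a sketch; the argument names (`accelerator`, `input_sample`, `precision`, `calib_dataloader`) are assumptions for illustration, not the exact Trainer signatures.

```python
from bigdl.nano.pytorch import Trainer  # existing entry points listed above

# `model`, `sample_input`, `train_loader` are placeholders defined elsewhere

# graph-level acceleration, no accuracy drop expected
traced_model = Trainer.trace(model, accelerator="onnxruntime",
                             input_sample=sample_input)

# post-training quantization, a small accuracy drop is possible
quantized_model = Trainer.quantize(model, precision="int8",
                                   calib_dataloader=train_loader)
```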
For this pipeline design, we classify users into 2 categories.
Users bring a trained model (maybe loaded from checkpoint files), and would like to optimize this specific model in a short time.
We will have a new API to find the best accelerated model for our users automatically; the detailed API design is illustrated below. This new API (bigdl.nano.pytorch.Optimizer) will cover the original Trainer.quantize and Trainer.trace.
Some candidate names, since Optimizer might be confused with PyTorch's optimizers (e.g. Adam):
AutoAcceleration
AutoOptimizer
InferenceOptimizer
Users have a model definition and would like to find the best hyperparameter configuration to balance the accuracy and latency.
Trainer.search already handles this case, so we will not cover it in detail in this issue.
A new workflow could be
flowchart TD
StartPoint[I want to make this model meet my performance requirement]
StartPoint -- trained model --> bigdl.nano.pytorch.InferenceOptimizer
StartPoint -- willing to retrain the model --> bigdl.nano.pytorch.Trainer.search
bigdl.nano.pytorch.Trainer.search -- further optimize the model --> bigdl.nano.pytorch.InferenceOptimizer
bigdl.nano.pytorch.InferenceOptimizer -- I know what method to use --> InferenceOptimizer.trace/quantize
bigdl.nano.pytorch.InferenceOptimizer -- find the best method for me --> InferenceOptimizer.optimize
InferenceOptimizer.optimize -- get model for inferencing --> InferenceOptimizer.get_best_model
InferenceOptimizer.get_best_model -- export for serving --> bigdl.nano.pytorch.Trainer.save
InferenceOptimizer.trace/quantize -- export for serving --> bigdl.nano.pytorch.Trainer.save
bigdl.nano.pytorch.Trainer.save -- load back --> bigdl.nano.pytorch.Trainer.load
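A hedged end-to-end sketch of the "find the best method for me" path in the flowchart above; the import path, the instance-based calling style and the Trainer.save/load arguments are assumptions for illustration, not the final API.

```python
from bigdl.nano.pytorch import InferenceOptimizer, Trainer  # assumed import path

# `model`, `train_loader`, `val_loader`, `accuracy_fn` are placeholders defined elsewhere
optimizer = InferenceOptimizer()
optimizer.optimize(model,
                   training_data=train_loader,
                   validation_data=val_loader,
                   metric=accuracy_fn)
best_model = optimizer.get_best_model()

# export for serving, then load back (argument order is illustrative)
Trainer.save(best_model, "./optimized_model")
served_model = Trainer.load("./optimized_model")
```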
Among all of these functionalities:
Optimizer.trace/quantize: We can use the same API design and implementation as Trainer.trace/quantize and leave Trainer.trace/quantize unchanged as a legacy API.
Optimizer.optimize: This is a comprehensive API for post-training optimizations. A detailed design and a prototype are listed below.
Optimizer.get_best_model: a new API to find the best model that meets the given criteria.
Other APIs remain the same.
API Design
Please find a prototype implementation in: #5336
This API is designed to be:
really easy to use without extra parameters;
the detailed acceleration strategy is completely hidden from our users.
# bigdl.nano.pytorch.inference.Optimizer
def optimize(model,
             training_data,
             validation_data=None,
             metric=None,
             direction=None,
             cpu_num=None):
    '''
    This function will give all available inference acceleration methods a try
    and record the latency, accuracy and model instance inside the Optimizer for
    future usage.

    :param model: An nn.Module to be optimized.
    :param training_data: A PyTorch DataLoader for the training dataset.
           Users should be careful with this parameter since this dataloader
           might be exposed to the model, which may cause data leakage. The
           batch_size of this dataloader matters as well; users may want to
           set it to the same batch size they plan to use in the real
           deployment environment, e.g. batch size should be set to 1
           if the accelerated model will serve an online service.
    :param validation_data: (optional) A PyTorch DataLoader for accuracy evaluation.
           This is only needed when users care about the possible accuracy drop.
    :param metric: (optional) A callable object that takes prediction and target
           and returns an accuracy value, invoked as `metric(pred, target)`.
    :param direction: (optional) A string that indicates whether higher or lower
           is better for the metric, "min" for the lower the better and "max"
           for the higher the better.
    :param cpu_num: (optional) An int representing how many CPU cores are needed
           for inference.
    '''
    # pseudo-code:
    # available_methods = _check_acceleration_methods_dependencies()
    # for method in available_methods:
    #     accelerated_model = method(model)
    #     performance = evaluate_performance(accelerated_model, training_data)
    #     accuracy = evaluate_accuracy(accelerated_model, validation_data, metric)
    #     if accuracy meets requirement and performance (latency) is lower:
    #         model_to_be_returned = accelerated_model
    # return model_to_be_returned
def get_best_model(accelerator=None,
                   precision=None,
                   use_ipex=None,
                   allow_acc=None):
    '''
    :param accelerator: (optional) If not None, only find models with this
           specific accelerator.
    :param precision: (optional) If not None, only find models with this
           specific precision.
    :param use_ipex: (optional) If not None, only find models with this
           specific IPEX setting.
    :param allow_acc: (optional) A float representing the accuracy threshold
           that can be tolerated.
    :return: best model
    '''
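One possible way, sketched below purely for illustration (not part of the proposed API), for optimize to record its results so that get_best_model can filter them: keep the measured stats per (accelerator, precision, use_ipex) combination and return the fastest variant that passes the filters.

```python
# minimal, self-contained sketch of the bookkeeping idea (all numbers are made up)
results = {
    # (accelerator, precision, use_ipex) -> stats measured during optimize()
    ("onnxruntime", "int8", False): {"latency_ms": 3.2, "accuracy": 0.921},
    ("openvino",    "fp32", False): {"latency_ms": 5.1, "accuracy": 0.930},
    (None,          "bf16", True):  {"latency_ms": 7.8, "accuracy": 0.930},
}

def pick_best(results, accelerator=None, precision=None, use_ipex=None, allow_acc=None):
    """Return the fastest (key, stats) pair that satisfies all filters, or None."""
    candidates = [
        (key, stats) for key, stats in results.items()
        if (accelerator is None or key[0] == accelerator)
        and (precision is None or key[1] == precision)
        and (use_ipex is None or key[2] == use_ipex)
        and (allow_acc is None or stats["accuracy"] >= allow_acc)
    ]
    return min(candidates, key=lambda kv: kv[1]["latency_ms"], default=None)

print(pick_best(results, allow_acc=0.93))  # -> the openvino fp32 variant
```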
Some demo calls
A user who does not care about the accuracy drop and cares about single-sample inference speed may call this function like this:
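For example (a minimal sketch; the import path, the instance-based calling style and the dataloader name are placeholders, not the final API):

```python
from bigdl.nano.pytorch import InferenceOptimizer  # assumed import path

# `model` and `train_loader_bs1` (a DataLoader built with batch_size=1,
# matching the online-serving scenario) are placeholders defined elsewhere
opt = InferenceOptimizer()
opt.optimize(model, training_data=train_loader_bs1)
# no validation_data/metric is passed because accuracy drop is not a concern
best_model = opt.get_best_model()
```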
A user who has strict accuracy requirements and cares about inference speed on a large batch may call the function like this:
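Again a minimal sketch; the import path, the calling style and the concrete allow_acc value are placeholders.

```python
from bigdl.nano.pytorch import InferenceOptimizer  # assumed import path

# `model`, `train_loader_bs256` (a large-batch DataLoader reflecting the
# throughput-oriented scenario), `val_loader` and `top1_accuracy` are placeholders
opt = InferenceOptimizer()
opt.optimize(model,
             training_data=train_loader_bs256,
             validation_data=val_loader,
             metric=top1_accuracy,   # callable used as metric(pred, target)
             direction="max")
# only accept variants whose accuracy stays above the tolerated threshold
best_model = opt.get_best_model(allow_acc=0.92)
```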
TODO
after #5336
- [x] implement direction in optimizer.optimize
- [x] implement cpu_num in optimizer.optimize
- [x] implement optimizer.get_best_model
- [ ] migrate optimizer.trace
- [ ] migrate optimizer.quantize
- [ ] example to show the full pipeline