intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Optimize the model used in Ant group for inference #8998

Open qzheng527 opened 11 months ago

qzheng527 commented 11 months ago

There is a BERT-based model used at Ant Group for inference on geographic entity similarity comparison: https://modelscope.cn/models/damo/mgeo_geographic_entity_alignment_chinese_base/summary https://modelscope.cn/models/damo/mgeo_geographic_entity_alignment_chinese_base/files

It runs successfully with Occlum on an SGX server. We hope the BigDL team can help optimize the model to achieve better inference performance.

shane-huang commented 11 months ago

You may refer to our example on how to accelerate BERT: https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/pytorch-models/bert.
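
For reference, a minimal sketch of that approach, assuming a bigdl-llm install and using a generic HuggingFace BERT checkpoint as a stand-in for the MGeo model (the checkpoint name and input text are placeholders):

```python
import torch
from transformers import AutoModel, AutoTokenizer
from bigdl.llm import optimize_model  # one-line optimization API shown in the linked example

# Placeholder checkpoint; substitute the actual MGeo/BERT weights.
model_path = "bert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)

# Apply BigDL-LLM optimization to the loaded PyTorch model in place of offline conversion.
model = optimize_model(model)

inputs = tokenizer("上海市浦东新区张江高科技园区", return_tensors="pt")
with torch.inference_mode():
    outputs = model(**inputs)
```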

qzheng527 commented 11 months ago

@shane-huang Is there a way to do the model optimization first and save the result to a file, so that subsequent inference can use the optimized model directly? BTW, can BigDL support ONNX model optimization?

qiyuangong commented 11 months ago

> @shane-huang Is there a way to do the model optimization first and save the result to a file, so that subsequent inference can use the optimized model directly? BTW, can BigDL support ONNX model optimization?

Hi @qzheng527 Yes. We still support optimizing models saved in specific formats, especially in BigDL Nano. The APIs Shane shared are a newer approach that avoids offline model optimization (conversion, quantization, etc.); they improve performance without converting the model.
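
If saving the optimized weights offline is still preferred, here is a hedged sketch of one possible flow, assuming the save_low_bit / load_low_bit helpers in bigdl.llm apply to this model (please verify against your bigdl-llm version; the checkpoint name and directory are placeholders):

```python
from transformers import AutoModel
from bigdl.llm import optimize_model
from bigdl.llm.optimize import load_low_bit  # assumed helper; check your bigdl-llm version

save_dir = "./mgeo-optimized"  # hypothetical output directory

# One-time step: optimize the model and persist the optimized weights.
model = AutoModel.from_pretrained("bert-base-chinese")  # placeholder checkpoint
model = optimize_model(model)
model.save_low_bit(save_dir)

# Later runs: rebuild the architecture, then load the saved optimized weights directly.
fresh = AutoModel.from_pretrained("bert-base-chinese")
fresh = load_low_bit(fresh, save_dir)
```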

The ONNX format is still supported by BigDL Nano. You can find more details at https://bigdl.readthedocs.io/en/latest/doc/Nano/QuickStart/pytorch_onnxruntime.html
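
A minimal sketch of the ONNX Runtime path from that quickstart, assuming BigDL Nano's InferenceOptimizer.trace API; the checkpoint name and the dummy BERT input sample are placeholders:

```python
import torch
from transformers import AutoModel
from bigdl.nano.pytorch import InferenceOptimizer

model = AutoModel.from_pretrained("bert-base-chinese")  # placeholder checkpoint
model.eval()

# Dummy (input_ids, attention_mask) sample used only to trace the ONNX graph.
input_sample = (torch.ones(1, 32, dtype=torch.long),
                torch.ones(1, 32, dtype=torch.long))

# Trace the PyTorch model into an ONNX Runtime-accelerated module.
ort_model = InferenceOptimizer.trace(model,
                                     accelerator="onnxruntime",
                                     input_sample=input_sample)

with torch.inference_mode():
    outputs = ort_model(*input_sample)
```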