patil-suraj / onnx_transformers

Accelerated NLP pipelines for fast inference on CPU. Built with Transformers and ONNX runtime.
Apache License 2.0
inference nlp onnx onnxruntime transformers

onnx_transformers

Accelerated NLP pipelines for fast inference 🚀 on CPU. Built with 🤗Transformers and ONNX runtime.

Installation:

pip install git+https://github.com/patil-suraj/onnx_transformers

Usage:

NOTE: This is an experimental project and has only been tested with PyTorch.

The pipeline API is similar to the transformers pipeline API, with a few differences that are explained below.

Provide the path or URL of the model. If needed, the pipeline downloads the model from the hub, then automatically creates the ONNX graph and runs inference.

>>> from onnx_transformers import pipeline
>>> # Initialize a pipeline by passing the task name and
>>> # set onnx to True (True is also the default value)
>>> nlp = pipeline("sentiment-analysis", onnx=True)
>>> nlp("Transformers and onnx runtime is an awesome combo!")
[{'label': 'POSITIVE', 'score': 0.999721109867096}]

Or provide a different model using the model argument.

>>> from onnx_transformers import pipeline
>>> nlp = pipeline("question-answering", model="deepset/roberta-base-squad2", onnx=True)
>>> nlp({
...   "question": "What is ONNX Runtime ?",
...   "context": "ONNX Runtime is a highly performant single inference engine for multiple platforms and hardware"
... })
{'answer': 'highly performant single inference engine for multiple platforms and hardware', 'end': 94, 'score': 0.751201868057251, 'start': 18}

Set onnx to False for standard torch inference.

You can create Pipeline objects for several downstream tasks, including sentiment-analysis, question-answering, feature-extraction, and zero-shot-classification.

Calling the pipeline for the first time loads the model, creates the ONNX graph, and caches it for future use. Because of this, the first load takes some time. Subsequent loads of the same model read the ONNX graph from the cache automatically.
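A rough way to observe this caching is to time pipeline creation twice in a row. The sketch below is illustrative only: `timed`, `make_pipeline`, and `compare_loads` are hypothetical helper names, not part of onnx_transformers, and the actual difference depends on your machine and on whether the model has already been downloaded.

```python
import time
from typing import Any, Callable, Tuple


def timed(factory: Callable[[], Any]) -> Tuple[Any, float]:
    """Run factory() once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = factory()
    return result, time.perf_counter() - start


def make_pipeline() -> Any:
    # Imported lazily so that `timed` can be used without the package installed.
    from onnx_transformers import pipeline

    return pipeline("sentiment-analysis", onnx=True)


def compare_loads() -> None:
    # First creation: may download the model and build the ONNX graph.
    _, first = timed(make_pipeline)
    # Second creation: expected to reuse the cached ONNX graph.
    _, second = timed(make_pipeline)
    print(f"first load:  {first:.2f} s")
    print(f"second load: {second:.2f} s")
```

Call `compare_loads()` in an environment where onnx_transformers is installed. If the cache is being used, the second number should be smaller than the first.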

The key difference between the HF pipeline and onnx_transformers is that the model parameter must always be a string (a path or URL to the saved model). Also, the zero-shot-classification pipeline here uses roberta-large-mnli as the default model instead of facebook/bart-large-mnli, because BART has not yet been tested with ONNX runtime.
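Because the model must be given as a string here, a small guard can catch a model object passed out of habit from the transformers API. This is a minimal sketch: `require_model_string` and `build_pipeline` are hypothetical helpers, not part of onnx_transformers.

```python
from typing import Any, Optional


def require_model_string(model: Optional[Any]) -> Optional[str]:
    """Return model if it is None or a str; otherwise raise TypeError."""
    if model is None or isinstance(model, str):
        return model
    raise TypeError(
        "onnx_transformers expects `model` to be a path or URL string, "
        f"got {type(model).__name__}"
    )


def build_pipeline(task: str, model: Optional[Any] = None) -> Any:
    # Imported lazily so the guard can be used without the package installed.
    from onnx_transformers import pipeline

    checked = require_model_string(model)
    if checked is None:
        # No model given: the pipeline picks its default for the task,
        # e.g. roberta-large-mnli for zero-shot-classification.
        return pipeline(task, onnx=True)
    return pipeline(task, model=checked, onnx=True)
```

For example, `build_pipeline("zero-shot-classification")` would fall back to the default model, while passing an already loaded model object raises a `TypeError` before any download starts.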

Benchmarks

Note: For some reason, ONNX is slow in Colab notebooks, so you won't notice any speed-up there. Benchmark it on your own hardware.
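To benchmark on your own hardware, a simple timing loop is usually enough for a first impression. The following is a minimal sketch, not the method used in the benchmark notebook: `benchmark` and `compare_onnx_and_torch` are hypothetical helpers, and the run counts are arbitrary assumptions.

```python
import statistics
import time
from typing import Any, Callable, Dict


def benchmark(
    fn: Callable[[Any], Any], sample: Any, n_runs: int = 20, warmup: int = 3
) -> Dict[str, float]:
    """Time fn(sample) over n_runs calls and return latency stats in ms."""
    # Warm-up calls are executed but not timed.
    for _ in range(warmup):
        fn(sample)
    timings_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn(sample)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(timings_ms),
        "median_ms": statistics.median(timings_ms),
        "min_ms": min(timings_ms),
        "max_ms": max(timings_ms),
    }


def compare_onnx_and_torch(text: str) -> None:
    # Imported lazily so that `benchmark` works without the package installed.
    from onnx_transformers import pipeline

    onnx_nlp = pipeline("sentiment-analysis", onnx=True)
    torch_nlp = pipeline("sentiment-analysis", onnx=False)
    print("onnx :", benchmark(onnx_nlp, text))
    print("torch:", benchmark(torch_nlp, text))
```

Calling `compare_onnx_and_torch("Transformers and onnx runtime is an awesome combo!")` prints latency statistics for both backends. Results vary with hardware, input length, and model, so treat them as indicative only.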

For detailed benchmarks and other information refer to this blog post and notebook.

To benchmark the pipelines in this repo, see the benchmark_pipelines notebook.

(Note: These are not yet comprehensive benchmarks.)

Benchmark feature-extraction pipeline

Benchmark question-answering pipeline