nebuly-ai / optimate

A collection of libraries to optimise AI model performance
https://www.nebuly.com/
Apache License 2.0

How to generate and perform inference for an ONNX model #350

Open ghost opened 1 year ago

ghost commented 1 year ago

Thanks for the awesome work! Currently I've been struggling with an issue while working with Speedster, which I will lay out below:

1. I've been able to optimize an ONNX model (exported from Hugging Face, and based on Donut: https://github.com/clovaai/donut).

code used:

import numpy as np
import torch
from speedster import optimize_model, save_model

# Provide input data for the model: a list of ((inputs, ...), label) tuples.
# First input: int64 decoder input ids of shape (5, 3); second input:
# float32 encoder hidden states of shape (5, 3, 1024).
input_data = [
    (
        (
            np.array(torch.randn(5, 3), dtype=np.int64),
            np.array(torch.randn(5, 3, 1024), dtype=np.float32),
        ),
        torch.tensor([0, 1, 0, 1, 1]),
    )
    for _ in range(100)
]

# Run the Speedster optimization
optimized_model = optimize_model(
    "./models/onnx/decoder_model.onnx",
    input_data=input_data,
    optimization_time="unconstrained",
    device="gpu:0",
    metric_drop_ths=0.8,
)

save_model(optimized_model, "./models/speedster")

output:

2023-07-19 14:22:43 | INFO     | Running Speedster on GPU:0
2023-07-19 14:25:33 | INFO     | Benchmark performance of original model
2023-07-19 14:26:10 | INFO     | Original model latency: 0.023933820724487305 sec/iter
2023-07-19 14:26:11 | INFO     | [1/1] Running ONNX Optimization Pipeline
2023-07-19 14:26:11 | INFO     | Optimizing with ONNXCompiler and q_type: None.
2023-07-19 14:26:14 | WARNING  | TensorrtExecutionProvider for onnx is not available. If you want to use it, please  add the path to tensorrt to the LD_LIBRARY_PATH environment variable. CUDA provider will be used instead. 
2023-07-19 14:26:16 | INFO     | Optimized model latency: 0.02505326271057129 sec/iter
2023-07-19 14:26:16 | INFO     | Optimizing with ONNXCompiler and q_type: QuantizationType.HALF.
2023-07-19 14:26:44 | INFO     | Optimized model latency: 0.3438906669616699 sec/iter
2023-07-19 14:26:44 | INFO     | Optimizing with ONNXTensorRTCompiler and q_type: None.
2023-07-19 14:28:18 | INFO     | Optimized model latency: 0.004456996917724609 sec/iter
2023-07-19 14:28:18 | INFO     | Optimizing with ONNXTensorRTCompiler and q_type: QuantizationType.HALF.
2023-07-19 14:28:51 | INFO     | Optimized model latency: 0.003861665725708008 sec/iter
2023-07-19 14:28:51 | INFO     | Optimizing with ONNXTensorRTCompiler and q_type: QuantizationType.STATIC.
2023-07-19 14:33:56 | INFO     | Optimized model latency: 0.004480838775634766 sec/iter

[Speedster results on Tesla V100-SXM2-16GB]
Metric       Original Model    Optimized Model    Improvement
-----------  ----------------  -----------------  -------------
backend      NUMPY             TensorRT
latency      0.0239 sec/batch  0.0039 sec/batch   6.20x
throughput   208.91 data/sec   1294.78 data/sec   6.20x
model size   743.98 MB         254.43 MB          -65%
metric drop                    0.5291
techniques                     fp16
2. Now I am hitting a wall when trying to perform inference. code used:
import torch
from speedster import load_model

optimized_model = load_model("../opt/models/speedster/")
print('speedster onnx model loaded')

device = "cuda" if torch.cuda.is_available() else "cpu"
dummy_input = torch.randn(1, 3, 300, 400, dtype=torch.float).to(device)
print(type(dummy_input))

output = optimized_model(dummy_input)
print(output)

observation:

2023-07-19 14:35:43 | WARNING  | Debug: Got extra keywords in NvidiaInferenceLearner::from_engine_path: {'class_name': 'NumpyONNXTensorRTInferenceLearner', 'module_name': 'nebullvm.operations.inference_learners.tensor_rt'}
speedster onnx model loaded
<class 'torch.Tensor'>
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-ea33d0034b2d> in <cell line: 20>()
     18 
     19 # Use the accelerated version of your ONNX model in production
---> 20 output = optimized_model(dummy_input)
     21 print(output)

5 frames
/usr/local/lib/python3.10/dist-packages/polygraphy/cuda/cuda.py in dtype(self, new)
    296     def dtype(self, new):
    297         self._dtype = new
--> 298         self.itemsize = np.dtype(new).itemsize
    299 
    300     @property

TypeError: Cannot interpret 'torch.float32' as a data type
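For what it's worth, the failing polygraphy line seems to boil down to NumPy being handed a torch dtype, which it cannot interpret. The error reproduces in isolation:

import numpy as np
import torch

# np.dtype() only understands NumPy-compatible dtype specifiers, so a
# torch.dtype raises the same TypeError seen in the traceback above.
np.dtype(torch.float32)  # TypeError: Cannot interpret 'torch.float32' as a data type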

So my question is: what types of parameters should I be passing to the optimized_model() call here? Previously, I'd been passing the following to the original model to get it working:

def run_prediction(test_sample, model=model, processor=processor):
    pixel_values = processor(test_sample, return_tensors="pt").pixel_values
    task_prompt = "<s>"
    decoder_input_ids = processor.tokenizer(task_prompt, add_special_tokens=False, return_tensors="pt").input_ids
    outputs = model.generate(
        pixel_values.to(device),
        decoder_input_ids=decoder_input_ids.to(device),
        max_length=model.decoder.config.max_position_embeddings,
        early_stopping=True,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
        use_cache=False,
        num_beams=1,
        bad_words_ids=[[processor.tokenizer.unk_token_id]],
        return_dict_in_generate=True,
    )
    prediction = processor.batch_decode(outputs.sequences)[0]
    prediction = processor.token2json(prediction)
    return prediction 
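Based on the class name in the warning above (NumpyONNXTensorRTInferenceLearner), my guess is that the reloaded model wants plain NumPy arrays rather than torch tensors, shaped like the input_data tuples from the optimization step. A minimal sketch of what I assume the call should look like (shapes and dtypes copied from my optimization script, so they may need adjusting):

import numpy as np
from speedster import load_model

optimized_model = load_model("./models/speedster")

# Assumed to mirror the optimization-time inputs: int64 ids of shape
# (5, 3) and float32 hidden states of shape (5, 3, 1024).
decoder_input_ids = np.random.randint(0, 100, size=(5, 3)).astype(np.int64)
encoder_hidden_states = np.random.randn(5, 3, 1024).astype(np.float32)

# Pass the inputs positionally as NumPy arrays so polygraphy sees
# NumPy dtypes instead of torch dtypes.
output = optimized_model(decoder_input_ids, encoder_hidden_states)
print(output)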

Please let me know if you require additional information. Thanks.

ghost commented 1 year ago

I've been able to make some progress with optimizing the ONNX model, but I'm getting some errors when reaching the Speedster optimization stage. Please also find my Colab link below: https://github.com/dneemuth/saversbasket/blob/main/Optimizing_Transformers_Speedster.ipynb

ghost commented 1 year ago

@mfumanelli Sorry for the interruption, but I was hoping you could point me in the right direction. I've been struggling with an issue while trying to optimize an ONNX model via Speedster, and I might be doing something wrong here. I already have a script to replicate the issue on my Google Colab account if you want to have a look. Thanks.

https://colab.research.google.com/drive/1eHYU0dKcM-ms3oL2pH6YWQ_qrWDSLWYH?usp=sharing