I export the Hugging Face BART model with torch.onnx.export() and get an ONNX model.
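For reference, a minimal sketch of the export step (the exact checkpoint and export arguments are not shown above; the checkpoint name, input names, and dummy shapes here are placeholders):

```python
import torch
from transformers import BartModel

# Placeholder checkpoint and dummy input; the real export arguments may differ.
model = BartModel.from_pretrained("facebook/bart-base").eval()
dummy_ids = torch.ones(1, 128, dtype=torch.long)

torch.onnx.export(
    model,
    (dummy_ids,),
    "model.onnx",
    input_names=["input_ids"],
    output_names=["last_hidden_state"],
    dynamic_axes={"input_ids": {0: "batch", 1: "sequence"}},  # allow variable sequence length
    opset_version=13,
)
```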
And then optimize it with:
```sh
onnxpath="/home/sysadmin/downlaod/onnx_models/superreducednmbart/model.onnx"
modelpath=/home/sysadmin/downlaod/onnx_models/optmbarts.onnx
python -m onnxruntime.transformers.optimizer --input ${onnxpath} --output ${modelpath} \
    --float16 --use_gpu --opt_level 99 --model_type gpt2 --input_int32
```
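The same optimization can also be driven from Python; a sketch assuming the onnxruntime.transformers optimizer API (the num_heads and hidden_size values are guesses for a base-size BART):

```python
from onnxruntime.transformers import optimizer

# Sketch only: num_heads and hidden_size are assumptions for a base-size model.
opt_model = optimizer.optimize_model(
    "model.onnx",
    model_type="gpt2",
    num_heads=12,
    hidden_size=768,
    opt_level=99,
    use_gpu=True,
)
opt_model.convert_float_to_float16()            # mirrors --float16
opt_model.save_model_to_file("optmbarts.onnx")  # mirrors --output
```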
And run inference with:
```python
import time
import onnxruntime

def evaluate_onnx(model_path, input_ort):
    execution_providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
    sess_options = onnxruntime.SessionOptions()
    sess_options.execution_mode = onnxruntime.ExecutionMode.ORT_SEQUENTIAL
    sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
    session = onnxruntime.InferenceSession(model_path, sess_options, providers=execution_providers)
    assert 'CUDAExecutionProvider' in session.get_providers()
    latency1 = []  # per-run latencies in seconds
    for i in range(20):
        start = time.time()
        session.run(None, input_ort)
        latency1.append(time.time() - start)
    print(latency1)
```
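input_ort is not defined in the snippet above; a sketch of how the feed might be built (the input name is an assumption, and the dtype must be int32 because the model was converted with --input_int32):

```python
import numpy as np

# Hypothetical feed: the key must match the exported graph's input name.
seq_len = 128
input_ort = {"input_ids": np.ones((1, seq_len), dtype=np.int32)}

evaluate_onnx("/home/sysadmin/downlaod/onnx_models/optmbarts.onnx", input_ort)
```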
The versions of onnxruntime and PyTorch are both the latest. The platform is an A100 GPU on Linux.
- latency for the Hugging Face model: 0.02s
- latency for the optimized ONNX model: 0.04s
I noticed that when the sequence length is more than 200, the optimized ONNX model is faster, but for shorter sequence lengths the Hugging Face model has lower latency.
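For comparison, a sketch of how the Hugging Face baseline could be timed (the original measurement code is not shown; model and input_ids are the same placeholders as above). CUDA kernels launch asynchronously, so torch.cuda.synchronize() is needed for a fair wall-clock measurement:

```python
import time
import torch

# Hypothetical baseline timing for the eager PyTorch model on GPU.
model = model.cuda().eval()
input_ids = torch.ones(1, 128, dtype=torch.long, device="cuda")

latencies = []
with torch.no_grad():
    for _ in range(20):
        torch.cuda.synchronize()
        start = time.time()
        model(input_ids)
        torch.cuda.synchronize()  # wait for the GPU before reading the clock
        latencies.append(time.time() - start)
print(latencies)
```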