microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Onnxruntime-gpu is slower than CPU mode #6799

Open SabraHashemi opened 3 years ago

SabraHashemi commented 3 years ago

onnxruntime 1.2.0

```
Model loaded
Time taken for PyTorch model 0:00:00.975350
Output size torch.Size([1, 112, 112, 2])
Model ran successfully
Model converted successfully
Model checked successfully
CPU
Time taken for Onnx model 0:00:00.178522
```

with onnxruntime-gpu 1.2.0

```
Model loaded
Time taken for PyTorch model 0:00:00.755978
Output size torch.Size([1, 112, 112, 2])
Model ran successfully
Model converted successfully
Model checked successfully
GPU
Time taken for Onnx model 0:00:00.617351
```

I have read all the other similar threads, but the problem is still not solved or explained.

SabraHashemi commented 3 years ago

My output for this script: `Time taken for Onnx model 0:00:00.250296`


```python
import time
from datetime import datetime

import cv2
import torch
import onnxruntime as rt

import craft_utils
import imgproc

sess = rt.InferenceSession("craft.onnx")
input_name = sess.get_inputs()[0].name
print(rt.get_device())
first_output_name = sess.get_outputs()[0].name

print('\n')
print('\n')
print('input_name', input_name)
print('output_name', first_output_name)
print('\n')
print('\n')

img = cv2.imread('./data/1.jpg')
img_resized, target_ratio, size_heatmap = imgproc.resize_aspect_ratio(
    img, 1280, interpolation=cv2.INTER_LINEAR, mag_ratio=1.5)
ratio_h = ratio_w = 1 / target_ratio

print(ratio_h, ratio_w)

x = imgproc.normalizeMeanVariance(img_resized)
x = torch.from_numpy(x).permute(2, 0, 1)  # [h, w, c] to [c, h, w]
x = x.unsqueeze(0)                        # [c, h, w] to [b, c, h, w]

t1 = datetime.now()
y, _ = sess.run(None, {input_name: x.numpy()})
t2 = datetime.now()
print("Time taken for Onnx model", str(t2 - t1))

# Make score and link maps
score_text = y[0, :, :, 0]
score_link = y[0, :, :, 1]

# Post-processing
boxes, polys = craft_utils.getDetBoxes(score_text, score_link, 0.5, 0.3, 0.3, True)

print(boxes)

boxes = craft_utils.adjustResultCoordinates(boxes, ratio_w, ratio_h)
polys = craft_utils.adjustResultCoordinates(polys, ratio_w, ratio_h)
print(boxes)
```
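One caveat with the measurement in this script: `datetime.now()` measures wall-clock time and can have coarse resolution on some platforms; `time.perf_counter()` is the usual choice for latency measurement because it is monotonic and high-resolution. A minimal, self-contained sketch of the pattern (the `work` function here is just a stand-in for the real `sess.run` call, not part of the original script):

```python
import time

def time_once(fn):
    """Time a single call of fn using a monotonic high-resolution clock."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def work():
    # Stand-in workload in place of sess.run(...); any deterministic work will do.
    return sum(i * i for i in range(100_000))

print(f"Time taken: {time_once(work):.6f} s")
```

Note that even with a better clock, a single timed call is still dominated by one-time setup on the first run.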

tianleiwu commented 3 years ago

@sabrabano0, could you try the following:

1. Use a warm-up query before measuring latency; that is, exclude the first call to `sess.run(...)` from the measurement.
2. After warming up, send N calls to `sess.run` (e.g. N=1000) and report latency statistics (such as the average).
3. Try I/O binding. The API you used copies input tensors to the GPU and copies output tensors back to the CPU. If that I/O time is included, the comparison is not fair against PyTorch (whose inputs/outputs stay on the GPU) or against the CPU provider (which needs no such copies).
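The warm-up and averaging suggestions above can be sketched as a small benchmark harness. This is a minimal illustration using only the standard library; the `run` callable stands in for a prepared `sess.run(...)` call (the dummy workload and parameter values are illustrative assumptions, not from the thread):

```python
import statistics
import time

def benchmark(run, warmup=3, iters=100):
    """Measure latency of run(), excluding warm-up calls.

    The first `warmup` calls are discarded so that one-time initialization
    (e.g. CUDA context and kernel setup on the GPU build) is not counted.
    """
    for _ in range(warmup):
        run()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run()
        samples.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "min": min(samples),
        "max": max(samples),
    }

# Dummy workload in place of sess.run(None, {input_name: x.numpy()}):
stats = benchmark(lambda: sum(range(10_000)), warmup=2, iters=50)
print(stats)
```

Reporting the median and min alongside the mean helps separate steady-state latency from occasional outliers.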

purvang3 commented 3 years ago

> @sabrabano0, could you try the following: (1) Use a warm-up query before measuring latency; that is, exclude the first call to `sess.run(...)`. (2) After warming up, send N calls to `sess.run` (e.g. N=1000) and report latency statistics (such as the average). (3) Try I/O binding. The API you used copies input tensors to the GPU and copies output tensors back to the CPU, which is not a fair comparison against PyTorch or the CPU provider.

The first suggestion (the warm-up) should fix the issue.