hdnh2006 opened this issue 1 year ago
I was finally able to run it locally using the Docker container from Ultralytics. However, I only get a speedup of 1.35x (according to your library). This is not faster than the standard TensorRT export provided by Ultralytics:

```bash
python export.py --include engine --device 0
```

Is there something I am missing?
Hi @hdnh2006,
thank you for the contribution. Happy to assist you and accelerate your model together.
We ran some tests on YOLOv5 again in this Colab notebook and got the following results:
And regarding your first message, about local testing: did you apply the `optimize_model` function as well? Thanks @diegofiori for your cooperation.
This notebook is not comparing against the optimized model obtained after applying the following code:
```python
class OptimizedYolo(torch.nn.Module):
    def __init__(self, optimized_core, head_layer):
        super().__init__()
        self.core = optimized_core
        self.head = head_layer

    def forward(self, x, *args, **kwargs):
        x = list(self.core(x))  # the optimized core returns a tuple
        return self.head(x)

final_core = OptimizedYolo(model_optimized, last_layer)
model.model.model = final_core
```
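For reference, my understanding of this pattern is: the core (everything before the Detect layer) goes through `optimize_model`, and the Detect head is reattached in plain PyTorch. Here is a self-contained toy sketch of the same pattern (the toy modules and all names are mine, not the real YOLOv5 ones; the `input_data` format follows the Speedster README):

```python
import torch
from speedster import optimize_model

class ToyCore(torch.nn.Module):
    """Stand-in for the YOLOv5 core: returns a tuple of feature maps."""
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 8, 3, padding=1)

    def forward(self, x):
        y = self.conv(x)
        return y, y * 2  # tuple output, like the real core

class ToyHead(torch.nn.Module):
    """Stand-in for the Detect layer, kept in plain PyTorch."""
    def forward(self, feats):
        return feats[0] + feats[1]

core, head = ToyCore().eval(), ToyHead().eval()

# input_data format from the Speedster README: ((inputs,), label) pairs
input_data = [((torch.randn(1, 3, 384, 640),), torch.tensor([0])) for _ in range(20)]

optimized_core = optimize_model(
    model=core,
    input_data=input_data,
    optimization_time="unconstrained",
)

final_model = OptimizedYolo(optimized_core, head)  # wrapper from above
output = final_model(torch.randn(1, 3, 384, 640))
```

At least, that is how I am reproducing it.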
Why is it done like this? Maybe this is my mistake, because I am applying my measurement code after those lines:
```python
import numpy as np
import torch

# `model` and `device` are defined earlier in the notebook
dummy_input = torch.randn(1, 3, 384, 640, dtype=torch.float).to(device)

# Init CUDA event loggers
starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
repetitions = 100
no_data_write_timings = np.zeros((repetitions, 1))

# GPU warm-up
for _ in range(10):
    _ = model(dummy_input)

# Measure performance without data transfers
with torch.no_grad():
    for rep in range(repetitions):
        starter.record()
        # dummy_input_on_device = dummy_input.to(device)
        outputs = model(dummy_input)
        ender.record()
        torch.cuda.synchronize()  # wait for the GPU to finish this pass
        curr_time = starter.elapsed_time(ender)  # milliseconds
        no_data_write_timings[rep] = curr_time

mean_no_data_write_syn = np.sum(no_data_write_timings) / repetitions
std_no_data_write_syn = np.std(no_data_write_timings)

print('Optimized model results WITHOUT data transfers:')
print('The Optimized model mean batch inference time is: ' + str(mean_no_data_write_syn))
print('The Optimized model std batch inference time is: ' + str(std_no_data_write_syn))
```
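As an aside, I factored the loop above into a helper so both models can be measured identically (`benchmark` is my own name, not something from either library):

```python
import numpy as np
import torch

def benchmark(model, dummy_input, repetitions=100, warmup=10):
    """Time forward passes with CUDA events, excluding host<->device copies."""
    starter = torch.cuda.Event(enable_timing=True)
    ender = torch.cuda.Event(enable_timing=True)
    timings = np.zeros(repetitions)
    with torch.no_grad():
        for _ in range(warmup):  # GPU warm-up
            model(dummy_input)
        for rep in range(repetitions):
            starter.record()
            model(dummy_input)
            ender.record()
            torch.cuda.synchronize()  # wait for the GPU to finish this pass
            timings[rep] = starter.elapsed_time(ender)  # milliseconds
    return timings.mean(), timings.std()

# mean_ms, std_ms = benchmark(final_core, dummy_input)
```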
Then, what is the final model? I mean, I want to replace this line of code with your optimized model: https://github.com/ultralytics/yolov5/blob/cdd804d39ff84b413bde36a84006f51769b6043b/detect.py#L98

What should I put in its place?
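My best guess, mirroring the assignment above, is something like the following (a sketch based on my assumptions about the attribute layout; `model_optimized` and `last_layer` come from the earlier snippet):

```python
# Hypothetical replacement around the linked line in detect.py
model = DetectMultiBackend(weights, device=device, dnn=dnn, data=data, fp16=half)
last_layer = model.model.model[-1]  # Detect head (assumption about the layout)
model.model.model = OptimizedYolo(model_optimized, last_layer)
```

Is that the intended usage?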
I tried locally on an RTX 2060 and got these results:
```
Optimized model by Ultralytics results WITHOUT data transfers:
The Optimized model by Ultralytics mean batch inference time is: 1.3865331208705902
The Optimized model by Ultralytics std batch inference time is: 0.011859980193208422

Optimized model results WITHOUT data transfers:
The Optimized model mean batch inference time is: 3.081862072944641
The Optimized model std batch inference time is: 0.14880756351818578
```
😞😞😞😞😞
I tried again with your modified Google Colab notebook, and it is true that it is about 2x faster than the original PyTorch version, but it is still slower than the TensorRT engine provided by Ultralytics:
```
Original model results WITHOUT data transfers:
The Original model mean batch inference time is: 7.7102195215225215
The Original model std batch inference time is: 0.967732421795685

Optimized model results WITHOUT data transfers:
The Optimized model mean batch inference time is: 3.6612223982810974
The Optimized model std batch inference time is: 0.11819371827359895

Optimized model by Ultralytics results WITHOUT data transfers:
The Optimized model by Ultralytics mean batch inference time is: 1.5688025617599488
The Optimized model by Ultralytics std batch inference time is: 0.03936980002624129
```
Check the notebook I modified here: https://colab.research.google.com/drive/1Nde0tCx28g3BTe2nxfLhTCMgIreqcvtw?usp=sharing
Hello @hdnh2006,

Regarding the different performance with respect to the Ultralytics implementation, I think this can be due to the input you are giving to the `optimize_model` function. In fact, when the `metric_drop_ths` parameter is not given, Speedster by default keeps the model in full 32-bit precision. Speedster supports both `fp16` and `int8` precision, but you have to activate them by passing the `metric_drop_ths` parameter to the `optimize_model` function.
With fp16 precision in Speedster I am getting 1.187 ms of inference time. I'm waiting for the int8 result.
OK, so what values should I put to get fp16 precision? I can see in the documentation that `metric_drop_ths` is a float:

https://github.com/nebuly-ai/nebullvm/blob/8aacdd7593746fd3cb71e6575847f028c9f6193d/apps/accelerate/speedster/speedster/api/functions.py#L86
```python
model_optimized = optimize_model(
    model=core_wrapper,
    input_data=input_data,
    optimization_time="unconstrained",
    metric_drop_ths=0.1,
)
```
I tried on a Tesla V100 32GB and these are the results I obtained:
```python
model_optimized = optimize_model(
    model=core_wrapper,
    input_data=input_data,
    optimization_time="unconstrained",
    metric_drop_ths=0.1,
)
```
```
2023-01-12 19:11:22 | INFO | Running Speedster on GPU
2023-01-12 19:11:23 | WARNING | Missing Frameworks: tensorflow. Please install them to include them in the optimization pipeline.
2023-01-12 19:11:25 | INFO | Benchmark performance of original model
2023-01-12 19:11:26 | INFO | Original model latency: 0.005212109088897705 sec/iter
2023-01-12 19:11:27 | INFO | Optimizing with PytorchBackendCompiler and q_type: None.
2023-01-12 19:11:29 | INFO | Optimized model latency: 0.0037038326263427734 sec/iter
2023-01-12 19:11:29 | INFO | Optimizing with PytorchBackendCompiler and q_type: QuantizationType.HALF.
2023-01-12 19:11:29 | WARNING | Unable to trace model with torch.fx
2023-01-12 19:11:31 | INFO | Optimized model latency: 0.0037364959716796875 sec/iter
2023-01-12 19:11:31 | INFO | Optimizing with ONNXCompiler and q_type: None.
2023-01-12 19:11:33 | INFO | Optimized model latency: 0.006529092788696289 sec/iter
2023-01-12 19:11:33 | INFO | Optimizing with ONNXCompiler and q_type: QuantizationType.DYNAMIC.
2023-01-12 19:11:40 | WARNING | The optimized model will be discarded due to poor results obtained with the given metric.
2023-01-12 19:11:40 | INFO | Optimizing with ONNXCompiler and q_type: QuantizationType.HALF.
2023-01-12 19:11:43 | INFO | Optimized model latency: 0.005489349365234375 sec/iter
2023-01-12 19:11:43 | INFO | Optimizing with ONNXCompiler and q_type: QuantizationType.STATIC.
2023-01-12 19:11:58 | WARNING | The optimized model will be discarded due to poor results obtained with the given metric.
2023-01-12 19:11:58 | INFO | Optimizing with ONNXTensorRTCompiler and q_type: None.
2023-01-12 19:12:22 | INFO | Optimized model latency: 0.00531458854675293 sec/iter
2023-01-12 19:12:22 | INFO | Optimizing with ONNXTensorRTCompiler and q_type: QuantizationType.HALF.
2023-01-12 19:13:43 | WARNING | The optimized model will be discarded due to poor results obtained with the given metric.
2023-01-12 19:13:43 | INFO | Optimizing with ONNXTensorRTCompiler and q_type: QuantizationType.STATIC.
2023-01-12 19:16:53 | WARNING | The optimized model will be discarded due to poor results obtained with the given metric.

[ Speedster results on GPU ]
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Metric                                    ┃ Original Model   ┃ Optimized Model   ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━┫
┃ backend                                   ┃ PYTORCH          ┃ TorchScript       ┃
┃ latency                                   ┃ 0.0052 sec/batch ┃ 0.0037 sec/batch  ┃
┃ throughput                                ┃ 191.86 data/sec  ┃ 269.99 data/sec   ┃
┃ model size                                ┃ 35.18 MB         ┃ 28.38 MB          ┃
┃ metric drop (compute_relative_difference) ┃                  ┃ 0                 ┃
┃ speedup                                   ┃                  ┃ 1.41x             ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━┛
```
```
Optimized model results WITHOUT data transfers:
The Optimized model mean batch inference time is: 3.505920317173004
The Optimized model std batch inference time is: 0.1691654167164366

Optimized model by Ultralytics results WITHOUT data transfers:
The Optimized model by Ultralytics mean batch inference time is: 1.442447043657303
The Optimized model by Ultralytics std batch inference time is: 0.05623926929897324
```
There is definitely something I am doing wrong, but I am following all the steps you provide. Anyway, as you can see in this notebook, the same happens on a Tesla T4.
Ciao Diego,
I have tried your solution in several environments, but it seems hard to keep all the package versions correct. I was finally able to run your notebook for YOLOv5 in Google Colab, but I can't see any improvement using my method of measuring performance, which uses the code I posted above. Unfortunately, it has also been really hard to run it locally, and I am still getting several errors; for example, when I try to install your library, and when I try to run it the way you measure performance, I get the following error:
So there are several things here: locally I measure with torch.cuda.synchronize(), but in Google Colab the best model selected is based on TensorRT. I think the best way is to try your optimization using a container. Do you have a container where we can try Speedster?
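In the meantime, would saving the optimized model on one machine and loading it on another avoid rebuilding the toolchain everywhere? A sketch, assuming `save_model`/`load_model` behave as described in the Speedster README (and keeping in mind that a TensorRT engine is tied to the GPU it was built on):

```python
from speedster import save_model, load_model

# On the machine where the optimization ran:
save_model(model_optimized, "optimized_yolo_core")

# Later, on a machine with the same GPU and runtime:
optimized_core = load_model("optimized_yolo_core")
```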
Thanks in advance.