nebuly-ai / optimate

A collection of libraries to optimise AI model performances
https://www.nebuly.com/
Apache License 2.0

Is there a way to try speedster in docker container? #148

Open hdnh2006 opened 1 year ago

hdnh2006 commented 1 year ago

Hi Diego,

I have tried your solution in several environments, but it seems hard to keep all the package versions correct.

I was finally able to run your YOLOv5 notebook in Google Colab, but I can't see any improvement when I measure performance with my own method, which uses the following code:

import numpy as np
import torch

# `model` and `device` are assumed to be defined earlier in the notebook
dummy_input = torch.randn(1, 3, 640, 640, dtype=torch.float).to(device)

# INIT LOGGERS
starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
repetitions = 100
no_data_write_timings=np.zeros((repetitions,1))

#GPU-WARM-UP
for _ in range(10):
    _ = model(dummy_input)
# MEASURE PERFORMANCE WITHOUT DATA TRANSFERS
with torch.no_grad():
    for rep in range(repetitions):
        starter.record()
        #dummy_input_on_device = dummy_input.to(device)
        outputs = model(dummy_input)
        ender.record()
        # WAIT FOR GPU SYNC
        torch.cuda.synchronize()
        curr_time = starter.elapsed_time(ender)
        no_data_write_timings[rep] = curr_time

mean_no_data_write_syn = np.sum(no_data_write_timings) / repetitions
std_no_data_write_syn = np.std(no_data_write_timings)

print('Optimized model results WITHOUT data transfers:')
print('The Optimized model mean batch inference time is:' +  str(mean_no_data_write_syn))
print('The Optimized model std batch inference time is:' +  str(std_no_data_write_syn))

Unfortunately it has been really hard to run it locally and I am still getting several errors. For example, when I try to install your library I get:

2023-01-12 14:00:14 | WARNING  | Unable to install tensor_rt on this platform. The compiler will be skipped. 
2023-01-12 14:00:14 | INFO     | Trying to install deepsparse on the platform...

And when I try to measure performance the way you do:

import time

times = []
for _ in range(100):
    st = time.time()
    results = model("zidane.jpg")  # imgs[0] gave the same error
    times.append((time.time() - st) * 1000)
yolo_optimized_time = sum(times) / len(times)
print(f"Average prediction time: {yolo_optimized_time} ms")

I get the following error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[30], line 4
      2 for _ in range(100):
      3     st = time.time()
----> 4     results = model("zidane.jpg")
      5     times.append((time.time() - st)*1000)
      6 yolo_optimized_time = sum(times) / len(times)

File ~/.virtualenvs/speedster/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/.virtualenvs/speedster/lib/python3.8/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
     24 @functools.wraps(func)
     25 def decorate_context(*args, **kwargs):
     26     with self.clone():
---> 27         return func(*args, **kwargs)

File ~/.cache/torch/hub/ultralytics_yolov5_master/models/common.py:705, in AutoShape.forward(self, ims, size, augment, profile)
    702 with amp.autocast(autocast):
    703     # Inference
    704     with dt[1]:
--> 705         y = self.model(x, augment=augment)  # forward
    707     # Post-process
    708     with dt[2]:

File ~/.virtualenvs/speedster/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/.cache/torch/hub/ultralytics_yolov5_master/models/common.py:515, in DetectMultiBackend.forward(self, im, augment, visualize)
    512     im = im.permute(0, 2, 3, 1)  # torch BCHW to numpy BHWC shape(1,320,192,3)
    514 if self.pt:  # PyTorch
--> 515     y = self.model(im, augment=augment, visualize=visualize) if augment or visualize else self.model(im)
    516 elif self.jit:  # TorchScript
    517     y = self.model(im)

File ~/.virtualenvs/speedster/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

Cell In[20], line 9, in OptimizedYolo.forward(self, x, *args, **kwargs)
      7 def forward(self, x, *args, **kwargs):
      8     x = list(self.core(x)) # it's a tuple
----> 9     return self.head(x)

File ~/.virtualenvs/speedster/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/.cache/torch/hub/ultralytics_yolov5_master/models/yolo.py:59, in Detect.forward(self, x)
     57 z = []  # inference output
     58 for i in range(self.nl):
---> 59     x[i] = self.m[i](x[i])  # conv
     60     bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)
     61     x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()

File ~/.virtualenvs/speedster/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/.virtualenvs/speedster/lib/python3.8/site-packages/torch/nn/modules/conv.py:463, in Conv2d.forward(self, input)
    462 def forward(self, input: Tensor) -> Tensor:
--> 463     return self._conv_forward(input, self.weight, self.bias)

File ~/.virtualenvs/speedster/lib/python3.8/site-packages/torch/nn/modules/conv.py:459, in Conv2d._conv_forward(self, input, weight, bias)
    455 if self.padding_mode != 'zeros':
    456     return F.conv2d(F.pad(input, self._reversed_padding_repeated_twice, mode=self.padding_mode),
    457                     weight, bias, self.stride,
    458                     _pair(0), self.dilation, self.groups)
--> 459 return F.conv2d(input, weight, bias, self.stride,
    460                 self.padding, self.dilation, self.groups)

RuntimeError: Input type (c10::Half) and bias type (float) should be the same
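
Reading the traceback, the error seems to come from a dtype mismatch: AutoShape runs inference under autocast, so the wrapped head receives fp16 activations while its convolution bias is still fp32. A possible workaround, only a sketch on my side and assuming the standard YOLOv5 AutoShape wrapper with its amp flag, would be:

# Assumption: `model` is the hub AutoShape model with the OptimizedYolo core swapped in.
# Disabling AMP keeps activations in fp32, matching the fp32 weights of the wrapped head.
model.amp = False  # AutoShape casts inputs to fp16 only when amp is True and the model is on GPU

# Alternative (also an assumption): cast the wrapped head to fp16 instead, so it
# accepts the half-precision activations produced under autocast.
# model.model.model.head = model.model.model.head.half()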

So there are several things here.

I think the best way would be to try your optimization inside a container. Do you have a container image where we can try speedster?

Thanks in advance.

hdnh2006 commented 1 year ago

I was finally able to run it locally using the Docker container from Ultralytics.

However, I only get a speedup (according to your library) of 1.35x:

[screenshot of the Speedster results showing the 1.35x speedup]

This optimization is not faster than the standard TensorRT engine exported by Ultralytics with python export.py --include engine --device 0.
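
For a like-for-like comparison, the exported engine can be timed with the same CUDA-event loop used above. This is only a sketch: the engine path, the default 640x640 export shape, and loading through the YOLOv5 'custom' hub entry point are assumptions on my side.

import torch

device = torch.device("cuda")
# torch.hub wraps the engine in AutoShape around DetectMultiBackend
trt_model = torch.hub.load("ultralytics/yolov5", "custom", path="yolov5s.engine")
dummy_input = torch.randn(1, 3, 640, 640, device=device)

starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
with torch.no_grad():
    for _ in range(10):                   # GPU warm-up
        _ = trt_model.model(dummy_input)  # call DetectMultiBackend directly, skipping pre/post-processing
    starter.record()
    for _ in range(100):
        _ = trt_model.model(dummy_input)
    ender.record()
    torch.cuda.synchronize()
print("Mean batch inference time:", starter.elapsed_time(ender) / 100, "ms")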

Is there something I am missing?

diegofiori commented 1 year ago

Hi @hdnh2006,

Thank you for the contribution. We are happy to assist you and accelerate your model together.

We ran some tests on YOLOv5 again in this Colab and got the following results:

And regarding your first message

About local testing:

hdnh2006 commented 1 year ago

Thanks @diegofiori for your cooperation.

That notebook is not comparing against the optimized model after applying the following code:

class OptimizedYolo(torch.nn.Module):
    def __init__(self, optimized_core, head_layer):
        super().__init__()
        self.core = optimized_core  # speedster-optimized backbone + neck
        self.head = head_layer      # original Detect head, kept in PyTorch

    def forward(self, x, *args, **kwargs):
        x = list(self.core(x))  # the optimized core returns a tuple
        return self.head(x)

final_core = OptimizedYolo(model_optimized, last_layer)

model.model.model = final_core  # swap the wrapped core into the hub model
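
For context, here is a quick sketch of the hub model's wrapper hierarchy (these are the standard YOLOv5 classes, also visible in the traceback above; the notebook itself does not spell this out):

import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s")
print(type(model).__name__)              # AutoShape: pre-processing, inference and NMS
print(type(model.model).__name__)        # DetectMultiBackend: backend dispatcher
print(type(model.model.model).__name__)  # DetectionModel: backbone + Detect head
# so the assignment above replaces the DetectionModel with the OptimizedYolo wrapper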

Why is it like this? Maybe this is where my mistake is, because I am applying my timing code after these lines:

import numpy as np
import torch

dummy_input = torch.randn(1, 3, 384, 640, dtype=torch.float).to(device)

# INIT LOGGERS
starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
repetitions = 100
no_data_write_timings=np.zeros((repetitions,1))

#GPU-WARM-UP
for _ in range(10):
    _ = model(dummy_input)
# MEASURE PERFORMANCE WITHOUT DATA TRANSFERS
with torch.no_grad():
    for rep in range(repetitions):
        starter.record()
        #dummy_input_on_device = dummy_input.to(device)
        outputs = model(dummy_input)
        ender.record()
        # WAIT FOR GPU SYNC
        torch.cuda.synchronize()
        curr_time = starter.elapsed_time(ender)
        no_data_write_timings[rep] = curr_time

mean_no_data_write_syn = np.sum(no_data_write_timings) / repetitions
std_no_data_write_syn = np.std(no_data_write_timings)

print('Optimized model results WITHOUT data transfers:')
print('The Optimized model mean batch inference time is:' +  str(mean_no_data_write_syn))
print('The Optimized model std batch inference time is:' +  str(std_no_data_write_syn))

Then, what is the final model? I mean, I want to replace this line of code: https://github.com/ultralytics/yolov5/blob/cdd804d39ff84b413bde36a84006f51769b6043b/detect.py#L98

with your optimized model. What exactly should I put in its place?
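
Would something like this be the right direction? Just a sketch on my side, reusing the OptimizedYolo wrapper from above; model_optimized and last_layer would come from the speedster optimization step, and the rest of detect.py would stay unchanged.

# Keep the original DetectMultiBackend line from detect.py ...
model = DetectMultiBackend(weights, device=device, dnn=dnn, data=data, fp16=half)
# ... and swap only its inner DetectionModel with the optimized core + original head
model.model = OptimizedYolo(model_optimized, last_layer)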

hdnh2006 commented 1 year ago

I tried it locally on an RTX 2060 and these are the results I got (times in ms):

TensorRT optimization by Ultralytics:

Optimized model by Ultralytics results WITHOUT data transfers:
The Optimized model by Ultralytics mean batch inference time is: 1.3865331208705902
The Optimized model by Ultralytics std batch inference time is: 0.011859980193208422

Nebullvm optimization:

Optimized model results WITHOUT data transfers:
The Optimized model mean batch inference time is: 3.081862072944641
The Optimized model std batch inference time is: 0.14880756351818578

😞😞😞😞😞

hdnh2006 commented 1 year ago

I tried again with your modified Google Colab notebook, and it is true that it is about 2x faster than the original PyTorch version, but it is still slower than the TensorRT engine provided by Ultralytics (times in ms):

PyTorch model yolov5s

Original model results WITHOUT data transfers:
The Original model mean batch inference time is: 7.7102195215225215
The Original model std batch inference time is: 0.967732421795685

Nebullvm optimization:

Optimized model results WITHOUT data transfers:
The Optimized model mean batch inference time is: 3.6612223982810974
The Optimized model std batch inference time is: 0.11819371827359895

TensorRT by Ultralytics

Optimized model by Ultralytics results WITHOUT data transfers:
The Optimized model by Ultralytics mean batch inference time is: 1.5688025617599488
The Optimized model by Ultralytics std batch inference time is: 0.03936980002624129

Check the notebook I modified here: https://colab.research.google.com/drive/1Nde0tCx28g3BTe2nxfLhTCMgIreqcvtw?usp=sharing

diegofiori commented 1 year ago

Hello @hdnh2006 ,

Regarding the performance difference with respect to the Ultralytics implementation, I think this is due to the arguments you are passing to the optimize_model function. When the metric_drop_ths parameter is not given, speedster by default keeps the model in full 32-bit precision. Speedster supports both fp16 and int8 precision, but you have to enable them by passing the metric_drop_ths parameter to optimize_model.

diegofiori commented 1 year ago

With fp16 precision in speedster I am getting 1.187 ms of inference time. I'm waiting for the int8 result.

hdnh2006 commented 1 year ago

OK, so what value should I pass to get fp16 precision?

I can see in the documentation that metric_drop_ths is a float number: https://github.com/nebuly-ai/nebullvm/blob/8aacdd7593746fd3cb71e6575847f028c9f6193d/apps/accelerate/speedster/speedster/api/functions.py#L86

diegofiori commented 1 year ago

from speedster import optimize_model

# a non-zero metric_drop_ths lets speedster also try reduced-precision (fp16/int8) variants
model_optimized = optimize_model(
    model=core_wrapper,
    input_data=input_data,
    optimization_time="unconstrained",
    metric_drop_ths=0.1
)

hdnh2006 commented 1 year ago

I tried on a Tesla V100 32 GB and these are the results I obtained:

Nebullvm optimization:

model_optimized = optimize_model(
    model=core_wrapper,
    input_data=input_data,
    optimization_time="unconstrained",
    metric_drop_ths=0.1
)
2023-01-12 19:11:22 | INFO     | Running Speedster on GPU
2023-01-12 19:11:23 | WARNING  | Missing Frameworks: tensorflow.
 Please install them to include them in the optimization pipeline.
2023-01-12 19:11:25 | INFO     | Benchmark performance of original model
2023-01-12 19:11:26 | INFO     | Original model latency: 0.005212109088897705 sec/iter
2023-01-12 19:11:27 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-01-12 19:11:29 | INFO     | Optimized model latency: 0.0037038326263427734 sec/iter
2023-01-12 19:11:29 | INFO     | Optimizing with PytorchBackendCompiler and q_type: QuantizationType.HALF.
2023-01-12 19:11:29 | WARNING  | Unable to trace model with torch.fx
2023-01-12 19:11:31 | INFO     | Optimized model latency: 0.0037364959716796875 sec/iter
2023-01-12 19:11:31 | INFO     | Optimizing with ONNXCompiler and q_type: None.
2023-01-12 19:11:33 | INFO     | Optimized model latency: 0.006529092788696289 sec/iter
2023-01-12 19:11:33 | INFO     | Optimizing with ONNXCompiler and q_type: QuantizationType.DYNAMIC.
2023-01-12 19:11:40 | WARNING  | The optimized model will be discarded due to poor results obtained with the given metric.
2023-01-12 19:11:40 | INFO     | Optimizing with ONNXCompiler and q_type: QuantizationType.HALF.
2023-01-12 19:11:43 | INFO     | Optimized model latency: 0.005489349365234375 sec/iter
2023-01-12 19:11:43 | INFO     | Optimizing with ONNXCompiler and q_type: QuantizationType.STATIC.
2023-01-12 19:11:58 | WARNING  | The optimized model will be discarded due to poor results obtained with the given metric.
2023-01-12 19:11:58 | INFO     | Optimizing with ONNXTensorRTCompiler and q_type: None.
2023-01-12 19:12:22 | INFO     | Optimized model latency: 0.00531458854675293 sec/iter
2023-01-12 19:12:22 | INFO     | Optimizing with ONNXTensorRTCompiler and q_type: QuantizationType.HALF.
2023-01-12 19:13:43 | WARNING  | The optimized model will be discarded due to poor results obtained with the given metric.
2023-01-12 19:13:43 | INFO     | Optimizing with ONNXTensorRTCompiler and q_type: QuantizationType.STATIC.
2023-01-12 19:16:53 | WARNING  | The optimized model will be discarded due to poor results obtained with the given metric.

[ Speedster results on GPU]
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Metric                                    ┃ Original Model   ┃ Optimized Model   ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━┫
┃ backend                                   ┃ PYTORCH          ┃ TorchScript       ┃
┃ latency                                   ┃ 0.0052 sec/batch ┃ 0.0037 sec/batch  ┃
┃ throughput                                ┃ 191.86 data/sec  ┃ 269.99 data/sec   ┃
┃ model size                                ┃ 35.18 MB         ┃ 28.38 MB          ┃
┃ metric drop (compute_relative_difference) ┃                  ┃ 0                 ┃
┃ speedup                                   ┃                  ┃ 1.41x             ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━┛

Optimized model results WITHOUT data transfers:
The Optimized model mean batch inference time is: 3.505920317173004
The Optimized model std batch inference time is: 0.1691654167164366

TensorRT optimization by Ultralytics with half precision

Optimized model by Ultralytics results WITHOUT data transfers:
The Optimized model by Ultralytics mean batch inference time is: 1.442447043657303
The Optimized model by Ultralytics std batch inference time is: 0.05623926929897324

There is definitely something I am doing wrong, but I am following all the steps you provide.

Anyway, as you can see in the notebook, the same thing happens on a Tesla T4.