nebuly-ai / optimate

A collection of libraries to optimise AI model performances
https://www.nebuly.com/
Apache License 2.0

[Speedster] Running into "CUDA out of memory" on an A100 #301

Open michaelbogdan opened 1 year ago

michaelbogdan commented 1 year ago

I am trying Speedster to optimize inference of oasst-sft-1-pythia-12b on a rented A100 with 40GB of VRAM on Lambda Cloud. The code I use is pasted here:

from speedster import optimize_model, save_model
from transformers import GPTNeoXForCausalLM, AutoTokenizer

cache_directory = "./model_cache"

# load the 12B model in half precision to reduce its memory footprint
model = GPTNeoXForCausalLM.from_pretrained("OpenAssistant/oasst-sft-1-pythia-12b", cache_dir=cache_directory).half()
tokenizer = AutoTokenizer.from_pretrained("OpenAssistant/oasst-sft-1-pythia-12b", cache_dir=cache_directory)

# a single prompt, repeated 100 times as input data for the optimizer
text = "<|prompter|>Answer with as many words as possible: Do horses lay eggs?<|endoftext|><|assistant|>"
input_dict = tokenizer(text, return_tensors="pt")
input_data = [input_dict for _ in range(100)]

optimized_model = optimize_model(
    model,
    input_data=input_data,
    optimization_time="constrained",
    metric_drop_ths=0.05
)

save_model(optimized_model, "model_save_path")

However, I always run into error messages like:

2023-03-26 23:06:11 | INFO     | Running Speedster on GPU:0
2023-03-26 23:06:17 | WARNING  | Dynamic shape info has not been provided for the HuggingFace model. The resulting optimized model will be usable only with a fixed input shape. To optimize the model for dynamic shapes, please look here: https://nebuly.gitbook.io/nebuly/modules/speedster/how-to-guides#using-dynamic-shape.
2023-03-26 23:06:23 | INFO     | Benchmark performance of original model
2023-03-26 23:06:29 | INFO     | Original model latency: 0.03949021577835083 sec/iter
2023-03-26 23:06:40 | WARNING  | Exception raised during conversion from torch to onnx model. ONNX pipeline will be unavailable.
2023-03-26 23:07:08 | INFO     | [1/1] Running PyTorch Optimization Pipeline
2023-03-26 23:07:16 | INFO     | Optimizing with PyTorchTensorRTCompiler and q_type: None.
2023-03-26 23:07:44 | WARNING  | Optimization failed with DeepLearningFramework.PYTORCH interface of ModelCompiler.TENSOR_RT_TORCH. Got error [Error thrown at core/conversion/conversionctx/ConversionCtx.cpp:169] Building serialized network failed in TensorRT
. If possible the compilation will be re-scheduled with another interface. Please consult the documentation for further info or open an issue on GitHub for receiving assistance.

What are some possibilities I can try? I know that the model itself takes up about 25GiB of VRAM, so it doesn't fit into the GPU's memory twice. It seems like Speedster is not flushing the previous model out of memory.

I had the idea of loading the model into regular RAM and performing the optimizations there, but reading the docs I only found the device parameter, which controls the target for inference; there is no option for where the compilation takes place.
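
What I mean is something like this (a hypothetical sketch; as far as I can tell from the docs, device only selects the inference target, not where compilation happens):

# hypothetical sketch: this should target the CPU for inference,
# but it does not seem to control where the compilation itself runs
optimized_model = optimize_model(
    model,
    input_data=input_data,
    optimization_time="constrained",
    metric_drop_ths=0.05,
    device="cpu",
)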

Can you maybe help me?

valeriosofi commented 1 year ago

Hello @michaelbogdan, thanks for trying Speedster! Yep, your model takes up about 25GB of VRAM, and unfortunately the conversion to ONNX requests additional VRAM; that's why you see the warning Exception raised during conversion from torch to onnx model. ONNX pipeline will be unavailable. Because of that exception, Speedster can only try the PyTorch-based compilers, which are torchscript and torch_tensor_rt. To avoid that warning you should have at least 50GB of VRAM from my experiments (I tried on an A100-80GB). I can't find the CUDA out of memory error in the logs you provided, did it happen during the PyTorchTensorRTCompiler optimization? If so, you could simply try skipping that compiler by setting ignore_compilers=["torch_tensor_rt"], as sketched below.
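
Something like this (a minimal sketch, reusing the arguments from your script):

optimized_model = optimize_model(
    model,
    input_data=input_data,
    optimization_time="constrained",
    metric_drop_ths=0.05,
    ignore_compilers=["torch_tensor_rt"],  # skip the compiler that runs out of memory
)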

michaelbogdan commented 1 year ago

Hello @valeriosofi, thank you for replying! Looks like I didn't paste enough of the logs; judging by the timestamps, it was getting late after all. Since Lambda Cloud doesn't offer A100s with 80GiB of VRAM, I'll try again on another cloud. Should I get back to you?

I think the user should get a clearer warning about why the ONNX pipeline failed. While skimming the logs I skipped over the warning because no clear reason was provided.

valeriosofi commented 1 year ago

Yep, let me know if you are able to optimize it with more GPU memory! And you're right, we could print some additional details about the error when a conversion fails, thanks for the feedback! We'll work on it for the next release!

michaelbogdan commented 1 year ago

Hey @valeriosofi, another question while I am trying to get access to an A100-80GiB: I tried the Stable Diffusion example from the quickstart page on an A10 at Lambda Cloud, as shown below. The optimization isn't great, and it seems like TensorRT isn't available. Can you suggest things I could try to get it running?

import torch
from diffusers import StableDiffusionPipeline
from speedster import optimize_model, save_model

#1 Provide input model and data
model_id = "CompVis/stable-diffusion-v1-4"
device = "cuda" if torch.cuda.is_available() else "cpu"

if device == "cuda":
    # On GPU we load by default the model in half precision, because it's faster and lighter.
    pipe = StableDiffusionPipeline.from_pretrained(model_id, revision='fp16', torch_dtype=torch.float16)
else:
    pipe = StableDiffusionPipeline.from_pretrained(model_id)

# Create some example input data
input_data = [
    "a photo of an astronaut riding a horse on mars",
    "a monkey eating a banana in a forest",
    "white car on a road surrounded by palm trees",
    "a fridge full of bottles of beer",
    "madara uchiha throwing asteroids against people"
]

#2 Run Speedster optimization
optimized_model = optimize_model(
    model=pipe,
    input_data=input_data,
    optimization_time="constrained",
    # ignore_compilers=["torch_tensor_rt", "tvm"],
    metric_drop_ths=0.1,
)

#3 Save the optimized model
save_model(optimized_model, "model_save_path")

Output:

Fetching 16 files: 100% 16/16 [00:00<00:00, 1168.98it/s]

Cannot initialize model with low cpu memory usage because `accelerate` was not found in the environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install `accelerate` for faster and less memory-intense model loading. You can do so with: pip install accelerate.

`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["id2label"]` will be overriden.

2023-03-28 14:49:30 | INFO     | Running Speedster on GPU:0
2023-03-28 14:49:32 | WARNING  | Missing Dependencies: tf2onnx.
 Without them, some compilers may not work properly.
2023-03-28 14:49:32 | INFO     | The provided model is a diffusion model. Speedster will optimize the UNet part of the model.
2023-03-28 14:49:33 | WARNING  | Detected not consistent batch size in the inputs.
2023-03-28 14:49:33 | WARNING  | Not enough data for splitting the DataManager. You should provide at least 100 data samples to allow a good split between train and test sets. Compression, calibration and precision checks will use the same data.
2023-03-28 14:49:34 | INFO     | Benchmark performance of original model
2023-03-28 14:49:44 | INFO     | Original model latency: 0.08197490215301513 sec/iter
2023-03-28 14:51:28 | INFO     | [1/2] Running PyTorch Optimization Pipeline
2023-03-28 14:51:29 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-03-28 14:51:29 | WARNING  | Unable to trace model with torch.fx
2023-03-28 14:51:35 | INFO     | Optimized model latency: 0.07764577865600586 sec/iter
2023-03-28 14:51:36 | INFO     | Optimizing with PytorchBackendCompiler and q_type: QuantizationType.HALF.
2023-03-28 14:51:36 | WARNING  | Unable to trace model with torch.fx
2023-03-28 14:51:43 | INFO     | Optimized model latency: 0.07759284973144531 sec/iter
2023-03-28 14:51:43 | INFO     | [2/2] Running ONNX Optimization Pipeline
2023-03-28 14:51:43 | INFO     | Optimizing with ONNXCompiler and q_type: None.
2023-03-28 14:51:46 | WARNING  | TensorrtExecutionProvider for onnx is not available. If you want to use it, please  add the path to tensorrt to the LD_LIBRARY_PATH environment variable. CUDA provider will be used instead. 
2023-03-28 14:52:09 | INFO     | Optimized model latency: 0.12782716751098633 sec/iter
2023-03-28 14:52:09 | INFO     | Optimizing with ONNXCompiler and q_type: QuantizationType.HALF.
2023-03-28 14:52:23 | WARNING  | TensorrtExecutionProvider for onnx is not available. If you want to use it, please  add the path to tensorrt to the LD_LIBRARY_PATH environment variable. CUDA provider will be used instead. 
2023-03-28 14:52:45 | INFO     | Optimized model latency: 0.12182903289794922 sec/iter
2023-03-28 14:52:45 | INFO     | Optimizing with ONNXTensorRTCompiler and q_type: None.
2023-03-28 14:52:45 | WARNING  | Skipping float32 precision for Stable Diffusion, half precision will be used instead.
2023-03-28 14:52:45 | INFO     | Optimizing with ONNXTensorRTCompiler and q_type: QuantizationType.HALF.
2023-03-28 14:52:46 | WARNING  | Optimization failed with DeepLearningFramework.NUMPY interface of ModelCompiler.TENSOR_RT_ONNX. Got error type object 'DummyClass' has no attribute 'import_onnx'. If possible the compilation will be re-scheduled with another interface. Please consult the documentation for further info or open an issue on GitHub for receiving assistance.
2023-03-28 14:52:46 | INFO     | Optimizing with ONNXTensorRTCompiler and q_type: QuantizationType.STATIC.
2023-03-28 14:52:46 | WARNING  | Skipping static quantization for Stable Diffusion because for now it's not supported.

[Speedster results on NVIDIA A10]
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Metric      ┃ Original Model   ┃ Optimized Model   ┃ Improvement   ┃
┣━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━┫
┃ backend     ┃ PYTORCH          ┃ TorchScript       ┃               ┃
┃ latency     ┃ 0.0820 sec/batch ┃ 0.0776 sec/batch  ┃ 1.06x         ┃
┃ throughput  ┃ 12.20 data/sec   ┃ 12.89 data/sec    ┃ 1.06x         ┃
┃ model size  ┃ 1719.35 MB       ┃ 1720.77 MB        ┃ 0%            ┃
┃ metric drop ┃                  ┃ 0.0064            ┃               ┃
┃ techniques  ┃                  ┃ fp16              ┃               ┃
┗━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━┛
valeriosofi commented 1 year ago

I see the error Optimization failed with DeepLearningFramework.NUMPY interface of ModelCompiler.TENSOR_RT_ONNX. Got error type object 'DummyClass' has no attribute 'import_onnx' during the TensorRT compilation in half precision. It looks like you are missing the graphsurgeon dependency; running pip install onnx-graphsurgeon --extra-index-url https://pypi.ngc.nvidia.com should solve it, let me know ;)
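
Once installed, a quick way to verify the dependency is visible from Python (just a sanity check; the package installs under the module name onnx_graphsurgeon):

# sanity check: the graphsurgeon dependency should now be importable
import onnx_graphsurgeon
print(onnx_graphsurgeon.__version__)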

michaelbogdan commented 1 year ago

Hey @valeriosofi, thank you for the suggestion, the output for Stable Diffusion now looks like this. Seems like your hint worked; maybe the pip install command should be included in the auto installer for people like me who start from a blank slate / fresh system.

I am getting a 30% speedup on an A10, leading to a bit more than 16 it/s, or just under one frame per second (0.825 fps to be precise), which is quite good. Is this result in line with your experiments, experiences, or internal data?

Cannot initialize model with low cpu memory usage because `accelerate` was not found in the environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install `accelerate` for faster and less memory-intense model loading. You can do so with: pip install accelerate.

`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["id2label"]` will be overriden.

2023-03-29 06:28:23 | INFO     | Running Speedster on GPU:0
2023-03-29 06:28:25 | WARNING  | Missing Dependencies: tf2onnx.
 Without them, some compilers may not work properly.
2023-03-29 06:28:25 | INFO     | The provided model is a diffusion model. Speedster will optimize the UNet part of the model.
2023-03-29 06:28:28 | WARNING  | Detected not consistent batch size in the inputs.
2023-03-29 06:28:30 | WARNING  | Not enough data for splitting the DataManager. You should provide at least 100 data samples to allow a good split between train and test sets. Compression, calibration and precision checks will use the same data.
2023-03-29 06:28:31 | INFO     | Benchmark performance of original model
2023-03-29 06:28:40 | INFO     | Original model latency: 0.08061829805374146 sec/iter
2023-03-29 06:30:17 | INFO     | [1/2] Running PyTorch Optimization Pipeline
2023-03-29 06:30:17 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-03-29 06:30:18 | WARNING  | Unable to trace model with torch.fx
2023-03-29 06:30:59 | INFO     | Optimized model latency: 0.07738518714904785 sec/iter
2023-03-29 06:30:59 | INFO     | Optimizing with PytorchBackendCompiler and q_type: QuantizationType.HALF.
2023-03-29 06:30:59 | WARNING  | Unable to trace model with torch.fx
2023-03-29 06:31:06 | INFO     | Optimized model latency: 0.07678866386413574 sec/iter
2023-03-29 06:31:06 | INFO     | [2/2] Running ONNX Optimization Pipeline
2023-03-29 06:31:06 | INFO     | Optimizing with ONNXCompiler and q_type: None.
2023-03-29 06:31:09 | WARNING  | TensorrtExecutionProvider for onnx is not available. If you want to use it, please  add the path to tensorrt to the LD_LIBRARY_PATH environment variable. CUDA provider will be used instead. 
2023-03-29 06:31:34 | INFO     | Optimized model latency: 0.12715411186218262 sec/iter
2023-03-29 06:31:34 | INFO     | Optimizing with ONNXCompiler and q_type: QuantizationType.HALF.
2023-03-29 06:31:47 | WARNING  | TensorrtExecutionProvider for onnx is not available. If you want to use it, please  add the path to tensorrt to the LD_LIBRARY_PATH environment variable. CUDA provider will be used instead. 
2023-03-29 06:32:12 | INFO     | Optimized model latency: 0.12097835540771484 sec/iter
2023-03-29 06:32:12 | INFO     | Optimizing with ONNXTensorRTCompiler and q_type: None.
2023-03-29 06:32:12 | WARNING  | Skipping float32 precision for Stable Diffusion, half precision will be used instead.
2023-03-29 06:32:12 | INFO     | Optimizing with ONNXTensorRTCompiler and q_type: QuantizationType.HALF.
2023-03-29 06:42:22 | INFO     | Optimized model latency: 0.06221413612365723 sec/iter
2023-03-29 06:42:22 | INFO     | Optimizing with ONNXTensorRTCompiler and q_type: QuantizationType.STATIC.
2023-03-29 06:42:22 | WARNING  | Skipping static quantization for Stable Diffusion because for now it's not supported.

[Speedster results on NVIDIA A10]
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Metric      ┃ Original Model   ┃ Optimized Model   ┃ Improvement   ┃
┣━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━┫
┃ backend     ┃ PYTORCH          ┃ TensorRT          ┃               ┃
┃ latency     ┃ 0.0806 sec/batch ┃ 0.0622 sec/batch  ┃ 1.30x         ┃
┃ throughput  ┃ 12.40 data/sec   ┃ 16.07 data/sec    ┃ 1.30x         ┃
┃ model size  ┃ 1719.35 MB       ┃ 1726.96 MB        ┃ 0%            ┃
┃ metric drop ┃                  ┃ 0.0260            ┃               ┃
┃ techniques  ┃                  ┃ fp16              ┃               ┃
┗━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━┛

Max speed-up with your input parameters is 1.30x. If you want to get a faster optimized model, see the following link for some suggestions: https://docs.nebuly.com/Speedster/advanced_options/#acceleration-suggestions
valeriosofi commented 1 year ago

It's a bit strange, we also tested on an A10 and got quite different results. In our experiments we got a 2x speedup compared to the original fp16 model. Are you sure that you are using tensorrt==8.6.0 and CUDA>=12.0? It's important that the tensorrt version matches, because previous versions supported fewer optimizations for stable diffusion, so it was slower. Another thing: graphsurgeon is already included in the autoinstaller. I've just tested it by running the command python -m nebullvm.installers.auto_installer --compilers all and it installed it successfully, what command did you use to run the autoinstaller?
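
You can also double-check both versions from Python (a quick sanity check, nothing Speedster-specific; note that torch.version.cuda reports the CUDA version PyTorch was built against, while nvidia-smi shows the driver's CUDA version):

import torch
import tensorrt

print(tensorrt.__version__)  # should print 8.6.0
print(torch.version.cuda)    # should be >= 12.0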

michaelbogdan commented 1 year ago

@valeriosofi thank you for the continued support, we are getting closer! Indeed, Lambda Cloud comes with CUDA<12.0; after updating to CUDA 12.0 and running python -m nebullvm.installers.auto_installer --compilers all, we now get the result below. Better, at over +80%: we are now even at sub-second inference for single images with 20 inference steps, but not quite the 2x speedup.

Should I increase `metric_drop_ths` from 0.1 to 0.5, or ignore some compilers?
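
For example (just a sketch of what I mean, reusing my script from above):

optimized_model = optimize_model(
    model=pipe,
    input_data=input_data,
    optimization_time="constrained",
    metric_drop_ths=0.5,                   # loosened from 0.1
    ignore_compilers=["torch_tensor_rt"],  # e.g. skip the compiler that keeps failing
)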

2023-03-29 09:13:49 | INFO     | Running Speedster on GPU:0
2023-03-29 09:13:50 | INFO     | The provided model is a diffusion model. Speedster will optimize the UNet part of the model.
2023-03-29 09:13:54 | WARNING  | Detected not consistent batch size in the inputs.
2023-03-29 09:13:55 | WARNING  | Not enough data for splitting the DataManager. You should provide at least 100 data samples to allow a good split between train and test sets. Compression, calibration and precision checks will use the same data.
2023-03-29 09:13:56 | INFO     | Benchmark performance of original model
2023-03-29 09:14:06 | INFO     | Original model latency: 0.07966887950897217 sec/iter
2023-03-29 09:14:39 | INFO     | [1/2] Running PyTorch Optimization Pipeline
2023-03-29 09:14:40 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-03-29 09:14:40 | WARNING  | Unable to trace model with torch.fx
2023-03-29 09:15:20 | INFO     | Optimized model latency: 0.07608795166015625 sec/iter
2023-03-29 09:15:20 | INFO     | Optimizing with PytorchBackendCompiler and q_type: QuantizationType.HALF.
2023-03-29 09:15:21 | WARNING  | Unable to trace model with torch.fx
2023-03-29 09:15:27 | INFO     | Optimized model latency: 0.07613825798034668 sec/iter
2023-03-29 09:15:27 | INFO     | Optimizing with PyTorchTensorRTCompiler and q_type: None.
2023-03-29 09:15:47 | WARNING  | Optimization failed with DeepLearningFramework.PYTORCH interface of ModelCompiler.TENSOR_RT_TORCH. Got error [Error thrown at core/conversion/conversionctx/ConversionCtx.cpp:169] Building serialized network failed in TensorRT
. If possible the compilation will be re-scheduled with another interface. Please consult the documentation for further info or open an issue on GitHub for receiving assistance.
2023-03-29 09:15:47 | INFO     | Optimizing with PyTorchTensorRTCompiler and q_type: QuantizationType.HALF.
2023-03-29 09:19:58 | WARNING  | Optimization failed with DeepLearningFramework.PYTORCH interface of ModelCompiler.TENSOR_RT_TORCH. Got error [Error thrown at core/conversion/conversionctx/ConversionCtx.cpp:169] Building serialized network failed in TensorRT
. If possible the compilation will be re-scheduled with another interface. Please consult the documentation for further info or open an issue on GitHub for receiving assistance.
2023-03-29 09:19:58 | INFO     | Optimizing with PyTorchTensorRTCompiler and q_type: QuantizationType.STATIC.
2023-03-29 09:19:58 | WARNING  | Static quantization is not available when using dynamic shape
2023-03-29 09:19:58 | INFO     | [2/2] Running ONNX Optimization Pipeline
2023-03-29 09:19:58 | INFO     | Optimizing with ONNXCompiler and q_type: None.
2023-03-29 09:20:02 | WARNING  | TensorrtExecutionProvider for onnx is not available. If you want to use it, please  add the path to tensorrt to the LD_LIBRARY_PATH environment variable. CUDA provider will be used instead. 
2023-03-29 09:20:22 | INFO     | Optimized model latency: 0.12598967552185059 sec/iter
2023-03-29 09:20:22 | INFO     | Optimizing with ONNXCompiler and q_type: QuantizationType.HALF.
2023-03-29 09:20:37 | WARNING  | TensorrtExecutionProvider for onnx is not available. If you want to use it, please  add the path to tensorrt to the LD_LIBRARY_PATH environment variable. CUDA provider will be used instead. 
2023-03-29 09:20:56 | INFO     | Optimized model latency: 0.1198880672454834 sec/iter
2023-03-29 09:20:56 | INFO     | Optimizing with ONNXTensorRTCompiler and q_type: None.
2023-03-29 09:20:56 | WARNING  | Skipping float32 precision for Stable Diffusion, half precision will be used instead.
2023-03-29 09:20:56 | INFO     | Optimizing with ONNXTensorRTCompiler and q_type: QuantizationType.HALF.
2023-03-29 09:26:31 | INFO     | Optimized model latency: 0.04310011863708496 sec/iter
2023-03-29 09:26:31 | INFO     | Optimizing with ONNXTensorRTCompiler and q_type: QuantizationType.STATIC.
2023-03-29 09:26:31 | WARNING  | Skipping static quantization for Stable Diffusion because for now it's not supported.

[Speedster results on NVIDIA A10]
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Metric      ┃ Original Model   ┃ Optimized Model   ┃ Improvement   ┃
┣━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━┫
┃ backend     ┃ PYTORCH          ┃ TensorRT          ┃               ┃
┃ latency     ┃ 0.0797 sec/batch ┃ 0.0431 sec/batch  ┃ 1.85x         ┃
┃ throughput  ┃ 12.55 data/sec   ┃ 23.20 data/sec    ┃ 1.85x         ┃
┃ model size  ┃ 1719.35 MB       ┃ 1729.70 MB        ┃ 0%            ┃
┃ metric drop ┃                  ┃ 0.0217            ┃               ┃
┃ techniques  ┃                  ┃ fp16              ┃               ┃
┗━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━┛

Max speed-up with your input parameters is 1.85x. If you want to get a faster optimized model, see the following link for some suggestions: https://docs.nebuly.com/Speedster/advanced_options/#acceleration-suggestions

CUDA and TensorRT are up to date:

ubuntu@192-9-148-47:~$ nvidia-smi
Wed Mar 29 09:43:40 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10          On   | 00000000:06:00.0 Off |                    0 |
|  0%   47C    P0    61W / 150W |  12104MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2261      C   /usr/bin/python3                12102MiB |
+-----------------------------------------------------------------------------+
ubuntu@192-9-148-47:~$ pip show tensorrt
Name: tensorrt
Version: 8.6.0
Summary: A high performance deep learning inference library
Home-page: https://developer.nvidia.com/tensorrt
Author: NVIDIA Corporation
Author-email: 
License: Proprietary
Location: /home/ubuntu/.local/lib/python3.8/site-packages
Requires: nvidia-cublas-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12
Required-by: torch-tensorrt
valeriosofi commented 1 year ago

Hi @michaelbogdan, I think we have now reached the optimal speedup! I looked at my numbers again and I obtained 1.9x, so it's quite similar!
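
In case it's useful, you can reload the saved model later with load_model (a minimal sketch of the basic usage; for diffusion pipelines you may also need to pass the original pipeline object, see the docs for the exact signature):

from speedster import load_model

optimized_model = load_model("model_save_path")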