jchenghu / ExpansionNet_v2

Implementation code of the work "Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning"
https://arxiv.org/abs/2208.06551
MIT License

Convert to ONNX then to TensorRT format #2

Closed: shahizat closed this issue 6 months ago

shahizat commented 1 year ago

Hello ExpansionNet_v2 contributors,

I was wondering if anyone has attempted to convert an ExpansionNet_v2-based ".pth" model to ".onnx" and then to TensorRT format?

Thank you and best regards, Shakhizat

jchenghu commented 1 year ago

Hi Shakhizat,

Unfortunately we haven't attempted it yet, but we will try and let you know in the coming days.

Thank you for your patience. Best regards, Jia

jchenghu commented 1 year ago

Hi,

Sorry for the delay, we encountered some setbacks during the installation of the onnx-tensorrt backend. Do you have a more specific question on the matter? For example, do you need the .onnx file? Are you interested in the partial model or the end-to-end model?

Best regards, Jia

shahizat commented 1 year ago

Hello @jchenghu , Thank you for your response. Could you please provide me with a step-by-step procedure on how you converted the model to ONNX? I am interested in running your model on the Nvidia Jetson NX using TensorRT optimization.

jchenghu commented 1 year ago

Hi,

Thank you for the interest. I'm currently preparing the edited version of the model's code as well as the export script. Exporting the graph has been trickier than expected (e.g. I encountered strange errors such as SIGFPE whenever I included the masking operation). I will provide you with the files as soon as I can.

jchenghu commented 1 year ago

I pushed the onnx conversion file and the onnx file here https://drive.google.com/drive/folders/1bBMH4-Fw1LcQZmSzkMCqpEl0piIP88Y3?usp=share_link.

However, although the model can be converted successfully and passes the ONNX checker, the graph strangely fails both the onnx_tensorrt backend and onnxruntime tests, raising errors such as:

RuntimeError: While parsing node number 76:
ModelImporter.cpp:162 In function parseGraph:  [6] Invalid Node - /swin_transf/0/blocks.0/Reshape_5  constant /swin_transf/0/blocks.0/Constant_5_output_0 is not a valid shape tensor

From the graph visualization (https://netron.app/), it appears to be caused by this operation in the SwinTransformer block,

attn = self.softmax(attn)
attn = self.attn_drop(attn)
x = (attn @ v).transpose(1, 2).reshape(B_, N, C)
                               ^^^^^^<-- Here
x = self.proj(x)

but I did not figure out why...
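For anyone hitting the same parser error, the constant named in the message can be inspected directly in the exported graph to see why it is not accepted as a shape tensor (a valid shape tensor must be a 1-D integer tensor). A minimal diagnostic sketch, assuming the exported file is named rf_model.onnx as elsewhere in this thread:

import onnx

model = onnx.load("rf_model.onnx")
target = "/swin_transf/0/blocks.0/Constant_5_output_0"   # the constant named in the error above
for node in model.graph.node:
    if node.op_type == "Constant" and target in node.output:
        for attr in node.attribute:
            if attr.name == "value":
                # TensorProto element types: 1 = FLOAT, 7 = INT64, 11 = DOUBLE
                print("dims:", list(attr.t.dims), "elem_type:", attr.t.data_type)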

Could you please provide me with a step-by-step procedure on how you converted the model to ONNX?

My system info:

I hope the file convert2onnx.py helps as a reference. I mainly adopted scripting for simplicity. The model would surely benefit from a hybrid approach of both tracing and scripting, but for the moment I'm focusing on providing something that works regardless of its efficiency.

Besides that, I edited the code in a few parts:

I'm still working on it. For the time being, the ONNX conversion succeeds but the onnx_tensorrt backend fails. I wonder if it's caused by a version mismatch in my packages. I'd like to know if the same occurs for you.
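For readers who only need the general shape of the export, the following is a minimal, self-contained sketch of the scripting-based approach described above. It is not the contents of convert2onnx.py: a toy module stands in for the real captioning model, only the image input is shown (the real graph also exposes the generation parameters used later in this thread), and the opset version is an assumption.

import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    # stand-in for the real end-to-end captioning model
    def forward(self, enc_x: torch.Tensor) -> torch.Tensor:
        return enc_x.flatten(1)

model = ToyEncoder().eval()
scripted = torch.jit.script(model)             # scripting rather than tracing, as mentioned above
dummy_image = torch.randn(1, 3, 384, 384)      # the pre-processed input size used in this thread

torch.onnx.export(scripted, (dummy_image,), "toy_model.onnx",
                  input_names=["enc_x"], output_names=["out"], opset_version=14)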

shahizat commented 1 year ago

Hi @jchenghu,

Thanks for your detailed response. Just FYI, I experienced the issue below:

import onnxruntime
import onnx
import cv2

onnx_model = onnx.load('./rf_model.onnx')

try:
    onnx.checker.check_model(onnx_model)
except onnx.checker.ValidationError as e:
    print("The model is invalid: %s" % e)
else:
    print("The model is valid!")

im = cv2.imread('./demo_material/micheal.jpg')
print(im.shape)

ort_sess = onnxruntime.InferenceSession('./rf_model.onnx',providers=[ 'CPUExecutionProvider'])
outputs = ort_sess.run(None, {'input': im})
print(outputs)

output


The model is valid!
(589, 880, 3)
Traceback (most recent call last):
  File "checkcuda.py", line 23, in <module>
    ort_sess = onnxruntime.InferenceSession('./rf_model.onnx',providers=[ 'CPUExecutionProvider'])
  File "/home/jetson/.virtualenvs/ocr/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 347, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/home/jetson/.virtualenvs/ocr/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 384, in _create_inference_session
    sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidGraph: [ONNXRuntimeError] : 10 : INVALID_GRAPH : Load model from ./rf_model.onnx failed:This is an invalid model. Type Error: Type 'tensor(double)' of input parameter (onnx::Reshape_7646) of operator (Reshape) in node (Reshape_309) is invalid.
jchenghu commented 1 year ago

I believe your particular error is caused by the absence of pre-processing. The input should be resized to (384, 384) first; otherwise the model cannot generalize to arbitrary images (a pre-processing snippet is shown in convert2onnx.py :-))

shahizat commented 1 year ago

Anyway, I get the same error. If you don't mind, could you please provide code for inference with ONNX Runtime?

import onnxruntime
import onnx
import cv2
import torchvision
onnx_model = onnx.load('./rf_model.onnx')
from PIL import Image as PIL_Image
try:
    onnx.checker.check_model(onnx_model)
except onnx.checker.ValidationError as e:
    print("The model is invalid: %s" % e)
else:
    print("The model is valid!")

img_size = 384
# Pre-Processing
transf_1 = torchvision.transforms.Compose([torchvision.transforms.Resize((img_size, img_size))])
transf_2 = torchvision.transforms.Compose([torchvision.transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                                                            std=[0.229, 0.224, 0.225])])

pil_image = PIL_Image.open('./demo_material/micheal.jpg')
if pil_image.mode != 'RGB':
    pil_image = PIL_Image.new("RGB", pil_image.size)
preprocess_pil_image = transf_1(pil_image)
image = torchvision.transforms.ToTensor()(preprocess_pil_image)
image = transf_2(image)

# <- I need this to test the Swin Transformer part later; for now I'm only testing the part AFTER..
print(image.shape)
print(image)
#im = cv2.imread('./demo_material/micheal.jpg')
ort_sess = onnxruntime.InferenceSession('./rf_model.onnx',providers=[ 'CPUExecutionProvider'])
outputs = ort_sess.run(None, {'input': image})
print(outputs)
jchenghu commented 1 year ago

Sure,

I updated the conversion file with the following part:

import onnxruntime as ort

onnx_model = onnx.load(args.output_onnx_path)
ort_sess = ort.InferenceSession(args.output_onnx_path)
input_dict = {'enc_x': image.numpy(),
              'enc_x_num_pads': torch.tensor([0]).numpy(),
              'sos_idx': coco_tokens['word2idx_dict'][coco_tokens['sos_str']],
              'eos_idx': coco_tokens['word2idx_dict'][coco_tokens['eos_str']],
              'beam_size': 5,
              'max_seq_len': 20}
outputs_ort = ort_sess.run(None, input_dict)

In your case, you just need to replace args.output_onnx_path with ./rf_model.onnx.

I also report the error I'm currently investigating:

.../python3.7/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3190.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Loading model...
Performing forwards...
.../python3.7/site-packages/torch/nn/modules/module.py:1194: UserWarning: operator() sees varying value in profiling, ignoring and this should be handled by GUARD logic (Triggered internally at ../torch/csrc/jit/codegen/cuda/parser.cpp:3668.)
  return forward_call(*input, **kwargs)
ONNX graph conversion done.
ONNX graph checked.
Testing
Traceback (most recent call last):
  File ".../ExpansionNet_v2_src/onnx_conversion/convert2onnx.py", line 277, in <module>
    ort_sess = ort.InferenceSession(args.output_onnx_path)
  File ".../python3.7/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 360, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File ".../python3.7/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 397, in _create_inference_session
    sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidGraph: [ONNXRuntimeError] : 10 : INVALID_GRAPH : Load model from ./rf_model.onnx failed:This is an invalid model. Type Error: Type 'tensor(double)' of input parameter (/swin_transf/0/blocks.0/Constant_5_output_0) of operator (Reshape) in node (/swin_transf/0/blocks.0/Reshape_5) is invalid.
shahizat commented 1 year ago

Hi, @jchenghu thanks, I apologize if I am being pushy. I am hopeful that it will be resolved soon.

jchenghu commented 1 year ago

Hi,

Don't worry, you're not being pushy at all :-)

Actually, I'm sorry, but I've been busy these past days because of deadlines. I restarted working on it yesterday. Currently I'm trying to find alternative, equivalent implementations in an attempt to bypass that error.

I'll keep you updated.

shahizat commented 1 year ago

IMHO, integrating the ONNX support and TensorRT framework into ExpansionNet will undoubtedly enhance its functionality. I believe this project deserves attention from machine learning enthusiasts worldwide. It's a fantastic project!

jchenghu commented 1 year ago

Thank you very much for the support!

Actually, we have NVIDIA Jetson in our lab as well, so we share the same goal :-)

shahizat commented 1 year ago

Hi @jchenghu, just FYI, we've already trained image captioning for the Kazakh language using ExpansionNetV2 and deployed it onto an Nvidia Jetson Xavier NX. The regular PyTorch model can sometimes overload the board, and the Jetson then throttles itself due to overheating. Our github project: https://github.com/IS2AI/kaz-image-captioning

jchenghu commented 1 year ago

Our github project: https://github.com/IS2AI/kaz-image-captioning

That's simply amazing. I'm truly happy and extremely honored this project helped in such a cause.


Update,

Good news: regarding the issue, I managed to run the ONNX Runtime inference session successfully. If you are interested, the previous error was caused by the window_reverse function. Although I solved the issue, its cause is still unknown to me: for some reason ONNX Runtime didn't like the function and its use of the variable B, so I had to move the operations out of the function and add a few extra steps compared to the original formulation to avoid using the value B (using -1 in the reshaping). The result is the same, which is why I still don't know why ONNX Runtime failed in the first place. Ultimately, ONNX Runtime now generates the correct caption. Updated files can be found in onnx_conversion. convert2onnx.py produces a lot of warnings, but they are related to the removal of unused nodes.

Bad news: I'm now testing the onnx_tensorrt backend and there's a new error:

  File ".../python3.7/site-packages/onnx_tensorrt-8.0.1-py3.7.egg/onnx_tensorrt/backend.py", line 236, in prepare
  File ".../python3.7/site-packages/onnx_tensorrt-8.0.1-py3.7.egg/onnx_tensorrt/backend.py", line 68, in __init__
RuntimeError: While parsing node number 2656:
onnx2trt_utils.cpp:2033 In function unaryHelper:
[8] Assertion failed: validUnaryType && "This version of TensorRT does not support the given operator with the given input data type."

I'll try to fix this as well, as soon as possible. I'll keep you updated.

Jia

shahizat commented 1 year ago

Hi @jchenghu, thanks for your detailed response. I am still experiencing the issue below:

/home/jetson/.virtualenvs/ocr/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 
  warn(f"Failed to load image Python extension: {e}")
Testing firts image on ONNX runtime
Traceback (most recent call last):
  File "test_onnx.py", line 30, in <module>
    ort_sess = ort.InferenceSession('./model.onnx')
  File "/home/jetson/.virtualenvs/ocr/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 347, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/home/jetson/.virtualenvs/ocr/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 384, in _create_inference_session
    sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidGraph: [ONNXRuntimeError] : 10 : INVALID_GRAPH : Load model from ./model.onnx failed:This is an invalid model. Type Error: Type 'tensor(double)' of input parameter (/swin_transf/0/blocks.0/Constant_5_output_0) of operator (Reshape) in node (/swin_transf/0/blocks.0/Reshape_5) is invalid.

The code is:

import onnxruntime as ort
import numpy as np    
import pickle
import torchvision
from PIL import Image as PIL_Image

img_size = 384
with open('./demo_material/demo_coco_tokens.pickle', 'rb') as f:
    coco_tokens = pickle.load(f)

# Pre-Processing
def preprocess_image(image_path):
    transf_1 = torchvision.transforms.Compose([torchvision.transforms.Resize((img_size, img_size))])
    transf_2 = torchvision.transforms.Compose([torchvision.transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                                                                std=[0.229, 0.224, 0.225])])

    pil_image = PIL_Image.open(image_path)
    if pil_image.mode != 'RGB':
        pil_image = PIL_Image.new("RGB", pil_image.size)
    preprocess_pil_image = transf_1(pil_image)
    image = torchvision.transforms.ToTensor()(preprocess_pil_image)
    image = transf_2(image)
    return image.unsqueeze(0)

# we test the generalization of the graph by testing on two images
image_1 = preprocess_image('./demo_material/tatin.jpg')
# generate optimized graph
print("Testing first image on ONNX runtime")
ort_sess = ort.InferenceSession('./model.onnx')
input_dict = {'enc_x': image_1.numpy(),
                'enc_x_num_pads': np.array([0]),
                'sos_idx': np.array([coco_tokens['word2idx_dict'][coco_tokens['sos_str']]]),
                'eos_idx': np.array([coco_tokens['word2idx_dict'][coco_tokens['eos_str']]]),
                'max_seq_len': np.array([20])}
outputs_ort = ort_sess.run(None, input_dict)
output_caption = [coco_tokens['idx2word_list'][idx] for idx in outputs_ort[0][0]]
print("\n\n\nONNX Runtime result:\n\t\t" + str(' '.join(output_caption)), end="\n\n\n")
jchenghu commented 1 year ago

Hi,

That's odd... can you confirm that swin_transformer_onnx.py contains the following lines? (note that window_reverse should never be invoked)

        # W-MSA/SW-MSA
        attn_windows = self.attn(x_windows, mask=self.attn_mask, mask_is_none=self.attn_mask_is_none)
        # nW*B, window_size*window_size, C

        # merge windows
        # attn_windows = attn_windows.view(-1, self.window_size, self.window_size, C)
        # shifted_x = window_reverse(attn_windows, self.window_size, H, W)  # B H' W' C

        # custom window_reverse designed for ONNX - - - -
        x = attn_windows.view(-1, self.window_size, self.window_size, C)
        x = x.view(-1, self.window_size * self.window_size, C)
        W_div_ = int(W / self.window_size)
        x = x.view(-1, W_div_, self.window_size * self.window_size, C)
        x = x.view(-1, self.window_size, self.window_size * C)
        x = x.transpose(0, 1).contiguous()  # [window_size, B*(H // window_size)*(W // window_size), window_size * C]
        x = x.reshape(self.window_size, -1, W * C)
        x = x.transpose(0, 1).contiguous()  # (B * H // window_size, window_size, W * C)
        x = x.reshape(-1, H, W * C)
        x = x.reshape(-1, H, W, C)
        shifted_x = x
        # - - - - - - - - - - - - - - - - - - -

Edit: this is a trivial matter, but since your code looks a little different from convert2onnx.py, just to be sure, remember also to re-create the ONNX graph (with the latest implementation of swin_transformer_onnx.py) before invoking ONNX Runtime.
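For comparison, the standard Swin Transformer window_reverse that the custom reshapes above replace looks roughly like this (quoted from the widely used reference implementation rather than from this repository's files); the explicit computation of the batch size B is the part the export appeared to reject:

def window_reverse(windows, window_size, H, W):
    # windows: (num_windows * B, window_size, window_size, C)  ->  returns (B, H, W, C)
    B = int(windows.shape[0] / (H * W / window_size / window_size))  # data-dependent batch size
    x = windows.view(B, H // window_size, W // window_size, window_size, window_size, -1)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)
    return x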

jchenghu commented 1 year ago

Good news: after a lot of digging and fixing, I solved the problems and warnings mentioned in the previous post and successfully exported the model to TensorRT on my machine (machine details are reported above).

I suggest you check the latest commit. I renamed the folder onnx_conversion to onnx4tensorrt. The ONNX graph generated by the updated version of convert2onnx.py (*) can be converted to TensorRT using trtexec. Thus the conversion model -> ONNX -> TensorRT should now be complete.

(*) The testing phase on onnx_tensorrt may take a while; you can stop the execution as soon as the ONNX file is generated and feed it directly to trtexec.

shahizat commented 1 year ago

@jchenghu, good to hear it. I solved the format conversion from .pth to .onnx; the issue was due to the torch and torchvision versions. I assume torchvision needs to be compiled from source on the Jetson rather than installed via pip. Currently, I cannot build the https://github.com/onnx/onnx-tensorrt package from source. Did you install it from source as well?

jchenghu commented 1 year ago

Thank you for the feedback.

Yes, the onnx-tensorrt package was installed from source. However, if you encounter issues with this package, since it was included in convert2onnx.py only for testing purposes, I suggest you skip this step and feed the ONNX graph directly to trtexec.

shahizat commented 1 year ago

@jchenghu, just FYI, the converted ONNX model (ort_sess = ort.InferenceSession('./model.onnx')) runs on CPU but not on GPU, even with my modification below: ort_sess = ort.InferenceSession('./rf_model.onnx', providers=[("CUDAExecutionProvider", {"cudnn_conv_algo_search": "DEFAULT"}), "CPUExecutionProvider"])

I failed to convert to TensorRT format using the command below:

/usr/src/tensorrt/bin/trtexec --onnx=./rf_model.onnx --saveEngine=./model_fp32.engine --workspace=20000

The error output:

Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[03/03/2023-12:59:05] [E] [TRT] ModelImporter.cpp:729: --- End node ---
[03/03/2023-12:59:05] [E] [TRT] ModelImporter.cpp:731: ERROR: ModelImporter.cpp:168 In function parseGraph:
[6] Invalid Node - TopK_6874
This version of TensorRT only supports input K as an initializer. Try applying constant folding on the model using Polygraphy: https://github.com/NVIDIA/TensorRT/tree/master/tools/Polygraphy/examples/cli/surgeon/02_folding_constants
[03/03/2023-12:59:05] [E] Failed to parse onnx file
[03/03/2023-12:59:05] [I] Finish parsing network model
[03/03/2023-12:59:05] [E] Parsing model failed
[03/03/2023-12:59:05] [E] Failed to create engine from model or file.
[03/03/2023-12:59:05] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --onnx=./rf_model.onnx --saveEngine=./model_fp32.engine --workspace=20000

Could you please share your opinion about it? Thanks.

shahizat commented 1 year ago

@jchenghu it might be because I missed the ONNX simplify operation, but I am not sure if it is needed.

jchenghu commented 1 year ago

converted ONNX model runs on CPU but not on GPU,

My bad. As in the case of the demo, I implemented the ONNX conversion on CPU rather than GPU because I was afraid the memory cost of the end-to-end forward pass could be too much for some GPUs. But it's probably safe to assume that anyone interested in running on TensorRT has enough memory for the conversion... :-) I'll try to fix that in the next commit if it turns out to be an issue.

Could you please share your opinion about it? Thanks.

Could you please share your TensorRT version? Mine is 8.0, and from what I understand TopK should be supported (https://github.com/onnx/onnx-tensorrt/blob/8.0-GA/docs/operators.md)...

shahizat commented 1 year ago

@jchenghu, no problem, I am also trying to help you, but I am not as experienced as you. I was just able to convert to TensorRT using trtexec. The problem was that the ONNX model was not simplified (https://github.com/daquexian/onnx-simplifier). Could you please confirm it? You need to add ONNX simplification to your code, if I am not mistaken. The TensorRT version is 8.5.2.

jchenghu commented 1 year ago

I'm learning a lot thanks to your feedback. If I'm not mistaken, since the ONNX graph is an intermediate representation, I expect it is not affected by whether the model was converted on the CPU or the GPU.

I was just able to convert to TensorRT using trtexec.

Good to hear!

The problem was that the ONNX model was not simplified (https://github.com/daquexian/onnx-simplifier). Could you please confirm it? You need to add ONNX simplification to your code, if I am not mistaken. The TensorRT version is 8.5.2.

That's odd, I did not use the simplifier. It may be an issue related to your specific versions. However, the solution seems to be good general practice; I'll try it on my version, and if everything works fine I'll integrate it in the next commit.
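If it helps, the simplification step would look roughly like this (a sketch assuming the onnx-simplifier package, onnxsim, linked above; file names are placeholders):

import onnx
from onnxsim import simplify

model = onnx.load("./rf_model.onnx")
model_simplified, ok = simplify(model)
assert ok, "simplified ONNX model failed validation"
onnx.save(model_simplified, "./rf_model_simplified.onnx")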

shahizat commented 1 year ago

@jchenghu, IMHO an additional code snippet for TensorRT engine-file inference using "import tensorrt as trt" should be included, because ONNX models alone cannot fully leverage the potential of Nvidia GPUs.

Also, just recently I was not able to convert our .pth model in the Kazakh language to ONNX format, although your English model converts fine. Please see the error below:

return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Loading model...
Traceback (most recent call last):
  File "convert2onnx.py", line 90, in <module>
    partially_load_state_dict(model, checkpoint['model_state_dict'])
  File "/home/admin/Projects/ExpansionNet_v2/utils/saving_utils.py", line 104, in partially_load_state_dict
    own_state[name].copy_(param)
RuntimeError: The size of tensor a (74) must match the size of tensor b (63) at non-singleton dimension 0

jchenghu commented 1 year ago

Hi @shahizat

@jchenghu, IMHO an additional code snippet for TensorRT engine-file inference using "import tensorrt as trt" should be included, because ONNX models alone cannot fully leverage the potential of Nvidia GPUs.

Thank you for the suggestion. Building the TensorRT engine and performing an inference comparison, which is currently left to the user, would indeed complete the conversion.

Unfortunately, at the present moment I can't perform tests because of maintenance and deadlines. Would it be fine for you if I worked on it in about 5 days?

Also, just recently I was not able to convert our .pth model in the Kazakh language to ONNX format, although your English model converts fine. Please see the error below:

return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Loading model...
Traceback (most recent call last):
  File "convert2onnx.py", line 90, in <module>
    partially_load_state_dict(model, checkpoint['model_state_dict'])
  File "/home/admin/Projects/ExpansionNet_v2/utils/saving_utils.py", line 104, in partially_load_state_dict
    own_state[name].copy_(param)
RuntimeError: The size of tensor a (74) must match the size of tensor b (63) at non-singleton dimension 0

This is due to the "hard-coded" max sequence length (introduced to avoid the dataset requirement in the demo and ONNX conversion); 74 was the value for the English COCO case. Replacing the argument with 63 should fix it :-)

shahizat commented 1 year ago

Hi @jchenghu, thanks for your support. It fixed the error with our Kazakh model. I prepared a code snippet for inference using the TensorRT engine file, but unfortunately it raised the warnings and errors below. Please have a look when you have free time.

run.py:53: DeprecationWarning: Use get_tensor_shape instead.
  size = trt.volume(engine.get_binding_shape(binding)) * batch_size
run.py:54: DeprecationWarning: Use get_tensor_dtype instead.
  dtype = trt.nptype(engine.get_binding_dtype(binding))
run.py:61: DeprecationWarning: Use get_tensor_mode instead.
  if engine.binding_is_input(binding):
Traceback (most recent call last):
  File "run.py", line 71, in <module>
    cuda.memcpy_htod_async(inp['device'], inp['host'], stream)
pycuda._driver.LogicError: cuMemcpyHtoDAsync failed: invalid argument

Here is the code:

import torch
import numpy as np    
import pickle
import torchvision
from PIL import Image as PIL_Image
from utils.language_utils import tokens2description
import time
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit

img_size = 384

with open('./demo_material/demo_coco_tokens.pickle', 'rb') as f:
    coco_tokens = pickle.load(f)
    sos_idx = coco_tokens['word2idx_dict'][coco_tokens['sos_str']]
    eos_idx = coco_tokens['word2idx_dict'][coco_tokens['eos_str']]

# Pre-Processing
def preprocess_image(image_path):
    transf_1 = torchvision.transforms.Compose([torchvision.transforms.Resize((img_size, img_size))])
    transf_2 = torchvision.transforms.Compose([torchvision.transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                                                                std=[0.229, 0.224, 0.225])])

    pil_image = PIL_Image.open(image_path)
    if pil_image.mode != 'RGB':
        pil_image = PIL_Image.new("RGB", pil_image.size)
    preprocess_pil_image = transf_1(pil_image)
    image = torchvision.transforms.ToTensor()(preprocess_pil_image)
    image = transf_2(image)
    return image.unsqueeze(0)

# Build TensorRT engine
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_runtime = trt.Runtime(TRT_LOGGER)

def build_engine(model_path):
    with open(model_path, "rb") as f:
        engine_data = f.read()
    engine = trt_runtime.deserialize_cuda_engine(engine_data)
    return engine

# we test the generalization of the graph by testing on two images
image_1 = preprocess_image('./demo_material/napoleon.jpg')
# generate optimized graph
print("Testing first image on TensorRT")
engine = build_engine('./trt_fp16.engine')
context = engine.create_execution_context()
batch_size = 1
inputs, outputs, bindings, stream = [], [], [], cuda.Stream()
for binding in engine:

    size = trt.volume(engine.get_binding_shape(binding)) * batch_size
    dtype = trt.nptype(engine.get_binding_dtype(binding))
    # Allocate host and device buffers
    host_mem = cuda.pagelocked_empty(size, dtype)
    device_mem = cuda.mem_alloc(host_mem.nbytes)
    # Append the device buffer to device bindings.
    bindings.append(int(device_mem))
    # Append to the appropriate list.
    if engine.binding_is_input(binding):
        inputs.append({'host': host_mem, 'device': device_mem})
    else:
        outputs.append({'host': host_mem, 'device': device_mem})
# Set input values
inputs[0]['host'] = image_1.numpy().ravel()
inputs[1]['host'] = np.array([0])
inputs[2]['host'] = np.array([sos_idx])
# Transfer input data to the GPU.
for inp in inputs:
    cuda.memcpy_htod_async(inp['device'], inp['host'], stream)
# Execute model
start = time.time()
context.execute_async_v2(batch_size=batch_size,bindings=bindings, stream_handle=stream.handle)
# Transfer predictions back from the GPU.
for out in outputs:
    cuda.memcpy_dtoh_async(out['host'], out['device'], stream)
# Synchronize the stream
stream.synchronize()
output_caption = tokens2description(outputs[0]['host'].tolist(), coco_tokens['idx2word_list'], sos_idx, eos_idx)
print(f"inference time = {time.time() - start}")
print(output_caption)
shahizat commented 1 year ago

@jchenghu, I have resolved the issue mentioned earlier. If you would like, you can use the code I've created. I am pleased to share that it is functional and ready for use.

I used Nvidia PyTorch NGC 23.01 container with NVIDIA TensorRT 8.5.2.2 for that purpose.

Here is the working code:

import torch
import numpy as np    
import pickle
import torchvision
from PIL import Image as PIL_Image
from utils.language_utils import tokens2description
import time
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit

img_size = 384

with open('./demo_material/demo_coco_tokens.pickle', 'rb') as f:
    coco_tokens = pickle.load(f)
    sos_idx = coco_tokens['word2idx_dict'][coco_tokens['sos_str']]
    eos_idx = coco_tokens['word2idx_dict'][coco_tokens['eos_str']]

# Pre-Processing
def preprocess_image(image_path):
    transf_1 = torchvision.transforms.Compose([torchvision.transforms.Resize((img_size, img_size))])
    transf_2 = torchvision.transforms.Compose([torchvision.transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                                                                std=[0.229, 0.224, 0.225])])

    pil_image = PIL_Image.open(image_path)
    if pil_image.mode != 'RGB':
        pil_image = PIL_Image.new("RGB", pil_image.size)
    preprocess_pil_image = transf_1(pil_image)
    image = torchvision.transforms.ToTensor()(preprocess_pil_image)
    image = transf_2(image)
    return image.unsqueeze(0)

# Build TensorRT engine
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_runtime = trt.Runtime(TRT_LOGGER)

def build_engine(model_path):
    with open(model_path, "rb") as f:
        engine_data = f.read()
    engine = trt_runtime.deserialize_cuda_engine(engine_data)
    return engine

# we test the generalization of the graph by testing on two images
image_1 = preprocess_image('./demo_material/napoleon.jpg')
# generate optimized graph
print("Testing first image on TensorRT")
engine = build_engine('./trt_fp.engine')
context = engine.create_execution_context()
batch_size = 1
inputs, outputs, bindings, stream = [], [], [], cuda.Stream()
for binding in engine:

    size = trt.volume(engine.get_binding_shape(binding)) * batch_size
    dtype = trt.nptype(engine.get_binding_dtype(binding))
    # Allocate host and device buffers
    host_mem = cuda.pagelocked_empty(size, dtype)
    device_mem = cuda.mem_alloc(host_mem.nbytes)
    # Append the device buffer to device bindings.
    bindings.append(int(device_mem))
    # Append to the appropriate list.
    if engine.binding_is_input(binding):
        inputs.append({'host': host_mem, 'device': device_mem})
    else:
        outputs.append({'host': host_mem, 'device': device_mem})

# Set input values
inputs[0]['host'] = np.ravel(image_1).astype(np.float32)
inputs[1]['host'] = np.array([0]).astype(np.int32)
inputs[2]['host'] = np.array([sos_idx]).astype(np.int32)
# Transfer input data to the GPU.
for inp in inputs:
    cuda.memcpy_htod_async(inp['device'], inp['host'], stream)
# Execute model
context.execute_async(batch_size=batch_size,bindings=bindings, stream_handle=stream.handle)
# Transfer predictions back from the GPU.
for out in outputs:
    cuda.memcpy_dtoh_async(out['host'], out['device'], stream)
# Synchronize the stream
stream.synchronize()
output_caption = tokens2description(outputs[0]['host'].tolist(), coco_tokens['idx2word_list'], sos_idx, eos_idx)
print(output_caption)
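A note for readers comparing the two snippets: the substantive changes from the earlier failing version appear to be the explicit dtype casts of the host buffers (np.float32 for the image, np.int32 for the scalar index inputs), which match the dtypes the engine expects and avoid the cuMemcpyHtoDAsync invalid-argument error, together with the switch from execute_async_v2 to execute_async.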
shahizat commented 1 year ago

@jchenghu sorry again, food for thought: my FP32 TensorRT model is working correctly, but the FP16 one is not. Output of the FP32 model: A painting of a man riding a horse. Output of the FP16 model: Zucchini zucchini zucchini zucchini zucchini zucchini zucchini zucchini zucchini zucchini zucchini zucchini zucchini zucchini zucchini zucchini zucchini zucchini zucchini zucchini.

jchenghu commented 1 year ago

Hi @shahizat

Sorry for the wait, I'm almost back on track.

I have resolved the issue mentioned earlier. If you would like, you can use the code I've created

That's great, thank you for the code snippet. I'll gladly add it to the project (and make sure to give you credit).

My FP32 TensorRT model is working correctly, but the FP16 one is not.

I'll try it soon and see if I can reproduce the latest issue.

shahizat commented 1 year ago

@jchenghu Hi, you are welcome. Do you have any plans to make it multimodal? Or perhaps to add VQA functionality after the image-to-text step?

jchenghu commented 1 year ago

Hi,

I managed to reproduce the issue in FP16. The output probabilities are: [0.0, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf]. My bet is that it is caused by under-representation at some point in the network at time step 0 and overflow in the following time steps. I'm still looking for a solution.

Edit of 15/03/2023: my hypothesis was wrong; after removing almost all encoding and decoding operations the results are still incorrect (the same as above). I'm investigating the issue further and will keep you updated.


do you have any plans to make it multimodal? Or perhaps to add VQA functionality after the image-to-text step?

Yes, we do. Unfortunately, we cannot share too many details on the matter. I can only guarantee it will be open and that it could be part of a bigger project, but it is hard to predict when it will be released. My goal would be this year, but it is hard to say at the current time.

shahizat commented 1 year ago

Hi @jchenghu, IMHO FP16 is a nice-to-have but not a must-have. Anyway, you did a great job. Thanks a lot!

jchenghu commented 1 year ago

Thank you!

Update: sorry for the wait, I'm very close to solving the issue, expect the commit soon :-)

shahizat commented 1 year ago

Hi @jchenghu, please add me to your LinkedIn network (https://www.linkedin.com/in/shakhizat-nurgaliyev/). Also, I am waiting for https://viper.cs.columbia.edu/ to be released.

jchenghu commented 1 year ago

Hi,

Just when I thought I was close, an even more obscure bug popped out...

[TensorRT] WARNING: TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 111.0.3
[TensorRT] WARNING: Detected invalid timing cache, setup a local cache instead
Process finished with exit code -1

Therefore, sorry to say this, but I'm still investigating... (it was good to hear it was not an urgent matter for you). I will eventually solve it :-)

Hi @jchenghu , please add me to your LinkedIn Network(https://www.linkedin.com/in/shakhizat-nurgaliyev/).

Actually, I'm not active on LinkedIn, unfortunately. If you want to keep in touch for any reason, like getting updates on my work (when I'm able to release it publicly), you can email me at crywhirt@gmail.com (it's just my proxy mail open to the internet; I'll send you the "real" one in case of a poke). Needless to say, in case of other problems or matters feel free to open a new issue.

Also, I am waiting when https://viper.cs.columbia.edu/ will be released.

That's a beautiful idea and a nice paper. Is this the reason you were interested in VQA? I cannot share many details, but I'm currently working on supervised VQA systems.

As always, I will keep you updated. Jia


Update 23/03/23: a few days ago I found that the problem lies in the masking functions; all of them cause the problem mentioned above. The current version was designed ad hoc for TensorRT in the FP32 case (which disliked the initial formulation), so I need to find a new formulation accepted by both FP32 and FP16. Unfortunately, a single attempt takes several hours, so it will take some time...

Update 27/03/23: I decided to move the creation of the masks outside of the ONNX graph. Additionally, it seems like masked_fill is disliked for some reason, and multiplying by the mask seems to be a better choice. Lastly, these days I'm unable to perform experiments because of a fever; thank you for your patience.

Update 04/04/23: Back on it :-)

Update 9/04/23: I'm currently facing a very strange issue: wherever I call "masked_fill" the result changes past the second time step, when I expect it to be the same whether I multiply or use masked_fill, since the masks contain only ones and zeros.

Update 11/04/23: I've found a new direction of possible causes, I hope it guides me to the solution soon. Best.

Update 14/04/23: I managed to solve the above issue, but the problem now is that FP16 and FP32 provide two different descriptions of the same image (napoleon.jpg):

FP32: A painting of a man riding a horse. probs: [0.0, -4.2e-06, -8.96e-05, 0.0, -1.19e-06, -0.75, -0.0013, -1.52e-0, ...
FP16: A person standing on a field with a frizbee. probs: [0.0, 0.0, -0.50, -0.18, -0.192, -3.814e-06, -2.52, -2.479e-05, 0.0, -3.099, -0.00022, ....
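To illustrate the masked_fill versus multiply-by-mask trade-off mentioned in the updates above, here is a minimal sketch; it is not the repository's actual formulation, and the shapes and the -1e4 bias value are assumptions:

import torch

scores = torch.randn(1, 8, 10, 10)      # hypothetical attention logits
pad_mask = torch.ones(1, 1, 1, 10)      # 1 = keep, 0 = padded position
pad_mask[..., 7:] = 0.0

# formulation 1: masked_fill with -inf
masked_fill_version = scores.masked_fill(pad_mask == 0, float('-inf'))

# formulation 2: multiply by the mask and add a finite negative bias instead
multiply_version = scores * pad_mask + (1.0 - pad_mask) * -1e4

p1 = torch.softmax(masked_fill_version, dim=-1)
p2 = torch.softmax(multiply_version, dim=-1)
print(torch.allclose(p1, p2, atol=1e-4))  # the two formulations give numerically very close attention weights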

shahizat commented 1 year ago

Hi @jchenghu, I'm pleased to see that you are in the final stage of resolving the issue. I was also able to convert using the command below, mixing FP16 and INT8 precision, which means the model size can be reduced dramatically as well:

trtexec --onnx=model.onnx --saveEngine=xxx.trt --int8 --fp16

If you succeed, I strongly believe this will significantly improve the versatility of your project, making it more attractive and well known among developers.

shahizat commented 1 year ago

Hi @jchenghu, I was aware that quantization from floating point to integer can cause high accuracy degradation, but not going from FP32 to FP16. For many neural networks, FP16 should achieve the same accuracy as FP32. Maybe I am mistaken here.

jchenghu commented 1 year ago

I agree it's odd; going from FP32 to FP16 shouldn't change the results that much. More testing needs to be done. Thank you for sharing your experience :-) and sorry if it's taking me a while.

Update 21/04/23 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Until now, I've attempted the following patterns:

1) create the model in FP32 -> convert to an ONNX graph -> convert to FP16 only during the TensorRT export -> fail
2) create the model in FP32 -> convert to an ONNX graph -> convert a second time to FP16 using the onnxconverter_common library -> convert to a TensorRT engine with the FP16 option -> fail

I then introduced two new patterns:

1) convert the model to FP16 first -> convert to an ONNX graph (on CUDA, because of unsupported operands) -> convert to a TensorRT engine
2) convert the model to FP16 first -> convert to an ONNX graph -> convert a second time to FP16 using the onnxconverter_common library -> convert to a TensorRT engine

Unfortunately, with no luck.
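For reference, the "convert a second time into FP16" step in the patterns above uses the onnxconverter_common helper; a minimal sketch (file names are placeholders, and keep_io_types is an assumption about preserving FP32 graph inputs/outputs):

import onnx
from onnxconverter_common import float16

model_fp32 = onnx.load("model.onnx")
model_fp16 = float16.convert_float_to_float16(model_fp32, keep_io_types=True)
onnx.save(model_fp16, "model_fp16.onnx")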

Additionally, I'm busy with a deadline these weeks, which slows down my search for new solutions. I hope this is still not an urgent matter for you.

jchenghu commented 1 year ago

Hi @shahizat, it's been a long time

I just wanted to let you know that I'm back to investigating the issue these days. I'm trying to tweak the numerically sensitive parts to find the reason behind the different results between FP32 and FP16. However, it seems to be a not-so-rare problem with TensorRT.

I hope the "real" solution is not too far beyond my reach at the moment.

How are you doing? Any issue with FP32?

Best regards, Jia

jchenghu commented 1 year ago

After several attempts at tweaking the numerical stability, re-installations, and additional libraries, the results in FP16 were still rather odd and always conflicted with the FP32 case.

I still don't know if it's an unlucky weight configuration of my model, a library version-related issue, or whether my particular ONNX graph is simply not well received by TensorRT.

I wonder if training the model from scratch in FP16 instead of FP32 would enable the FP16 conversion in TensorRT... For the time being, I've uploaded the FP32 conversion and the progress on the FP16 case for anyone interested.

Best regards, Jia

shahizat commented 1 year ago

Hi @jchenghu, thanks for your detailed response. So far, I haven't observed any issues with the FP32 model. Anyway, you did a great job.