onnx / onnx-tensorrt

ONNX-TensorRT: TensorRT backend for ONNX
Apache License 2.0

Unsupported ONNX data type: UINT8 (2) #400

Open bnascimento opened 4 years ago

bnascimento commented 4 years ago

Following the tutorial in the notebook https://github.com/onnx/tensorflow-onnx/blob/master/tutorials/ConvertingSSDMobilenetToONNX.ipynb, I am trying to convert MobileNetV2 and V3 frozen models from TensorFlow (frozen_inference_graph.pb or saved_model.pb) to ONNX and then to TensorRT. Under the NGC dockers 20.01-tf1-py3 and 19.05-py3 I am using both this project and tensorflow-onnx. I always get different issues; the furthest I got was under 20.01-tf1-py3 with both onnx-tensorrt and tensorflow-onnx on their master branches, installing both projects from source. I was able to create the .onnx file, but when I try to create the .trt file I get the following.

onnx2trt /media/bnascimento/project/frozen_inference_graph.onnx -o /media/bnascimento/project/frozen_inference_graph.trt
----------------------------------------------------------------
Input filename:   /media/bnascimento/project/frozen_inference_graph.onnx
ONNX IR version:  0.0.6
Opset version:    10
Producer name:    tf2onnx
Producer version: 1.6.0
Domain:           
Model version:    0
Doc string:       
----------------------------------------------------------------
Parsing model
Unsupported ONNX data type: UINT8 (2)
ERROR: image_tensor:0:190 In function importInput:
[8] Assertion failed: convertDtype(onnxDtype.elem_type(), &trtDtype)

I suspect this has to do with the input tensor for the image, but I don't know how to avoid this issue. Has anyone seen similar issues before?

Cheers, Bruno
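
One workaround that is commonly suggested for this parser error (a hedged sketch, not something tried in this report) is to re-declare the graph input as FLOAT and do the uint8-to-float conversion on the client side. It assumes the tf2onnx graph already begins with a node (typically a Cast) that also accepts float input, so check the graph in Netron first; the file names below are just examples.

import onnx
from onnx import TensorProto

# Hedged sketch: flip the declared element type of the graph input from UINT8
# to FLOAT so the TensorRT parser accepts it. This only stays valid if the
# first node consuming the input also accepts float input.
model = onnx.load("frozen_inference_graph.onnx")          # example path
graph_input = model.graph.input[0]
assert graph_input.type.tensor_type.elem_type == TensorProto.UINT8
graph_input.type.tensor_type.elem_type = TensorProto.FLOAT

onnx.checker.check_model(model)                           # sanity check
onnx.save(model, "frozen_inference_graph_float32.onnx")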

SthPhoenix commented 2 years ago

Do you have another forward pass mechanism that allows you to get the desired outputs ?

I'm using this script: https://github.com/SthPhoenix/InsightFace-REST/blob/master/src/api_trt/modules/model_zoo/exec_backends/trt_loader.py I'll try to provide more info shortly.

I can't reproduce this with TensorRT 8.0.3 :/ I always get nans. I guess they have corrected this bug.

I found this bug just today with TensorRT 8.0.3, inside the TensorRT:21.09 docker image.

BTW: Are you sure this input shape is valid? (1, 3, 35, 224, 224)

SthPhoenix commented 2 years ago

My original network requires the image normalized like this: blob = cv2.dnn.blobFromImage(img, 1.0 / 128, input_size, (127.5, 127.5, 127.5), swapRB=True)

Which can be replaced in the PyTorch network definition like this:

def forward(self, x):
    x = x - torch.Tensor([127.5])
    x = torch.mul(x, torch.Tensor([1/128]))
    ...

With this addition the network now takes original pixel values in the 0-255 range, which is actually uint8. In Netron the ONNX input now looks like this:

[Screenshot: Netron view of the updated ONNX model input]

So the updated model now needs only these preprocessing steps:

blob = np.transpose(img, (2, 0, 1))
blob = np.expand_dims(blob, axis=0).astype(np.uint8)

Also, when allocating space for the TensorRT input you should leave the default fp32:

self.inputs[0].host[:allocate_place] = input.flatten(order='C').astype(np.float32)
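
For reference, here is a minimal PyTorch sketch of folding this normalization into the exported graph, assuming a generic backbone module (the class name, input name, and shape are illustrative, not taken from the repository above):

import torch
import torch.nn as nn

class NormalizedModel(nn.Module):
    # Wraps an existing backbone so that (x - 127.5) / 128, matching the
    # blobFromImage call above, is baked into the exported ONNX graph.
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone
        self.register_buffer("mean", torch.tensor(127.5))
        self.register_buffer("scale", torch.tensor(1.0 / 128))

    def forward(self, x):
        # x holds raw pixel values in the 0-255 range.
        x = (x - self.mean) * self.scale
        return self.backbone(x)

# Example export (input name and shape are assumptions):
# wrapped = NormalizedModel(backbone).eval()
# dummy = torch.zeros(1, 3, 224, 224)
# torch.onnx.export(wrapped, dummy, "model_norm.onnx", opset_version=11,
#                   input_names=["input"], output_names=["output"])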

dmenig commented 2 years ago

Ahem, those inputs aren't uint8, are they? ^^ You said:

if you have created engine with fp32 input it accepts uint8 input without any warnings.

But you do feed float32 to TensorRT !

input.flatten(order='C').astype(np.float32)

I believe you have the same conversion problem as everybody else in this discussion.

dmenig commented 2 years ago

BTW: Are you sure this input shape is valid? (1, 3, 35, 224, 224)

Yes, the network I'm using is not a 2d resnet, but a 3d one.

SthPhoenix commented 2 years ago

Ahem, those inputs aren't uint8, are they? ^^ You said:

if you have created engine with fp32 input it accepts uint8 input without any warnings.

But you do feed float32 to TensorRT !

input.flatten(order='C').astype(np.float32)

I believe you have the same conversion problem as everybody else in this discussion.

I have double checked my code, yes, you indeed need to change float32 to uint8 in this part.

dmenig commented 2 years ago

I have double checked my code, yes, you indeed need to change float32 to uint8 in this part.

I think you mean change uint8 to float32

SthPhoenix commented 2 years ago

No, in my case everything works with uint8. You must ensure that your network has all the preprocessing and normalization steps inside, the same as you previously did on the CPU; otherwise it will either produce wrong results or NaNs.

dmenig commented 2 years ago

No, in my case everything works with uint8. You must ensure that your network has all the preprocessing and normalization steps inside, the same as you previously did on the CPU; otherwise it will either produce wrong results or NaNs.

Can you then please share your actual code? If I'm reading what you're saying properly, your actual code can't contain:

input.flatten(order='C').astype(np.float32)

can it ?

SthPhoenix commented 2 years ago

Here is an example of TRT inference code: trt_loader.py. Note that I have removed .astype completely.

Here's a wrapper around the previous module: trt_backend.py. Here I have added the parameter self.input_dtype = np.uint8, which is used by the actual inference code.

And here is the inference code: scrfd.py

As you can see, I'm reading the input_dtype parameter I previously hardcoded. So the image is transposed and converted to uint8, and then fed to TensorRT without further type casts. I added the input_dtype parameter because this code might also be used with ONNX Runtime or Triton Inference Server, which won't work if I try to feed them uint8 inputs unless the model has that input type.

Here's a Google Drive link to a model working with this approach: scrfd_10g_gnkps_norm.onnx. You can check it with Netron.app. You can also try running the image from my repo by changing this line to det_model=scrfd_10g_gnkps_norm
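
For readers who don't want to open the linked files, here is a minimal sketch of the usual pycuda buffer-allocation pattern that code like trt_loader.py is built around (assumed, not copied from that file; static input shapes assumed). The page-locked host buffer takes its dtype from the engine binding, and assigning an array of a different dtype into it makes numpy cast implicitly on the host.

import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit  # creates a default CUDA context
import tensorrt as trt

def allocate_buffers(engine):
    # Allocate one page-locked host buffer and one device buffer per binding,
    # using the dtype the engine actually expects (e.g. np.float32 or np.uint8).
    inputs, outputs, bindings = [], [], []
    stream = cuda.Stream()
    for i in range(engine.num_bindings):
        size = trt.volume(engine.get_binding_shape(i))
        dtype = trt.nptype(engine.get_binding_dtype(i))
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        bindings.append(int(device_mem))
        (inputs if engine.binding_is_input(i) else outputs).append((host_mem, device_mem))
    return inputs, outputs, bindings, stream

# Note: if the input binding is float32, then
#   inputs[0][0][:n] = blob.flatten(order='C')   # blob is uint8
# still performs a uint8 -> float32 cast inside numpy on the host.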

dmenig commented 2 years ago

I was able to reproduce this with my network. Thank you very much !

Bad news. I believe I was able to determine that the computation still happens on the CPU! :/ If I define

import time
import numpy as np
input = np.array(
    np.random.choice(range(256), (1, 3, 35, 224, 224)),
    dtype=np.uint8,
)
allocate_place = np.prod(input.shape)

Then

t=time.time()
for i in range(1000):
    inputs[0].host = input.astype(np.float32)
print(time.time()-t)

takes 1.8s

t=time.time()
for i in range(1000):
    inputs[0].host = input
print(time.time()-t)

takes 0.001 second

t=time.time()
for i in range(1000):
    inputs[0].host[:allocate_place] = input.flatten(order='C')
print(time.time()-t)

takes 2.09 seconds

And

t=time.time()
for i in range(1000):
    _ = input.flatten(order='C')
print(time.time()-t)

takes 0.21 second.

And in case you ask me : inputs[0].host = input.flatten(order='C') leads to NaNs. The [:allocate_place] part is necessary.

SthPhoenix commented 2 years ago

Bad news. I believe I was able to determine that the computation still happens on the CPU! :/

That's totally normal; this part of the code is responsible for allocating data on the host (CPU), which will be consumed by the device (GPU).

And in case you ask me : inputs[0].host = input.flatten(order='C') leads to NaNs. The [:allocate_place] part is necessary.

I haven't removed this part; I have only removed the type cast.

dmenig commented 2 years ago

That's totally normal; this part of the code is responsible for allocating data on the host (CPU), which will be consumed by the device (GPU).

... then what's the point of mentioning this? It is even slower than converting your data to fp32.

SthPhoenix commented 2 years ago

That's totally normal; this part of the code is responsible for allocating data on the host (CPU), which will be consumed by the device (GPU).

... then what's the point of mentioning this? It is even slower than converting your data to fp32.

Excuse me, I have just lost track of your idea.

I have improved the performance of my full pipeline by about 5-10% with the proposed fix; it looks like you are measuring something irrelevant to actual performance.

dmenig commented 2 years ago

That's totally normal; this part of the code is responsible for allocating data on the host (CPU), which will be consumed by the device (GPU).

... then what's the point of mentioning this? It is even slower than converting your data to fp32.

Excuse me, I have just lost track of your idea.

I have improved the performance of my full pipeline by about 5-10% with the proposed fix; it looks like you are measuring something irrelevant to actual performance.

The turn the discussion has taken since @drewm1980's comment is that the current lack of support for the ONNX data type UINT8 forces us to convert uint8 to fp32/fp16/int8 on the CPU (which is CPU intensive) before feeding our data to the model, even though UINT8 is the most common data type for images. What you pointed out is that if you remove the user-side conversion, there is still a way to write the forward pass so that the conversion happens under the hood. That, to me, is equivalent to having a user conversion, ain't it?

I mean, great if you had a speedup, but looking at my time measurements, I doubt the pinpointed line is the culprit.

SthPhoenix commented 2 years ago

The turn the discussion has taken since @drewm1980's comment is that the current lack of support for the ONNX data type UINT8 forces us to convert uint8 to fp32/fp16/int8 on the CPU (which is CPU intensive) before feeding our data to the model, even though UINT8 is the most common data type for images. What you pointed out is that if you remove the user-side conversion, there is still a way to write the forward pass so that the conversion happens under the hood. That, to me, is equivalent to having a user conversion, ain't it?

I mean, great if you had a speedup, but looking at my time measurements, I doubt the pinpointed line is the culprit.

I'll try to check your results later, but I still believe the conversion happens on the GPU side. I figured out this behavior in a multi-GPU environment when I hit the PCIe bus bandwidth limit; this fix helped me get multiple GPUs operating at full capacity and gave a 5-10% performance boost per GPU, while without it I saw performance degradation after the second GPU.

BTW: I think you have missed the main check:

t=time.time()
for i in range(1000):
    inputs[0].host[:allocate_place] = input.flatten(order='C')
print(time.time()-t)

vs

t=time.time()
for i in range(1000):
    inputs[0].host[:allocate_place] = input.flatten(order='C').astype(np.float32)
print(time.time()-t)

dmenig commented 2 years ago

I didn't put this check because it is equivalent to inputs[0].host = input.astype(np.float32), which is in fact shorter. Isn't that the speedup you said you've experienced?

dmenig commented 2 years ago

I'll try to check your results later, but I still believe the conversion happens on the GPU side. I figured out this behavior in a multi-GPU environment when I hit the PCIe bus bandwidth limit; this fix helped me get multiple GPUs operating at full capacity and gave a 5-10% performance boost per GPU, while without it I saw performance degradation after the second GPU.

The fact that it helps you achieve full capacity on the GPUs doesn't mean that the operation is GPU-bound. In fact it is much more consistent with the CPU bottlenecking the GPU in this sequential step.

Also, I've looked at the GPU and CPU metrics while running these for loops, and the GPU doesn't show any utilization through nvidia-smi, while the CPU does.

dmenig commented 2 years ago

I've noticed that it is much quicker to convert from uint8 to int8 than from uint8 to fp32, which could be interesting since int8 is a TensorRT input format. But I don't think it works if the model isn't quantized. Does it?
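
A standalone micro-benchmark to check that claim on your own machine (timings vary a lot by CPU and array size; this is just a sketch):

import time
import numpy as np

img = np.random.randint(0, 256, (1, 3, 224, 224), dtype=np.uint8)

# Compare the cost of casting the same uint8 image to int8 vs float32 on the CPU.
for target in (np.int8, np.float32):
    t = time.time()
    for _ in range(1000):
        _ = img.astype(target)
    print(target.__name__, round(time.time() - t, 3), "s")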

SthPhoenix commented 2 years ago

The fact that it helps you achieve full capacity on the GPUs doesn't mean that the operation is GPU-bound. In fact it is much more consistent with the CPU bottlenecking the GPU in this sequential step.

A CPU bottleneck wasn't the case: each GPU was set to use specific CPU cores using CPU affinity, and when I tried to use more than 2 GPUs the CPU usage actually became much lower, indicating that something else was the bottleneck.

Also, I've looked at the GPU and CPU metrics while running these for loops, and the GPU doesn't show any utilization through nvidia-smi, while the CPU does.

At this step the GPU isn't used at all; the GPU kicks in only when you call this part of the code:

[cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
[
    cuda.memcpy_dtoh_async(out.host, out.device, stream)
    for out in outputs
]
stream.synchronize()

dmenig commented 2 years ago

Exactly. So even though we observe the differences in time measurement I pointed out, you don't think the conversion happens at the pinpointed line?

Even then, the optimization is pointless, because your way of writing it introduces a high overhead at that line, higher than actually converting the data yourself, as measured.

dmenig commented 2 years ago

I must insist that I believe that the speedup you claim to have seen comes from the fact that inputs[0].host[:allocate_place] = input.flatten(order='C').astype(np.float32) is slower than inputs[0].host[:allocate_place] = input.flatten(order='C').astype(np.uint8).

That actually is moot here, since inputs[0].host = input.astype(np.float32) is even faster.

Note that inputs[0].host[:allocate_place] = input.flatten(order='C').astype(np.uint8) is slower than inputs[0].host[:allocate_place] = input.flatten(order='C'), which already works.

SthPhoenix commented 2 years ago

I must insist that I believe that the speedup you claim to have seen comes from the fact that inputs[0].host[:allocate_place] = input.flatten(order='C').astype(np.float32) is slower than inputs[0].host[:allocate_place] = input.flatten(order='C').astype(np.uint8).

Sure, using .astype(np.uint8) on uint8 is much faster than .astype(np.float32), but you're missing one thing: in the first case you are sending uint8 data to the GPU, which is 4 times smaller than float32; that means a smaller data volume transferred between the CPU and the GPU, as pinpointed by @simunovic-antonio.

That actually is moot here, since inputs[0].host = input.astype(np.float32) is even faster.

But as you stated before, in this case the result is NaNs:

And in case you ask me : inputs[0].host = input.flatten(order='C') leads to NaNs. The [:allocate_place] part is necessary.

So this comparison is useless.

Note that inputs[0].host[:allocate_place] = input.flatten(order='C').astype(np.uint8) is slower than inputs[0].host[:allocate_place] = input.flatten(order='C'), which already works.

No conversion is always faster than any conversion.

SthPhoenix commented 2 years ago

Try running this test:

    input = np.array(
        np.random.choice(range(255), (1, 3, 640, 640)),
        dtype=np.float32,
    )

    allocate_place = np.prod(input.shape)

    t = time.time()
    for i in range(10000):
        inputs[0].host[:allocate_place] = input.flatten(order='C')

        [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
        [
            cuda.memcpy_dtoh_async(out.host, out.device, stream)
            for out in outputs
        ]
        stream.synchronize()
    print(time.time() - t)

Change the input dtype between np.float32 and np.uint8, with no type conversion inside the loop. You should see a slight performance gain in the uint8 case, even though no conversion step takes place at all. Also, if you examine GPU usage during the tests you should notice that with uint8 the GPU usage is a bit higher; in my case it's 86% against 74%.

dmenig commented 2 years ago

Sure, using .astype(np.uint8) on uint8 is much faster than .astype(np.float32), but you're missing one thing: in the first case you are sending uint8 data to the GPU, which is 4 times smaller than float32; that means a smaller data volume transferred between the CPU and the GPU, as pinpointed by @simunovic-antonio.

That's my point: you're not actually sending uint8 data to the GPU, you're still converting it on the CPU with this line! It's just that you didn't write the conversion yourself.

Even if you claim not to be converting it on the CPU with this line: this line takes even longer than actually converting the data yourself on the CPU, and shows no GPU utilization in nvidia-smi. So at best you're converting data on the CPU, at worst you're introducing even more CPU overhead just to convert data on the GPU.

dmenig commented 2 years ago

That actually is moot here, since inputs[0].host = input.astype(np.float32) is even faster.

But as you stated before, in this case the result is NaNs:

I didn't, you misread. Please go ahead and read again.

dmenig commented 2 years ago

And in case you ask me : inputs[0].host = input.flatten(order='C') leads to NaNs. The [:allocate_place] part is necessary.

So this comparison is useless.

That is why you'll notice I didn't do that comparison, and there is no time measurement for this line ^^ I wrote this in case you asked why I didn't do this one: because it is useless.

dmenig commented 2 years ago

Note that inputs[0].host[:allocate_place] = input.flatten(order='C').astype(np.uint8) is slower than inputs[0].host[:allocate_place] = input.flatten(order='C'), which already works.

No conversion is always faster than any conversion.

You would be right if no conversion actually happened. I did the time measurements. I'm starting to doubt your good faith in this... :/

dmenig commented 2 years ago

Change the input dtype between np.float32 and np.uint8, with no type conversion inside the loop. You should see a slight performance gain in the uint8 case, even though no conversion step takes place at all. Also, if you examine GPU usage during the tests you should notice that with uint8 the GPU usage is a bit higher; in my case it's 86% against 74%.

Once again, seeing an increase in mean GPU utilization in a loop that also has CPU operations doesn't mean that you moved operations to the GPU. It just means that you're using the GPU at a higher rate. In this case I've measured exactly which of your lines reduced CPU utilization, which entails that you spend proportionally more time in your loop using the GPU.

Say GPU time represents 80% of the work and CPU time 20%, and a single iteration is 1 s long: Tg + Tc = 1 s, Tg = 0.8 s, Tc = 0.2 s. You'll then see a mean GPU utilization of 80%. Now say you've optimized CPU utilization (which I argue is the only thing you did), so that Tc = 0.1 s and Tg + Tc = 0.9 s. You'll see a mean GPU utilization of (0.8 s / 0.9 s ≈) 89%.

My point is: increased GPU utilization doesn't mean you've transferred operations to the GPU. It just means your GPU does operations more frequently, but they might be the exact same operations. To really determine that the GPU is doing more operations you'd have to do more work, for example isolating the GPU operations in another thread, or finding a way to list the GPU operations happening in real time.
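
The same arithmetic as a short calculation:

# Mean GPU utilization in a sequential loop ~= Tg / (Tg + Tc).
Tg, Tc = 0.8, 0.2
print(Tg / (Tg + Tc))  # 0.8    -> ~80% utilization
Tc = 0.1               # CPU part optimized; GPU work unchanged
print(Tg / (Tg + Tc))  # ~0.889 -> ~89% utilization, same GPU operations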

I ran your lines, measuring GPU utilization at the same time. And the most telling thing that comes out, is that :

Using my model and these lines I get

(1)

cpu_time = 0
gpu_time = 0
num_iterations = 100 
for i in range(num_iterations):
    t1 = time.time()
    inputs[0].host[:allocate_place] = input.flatten(order="C")
    t2 = time.time()
    _ = [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    _ = [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    stream.synchronize()
    t3 = time.time()
    cpu_time+=t2-t1
    gpu_time+=t3-t2

print("Average CPU time :", round(1000 * cpu_time/num_iterations, 2), "ms")
print("Average GPU time :", round(1000 * gpu_time/num_iterations, 2), "ms")

Average CPU time : 1.97 ms
Average GPU time : 63.55 ms
I report with nvidia-smi around 95-96% utilization.

Using instead (2)

    inputs[0].host[:allocate_place] = input.flatten(order="C").astype(np.float32)

I get:
Average CPU time : 4.12 ms
Average GPU time : 63.51 ms
I report with nvidia-smi around 93% utilization.

And now using this line : (3)

    inputs[0].host = input.astype(np.float32)

Average CPU time : 1.32 ms
Average GPU time : 63.96 ms

I report with nvidia-smi around 95-96% utilization.

(1) + (2) show exactly what I've been talking about. You've only reduced CPU computations, which makes your mean GPU utilization climb because its functions are called more often, but that does not mean the GPU is doing computations that the CPU did at first. It just means we're waiting less for the CPU.

(3) shows that if you do the conversion to float32 yourself in this way, you're even faster. Depending on how CPU-bound your task is, I suggest you try it. I can't determine whether (1) is doing the conversion on the CPU, but it does make the question moot, because some superfluous computation on the CPU has been added!

SthPhoenix commented 2 years ago

Well, I have run more tests... it looks like you're right about the volume sent to the GPU being the same.

I'm still not sure where the actual conversion takes place, but calling inputs[0].host[:allocate_place] = input.flatten(order="C") with uint8 dtype seems to either write data to part of the preallocated buffer or indeed call a type cast under the hood. Either way, the amount of data sent to the GPU is the same in both cases, with a small throughput increase in the uint8 case due to the improved CPU performance you pointed out before.

I have measured PCIe<=>GPU Tx/Rx speeds with nvidia-smi dmon -s t, and there is an actual 4x difference when using onnxruntime-gpu with a model with uint8 inputs vs TensorRT with or without the proposed fix.

Though now more questions arise in my usage scenario: the actual performance gain per GPU in my case is about 5-10%, which might be explained by your previous tests. But that shouldn't lead to an overall performance increase of up to 2x in multi-GPU scenarios, especially since with this fix CPU usage actually grew along with performance.

Hyrtsi commented 2 years ago

I solved this issue by replacing the uint8 in the model with int64. Is this possible in your case?

dmenig commented 2 years ago

I solved this issue by replacing the uint8 in the model with int64. Is this possible in your case?

uint8 is meant to save bandwidth or CPU resources by not doing a conversion. int64 would be even more trouble than fp32.

Hyrtsi commented 2 years ago

I solved this issue by replacing the uint8 in the model with int64. Is this possible in your case?

uint8 is meant to save bandwidth or CPU resources by not doing a conversion. int64 would be even more trouble than fp32.

I understand your point. But isn't TensorRT converting the weights from int64_min...int64_max to uint8_min...uint8_max? I don't think this will cause a big footprint, especially if there are not many such values in the network.

Is using int8 an option? https://docs.nvidia.com/deeplearning/tensorrt/api/python_api/infer/FoundationalTypes/DataType.html

I think it's quite clear that if uint8 can't be supported then we must convert it to something TensorRT (and ONNX) supports. Please let me know if this works.

dmenig commented 2 years ago

Seems like it's "on the roadmap": https://github.com/NVIDIA/TensorRT/issues/1503.

Anyway, I'm concerned with speed issues, but I guess you're only talking about successful execution.