jkjung-avt / tensorrt_demos

TensorRT MODNet, YOLOv4, YOLOv3, SSD, MTCNN, and GoogLeNet
https://jkjung-avt.github.io/
MIT License

DetectNet_v2 tensorrt License Plate Detection (LPDNet) #441

Closed C-monC closed 3 years ago

C-monC commented 3 years ago

Hi,

I have made a TensorRT engine of the model downloadable from here:

tlt-converter -k nvidia_tlt -d 3,480,640 -p image_input,1x3x480x640,4x3x480x640,16x3x480x640 usa_pruned.etlt -t fp16 -e lpd_engine.trt

The model is based on DetectNet_v2. Has anyone managed to get this or another DetectNet_v2 model working with TensorRT in Python?

Here is the code I have so far:

import os
import time

import cv2
#import matplotlib.pyplot as plt
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
import pdb

class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()

def load_engine(trt_runtime, engine_path):
    with open(engine_path, "rb") as f:
        engine_data = f.read()
    engine = trt_runtime.deserialize_cuda_engine(engine_data)
    return engine

# Allocates all buffers required for an engine, i.e. host/device inputs/outputs.
def allocate_buffers(engine, batch_size=1):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        # pdb.set_trace()
        size = trt.volume(engine.get_binding_shape(binding)) * batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
            print(f"input: shape:{engine.get_binding_shape(binding)} dtype:{engine.get_binding_dtype(binding)}")
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
            print(f"output: shape:{engine.get_binding_shape(binding)} dtype:{engine.get_binding_dtype(binding)}")
    return inputs, outputs, bindings, stream

def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async(
        batch_size=batch_size, bindings=bindings, stream_handle=stream.handle
    )
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]

os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# TensorRT logger singleton
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_engine_path = "lpd.trt"

trt_runtime = trt.Runtime(TRT_LOGGER)
# pdb.set_trace()
trt_engine = load_engine(trt_runtime, trt_engine_path)
# Execution context is needed for inference
context = trt_engine.create_execution_context()
# This allocates memory for network inputs/outputs on both CPU and GPU
inputs, outputs, bindings, stream = allocate_buffers(trt_engine)

# pdb.set_trace()
image = cv2.imread("car.jpg")
image = cv2.resize(image, (640, 480))

np.copyto(inputs[0].host, image.ravel())

outputs = do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
test1 = np.reshape(outputs[0], (4, 30, 40))
test2 = np.reshape(outputs[1], (1, 30, 40))
print(outputs)

Exporting the TensorRT engine and then running this code works. I am just unable to interpret the output. The output is supposedly a

40x30x12 bbox coordinate tensor and a 40x30x3 class confidence tensor.

So I then reshape it into these dimensions, but the bbox coordinate tensor contains fractional values that I can't convert to pixel coords.

I hope this is relevant to the repo - I think it could be a nice addition once it works.

jkjung-avt commented 3 years ago

Exporting the TensorRT engine and then running this code works. I am just unable to interpret this output.

I have not used DetectNet_v2 yet, but I think this should be easy to fix. Is there any DetectNet_v2 inference code we can reference? It could be in Python, C++, or any other language.

C-monC commented 3 years ago

Yes, check this link: https://github.com/dusty-nv/jetson-inference/blob/master/c/detectNet.cpp#L815

C-monC commented 3 years ago

The LPDNet docs mention:

The raw normalized bounding-box and confidence detections need to be post-processed by a clustering algorithm such as DBSCAN or NMS to produce the final bounding-box coordinates and category labels.

I tried scikit-learn's DBSCAN clustering, but it returns an array of 0's.
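
For reference, a minimal sketch of what DBSCAN clustering of the raw detections might look like with scikit-learn, assuming the outputs have already been flattened into (N, 4) normalized boxes and (N, 1) confidence scores as worked out later in this thread (the eps and min_samples values are illustrative guesses, not tuned for LPDNet):

import numpy as np
from sklearn.cluster import DBSCAN

def cluster_detections(bboxes, confs, conf_thresh=0.3, eps=0.05, min_samples=2):
    # Keep only candidates above the (arbitrary) confidence threshold.
    candidates = bboxes[confs[:, 0] > conf_thresh]
    if len(candidates) == 0:
        return np.empty((0, 4))
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(candidates)
    # Average each cluster into a single box; label -1 marks noise points.
    merged = [candidates[labels == lbl].mean(axis=0)
              for lbl in set(labels) if lbl != -1]
    return np.array(merged)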

jkjung-avt commented 3 years ago

You need to fix the preprocessing code first. Refer to: https://github.com/dusty-nv/jetson-inference/blob/19ed62150b3e9499bad2ed6be1960dd38002bb7d/c/detectNet.cpp#L729

The input tensor for DetectNet seems to be CHW-ordered, RGB, float32, ranging from -1.0 to +1.0. (Please verify for yourself whether it's CHW or HWC order.)

So you should do something like this (assuming 640x480 is the correct dimension of the DetectNet input):

image = cv2.imread("car.jpg")
image = cv2.resize(image, (640, 480))
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)         # BGR -> RGB
image = image.transpose((2, 0, 1)).astype(np.float32)  # HWC -> CHW, uint8 -> float32
image = (image - 127.5) / 127.5                        # [0, 255] -> [-1.0, +1.0]

np.copyto(inputs[0].host, image.ravel())
jkjung-avt commented 3 years ago

As to postprocessing, refer again to DetectNet code in jetson_inference repo: https://github.com/dusty-nv/jetson-inference/blob/19ed62150b3e9499bad2ed6be1960dd38002bb7d/c/detectNet.cpp#L813-L861

Output tensor 0 should be "conf", while output tensor 1 "bbox". So I think you should do something like:

confs = np.reshape(outputs[0], (-1, 3))
bboxes = np.reshape(outputs[1], (-1, 4))

Where "confs" are confidence scores of the 3 classes for each potential detection, and "bboxes" are coordinates of the potential detections. More specifically, "bboxes" are "(x1, y1, x2, y2)" (or ("Left", "Top", "Right", "Bottom")) coordinates ranged from 0.0 to 1.0. You'll need to multiply them with image width and height, e.g. 640 and 480, to get the pixel coordinates on the original image.

I hope this helps and you'll be able to fix the code by yourself.

C-monC commented 3 years ago

Thanks for the help. The confidence output matches the image perfectly.

The bbox output I don't quite understand. Is this the correct process? Use the confidence tensor (30, 40) to select cells above a confidence threshold, then in the bbox tensor (4, 30, 40) use the selected cells to get the 4 corner coords.

In the LPDNet documentation they mention a "40x30x12 bbox coordinate tensor", but the array won't reshape to that as it has a length of 4800.
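
For what it's worth, a sketch of the process described above under a single-class assumption (so conf reshapes to (30, 40) and bbox to (4, 30, 40), with outputs[1] taken to be the shorter confidence tensor; the 0.3 threshold is arbitrary):

conf_grid = np.reshape(outputs[1], (30, 40))
bbox_grid = np.reshape(outputs[0], (4, 30, 40))
ys, xs = np.where(conf_grid > 0.3)       # grid cells above the threshold
for y, x in zip(ys, xs):
    x1, y1, x2, y2 = bbox_grid[:, y, x]  # normalized corners for that cell
    print(conf_grid[y, x], x1, y1, x2, y2)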

C-monC commented 3 years ago

I should be able to fix it myself now. Appreciate the help.

C-monC commented 3 years ago

Sorry to open this again. I am struggling to deal with the outputs not being normalised.

The confidence array looks right with confs = np.reshape(outputs[1], (-1, 3)). I assume the shorter outputs[n] will be the confidence. This array is between 0 and 1.

The bboxes, bboxes = np.reshape(outputs[0], (-1, 4)), have values ranging from -0.1 to 5. With the following test image:

[image attachment]

The lengths of the two output arrays also don't match up when reshaped with 3 and 4: len(outputs[0]) = 4800 and len(outputs[1]) = 1200, so confs = np.reshape(outputs[1], (-1, 3)) becomes (400, 3) while bboxes = np.reshape(outputs[0], (-1, 4)) becomes (1200, 4).

The code I'm using is below:

image = cv2.imread("img.png")
image_orig = cv2.resize(image, (640, 480))

image = cv2.cvtColor(image_orig, cv2.COLOR_BGR2RGB)  # BGR -> RGB
image = image.transpose((2, 0, 1)).astype(np.float32)  # HWC -> CHW, uint8 -> float32
image = (image - 127.5) / 127.5

np.copyto(inputs[0].host, image.ravel())

outputs = do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
confs = np.reshape(outputs[1], (-1, 3))
bboxes = np.reshape(outputs[0], (-1, 4))

result = np.where(confs > 0.3)
for index in result[0]:
    rect = bboxes[index]
    test = cv2.rectangle(image_orig, (int(rect[0] * 640), int(rect[1] * 480)), (int(rect[2] * 640), int(rect[3] * 480)),
                         (0, 0, 255), 2)

cv2.imshow("1", test)
jkjung-avt commented 3 years ago

Based on your description, I realize that the number of object classes is only 1 ("license plate"). So you should reshape confs as a (-1, 1) array instead.

confs = np.reshape(outputs[1], (-1, 1))   # (1200, 1)
bboxes = np.reshape(outputs[0], (-1, 4))  # (1200, 4)

In addition, since your original image is 640x430, you should use 430 as the height in this line of code:

    test = cv2.rectangle(image_orig, (int(rect[0] * 640), int(rect[1] * 430)), (int(rect[2] * 640), int(rect[3] * 430)),
                         (0, 0, 255), 2)
C-monC commented 3 years ago

Oh yes, that makes sense.

Do you perhaps know how to deal with bboxes whose values are not in the range 0 to 1? The boxes are now drawn off the image, as some values are negative and some above 1.

Is it possible this line https://github.com/dusty-nv/jetson-inference/blob/19ed62150b3e9499bad2ed6be1960dd38002bb7d/c/detectNet.cpp#L846 causes the issue? I can't see how it could make the values smaller or positive, but it seems I am missing this line.

jkjung-avt commented 3 years ago

Is it possible this line https://github.com/dusty-nv/jetson-inference/blob/19ed62150b3e9499bad2ed6be1960dd38002bb7d/c/detectNet.cpp#L846 causes the issue? I can't see how it could make the values smaller or positive, but it seems I am missing this line.

Nope. That line only calculates the pointer to the corresponding bbox quadruple. Your Python code matches what detectNet.cpp is doing.

jkjung-avt commented 3 years ago

Do you perhaps know how to deal with the bboxes with a range not between 0 and 1?

You only need to care about the bbox values whose corresponding conf score is over the threshold. Could you try printing out those values?
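
For example, something like this (a sketch; 0.3 is an arbitrary threshold):

keep = confs[:, 0] > 0.3  # arbitrary confidence threshold
for conf, box in zip(confs[keep], bboxes[keep]):
    print(conf, box)      # only the boxes worth inspecting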

C-monC commented 3 years ago

I should've sent the output so far. If you ignore the low-confidence values, the bounding boxes are all within 0 and 1 - I thought this was coincidental.

[image attachment]

Could it be that this model is just terrible (the confidence above is ~0.5)? The other images have a max confidence of 0.1:

[image attachment]

The model specs on the NVIDIA container registry seem very incorrect:

Model | Dataset | Accuracy
-- | -- | --
usa_unpruned_model | NVIDIA 3k LPD eval dataset | 98.58%
usa_pruned_model | NVIDIA 3k LPD eval dataset | 98.46%  <- the one I am using
ccpd_unpruned_model | 14% of CCPD-Base dataset | 99.24%
ccpd_pruned_model | 14% of CCPD-Base dataset | 99.22%

The full working code thus far, for anyone who wanders onto this issue:

import os
import cv2
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt

class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()

def load_engine(trt_runtime, engine_path):
    with open(engine_path, "rb") as f:
        engine_data = f.read()
    engine = trt_runtime.deserialize_cuda_engine(engine_data)
    return engine

# Allocates all buffers required for an engine, i.e. host/device inputs/outputs.
def allocate_buffers(engine, batch_size=1):
    inputs = []
    outputs = []
    bindings = []
    stream = cuda.Stream()
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding)) * batch_size
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        # Allocate host and device buffers
        host_mem = cuda.pagelocked_empty(size, dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        # Append the device buffer to device bindings.
        bindings.append(int(device_mem))
        # Append to the appropriate list.
        if engine.binding_is_input(binding):
            inputs.append(HostDeviceMem(host_mem, device_mem))
            print(f"input: shape:{engine.get_binding_shape(binding)} dtype:{engine.get_binding_dtype(binding)}")
        else:
            outputs.append(HostDeviceMem(host_mem, device_mem))
            print(f"output: shape:{engine.get_binding_shape(binding)} dtype:{engine.get_binding_dtype(binding)}")
    return inputs, outputs, bindings, stream

def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Transfer input data to the GPU.
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    # Run inference.
    context.execute_async(
        batch_size=batch_size, bindings=bindings, stream_handle=stream.handle
    )
    # Transfer predictions back from the GPU.
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Synchronize the stream
    stream.synchronize()
    # Return only the host outputs.
    return [out.host for out in outputs]

os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# TensorRT logger singleton
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_engine_path = "lpd.trt"

trt_runtime = trt.Runtime(TRT_LOGGER)
trt_engine = load_engine(trt_runtime, trt_engine_path)
# Execution context is needed for inference
context = trt_engine.create_execution_context()
# This allocates memory for network inputs/outputs on both CPU and GPU
inputs, outputs, bindings, stream = allocate_buffers(trt_engine)

image = cv2.imread("img_2.png")
image_orig = cv2.resize(image, (640, 480))

image = cv2.cvtColor(image_orig, cv2.COLOR_BGR2RGB)  # BGR -> RGB
image = image.transpose((2, 0, 1)).astype(np.float32)  # HWC -> CHW, uint8 -> float32
image = (image - 127.5) / 127.5  # [0, 255] -> [-1.0, +1.0]

np.copyto(inputs[0].host, image.ravel())

outputs = do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
confs = np.reshape(outputs[1], (-1, 1))
bboxes = np.reshape(outputs[0], (-1, 4))

result = np.where(confs > 0.5)
for index in result[0]:
    rect = bboxes[index]
    # Draw in place on image_orig (cv2.rectangle modifies its input).
    cv2.rectangle(image_orig, (int(rect[0] * 640), int(rect[1] * 480)),
                  (int(rect[2] * 640), int(rect[3] * 480)), (0, 0, 255), 2)

cv2.imshow("1", image_orig)

cv2.waitKey()
print(outputs)
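
The LPDNet docs quoted earlier call for DBSCAN or NMS on these raw per-cell detections; a minimal greedy NMS sketch that could be applied to the thresholded boxes (the 0.5 IoU threshold is an illustrative value):

def nms(boxes, scores, iou_thresh=0.5):
    # Greedy NMS over normalized (x1, y1, x2, y2) boxes.
    order = scores.argsort()[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Intersection of the top box with every remaining box.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_rest - inter)
        order = rest[iou < iou_thresh]  # drop boxes overlapping the kept one
    return keep

# e.g. kept = nms(bboxes[result[0]], confs[result[0], 0])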
jkjung-avt commented 3 years ago

confs = np.reshape(outputs[1], (-1, 1))
bboxes = np.reshape(outputs[0], (-1, 4))

Something is not right here...

OUTPUT_CONF is index 0 and OUTPUT_BBOX is index 1 in the jetson_inference code, and the postprocessing code reads:

        float* conf = mOutputs[OUTPUT_CONF].CPU;
        float* bbox = mOutputs[OUTPUT_BBOX].CPU;

The output indices do not match your code. Are you sure the LPDNet model you're using can be processed by the jetson_inference code?

C-monC commented 3 years ago

Yeah, I also noticed that, but the dimensions of the outputs only make sense the way they are now.

If the model is for TensorRT, does that mean it will work with jetson_inference? From the model's documentation:

These models need to be used with NVIDIA Hardware and Software. For Hardware, the models can run on any NVIDIA GPU including NVIDIA Jetson devices. These models can only be used with Transfer Learning Toolkit (TLT), DeepStream SDK or TensorRT.

But thank you for your time. I've managed to train a yolov4 license plate detector in the meantime that works sufficiently and will work with this library.

abhinavvsharma commented 1 year ago

@C-monC I am facing similar issues with DetectNet_v2. According to your last comment, were you using a yolov4 license plate detector in TensorRT format (trt or engine)? And can you share the inference code you used for YOLO inference?

C-monC commented 1 year ago

I trained yolov4-tiny from pretrained weights on about 200 labelled license plates and it started working okay-ish. The inference code was exactly the same as in this repo's Demo 5: train the model, convert it to TRT, and then use trt_yolo.py.

DetectNet_v2 is far worse than YOLO for most tasks.