marcoslucianops / DeepStream-Yolo

NVIDIA DeepStream SDK 7.0 / 6.4 / 6.3 / 6.2 / 6.1.1 / 6.1 / 6.0.1 / 6.0 / 5.1 implementation for YOLO models
MIT License

Problem decoding batch inference with the new optimized NMS #325

Closed · jstumpin closed this issue 1 year ago

jstumpin commented 1 year ago

Single inference works, but batch inference fails (only the first instance is successful) when using this commit: New optimized NMS. I've been using the following code snippet to perform decoding, and it worked prior to that commit (https://github.com/NVIDIA-AI-IOT/yolo_deepstream/blob/main/tensorrt_yolov4/source/SampleYolo.cpp#L804-L845):

std::vector<std::vector<BoundingBox>> SampleYolo::get_bboxes(int batch_size, int keep_topk,
    int32_t *num_detections, float *nmsed_boxes, float *nmsed_scores, float *nmsed_classes)
{
    int n_detect_pos = 0;
    int box_pos = 0;
    int score_pos = 0;
    int cls_pos = 0;

    std::vector<std::vector<BoundingBox>> bboxes {static_cast<size_t>(batch_size)};

    for (int b = 0; b < batch_size; ++b)
    {
        for (int t = 0; t < keep_topk; ++t)
        {
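            // A negative class index marks the end of the valid detections
            // for this image; the remaining top-K slots are padding.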
            if (static_cast<int>(nmsed_classes[cls_pos + t]) < 0)
            {
                break;
            }

            int box_coord_pos = box_pos + 4 * t;
            float x1 = nmsed_boxes[box_coord_pos];
            float y1 = nmsed_boxes[box_coord_pos + 1];
            float x2 = nmsed_boxes[box_coord_pos + 2];
            float y2 = nmsed_boxes[box_coord_pos + 3];

            bboxes[b].push_back(BoundingBox {
                std::min(x1, x2),
                std::min(y1, y2),
                std::max(x1, x2),
                std::max(y1, y2),
                nmsed_scores[score_pos + t],
                static_cast<int>(nmsed_classes[cls_pos + t]) });
        }

        n_detect_pos += 1;
        box_pos += 4 * keep_topk;
        score_pos += keep_topk;
        cls_pos += keep_topk;
    }

    return bboxes;
}

Adapting it to that commit brings me to this:

std::vector<std::vector<BoundingBox>> SampleYolo::get_bboxes(int batch_size,
    int *num_detections, float *nmsed_boxes, float *nmsed_scores, int *nmsed_classes)
{
    int detect_pos = 0;
    int box_pos = 0;

    std::vector<std::vector<BoundingBox>> bboxes {static_cast<size_t>(batch_size)};

    for (int b = 0; b < batch_size; ++b)
    {
        for (int t = 0; t < num_detections[b]; ++t)
        {
            int box_coord_pos = box_pos + 4 * t;
            float x1 = nmsed_boxes[box_coord_pos];
            float y1 = nmsed_boxes[box_coord_pos + 1];
            float x2 = nmsed_boxes[box_coord_pos + 2];
            float y2 = nmsed_boxes[box_coord_pos + 3];

            bboxes[b].push_back(BoundingBox {
                std::min(x1, x2),
                std::min(y1, y2),
                std::max(x1, x2),
                std::max(y1, y2),
                nmsed_scores[detect_pos + t],
                static_cast<int>(nmsed_classes[detect_pos + t]) });
        }

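        // Advance by the number of valid detections, assuming the outputs
        // are densely packed; this assumption turns out to be wrong (see below).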
        detect_pos += num_detections[b];
        box_pos += 4 * num_detections[b];
    }

    return bboxes;
}

The size of num_detections is correct (i.e. equal to batch_size), so num_detections is the only output holding the right values. nmsed_boxes, nmsed_scores, and nmsed_classes hold nothing beyond num_detections[0]: e.g. nmsed_classes[num_detections[0] - 1] is correct, but nmsed_classes[num_detections[1] - 1] reads back null.
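
For reference, a quick probe along these lines (a sketch, not an actual run; keep_topk and the pointer types are assumed from the snippets above) tells a densely packed layout apart from one padded per image:

#include <cstdio>

// Probe whether the NMS outputs are densely packed or padded per image (sketch).
void probe_layout(const int *num_detections, const int *nmsed_classes,
    int batch_size, int keep_topk)
{
    int dense_pos = 0;
    for (int b = 0; b < batch_size; ++b)
    {
        // Densely packed: image b starts right after image b-1's detections.
        // Padded: image b starts at a fixed stride of keep_topk entries.
        std::printf("image %d: n=%d dense[%d]=%d padded[%d]=%d\n",
            b, num_detections[b],
            dense_pos, nmsed_classes[dense_pos],
            b * keep_topk, nmsed_classes[b * keep_topk]);
        dense_pos += num_detections[b];
    }
}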

I'm not using DeepStream, hence NVIDIA's standalone version on the inference front; I'm merging it with your repo for wider YOLO variant support. Anything else that I missed @marcoslucianops?

jstumpin commented 1 year ago

The increment value was wrong: the NMS outputs are padded to a fixed number of top-K slots per image, so the per-image stride must be the fixed output size rather than num_detections[b]:

std::vector<std::vector<BoundingBox>> SampleYolo::get_bboxes(int batch_size, int output_size,
    int *num_detections, float *nmsed_boxes, float *nmsed_scores, int *nmsed_classes)
{
    int detect_pos = 0;
    int box_pos = 0;

    std::vector<std::vector<BoundingBox>> bboxes {static_cast<size_t>(batch_size)};

    for (int b = 0; b < batch_size; ++b)
    {
        for (int t = 0; t < num_detections[b]; ++t)
        {
            int box_coord_pos = box_pos + 4 * t;
            float x1 = nmsed_boxes[box_coord_pos];
            float y1 = nmsed_boxes[box_coord_pos + 1];
            float x2 = nmsed_boxes[box_coord_pos + 2];
            float y2 = nmsed_boxes[box_coord_pos + 3];

            bboxes[b].push_back(BoundingBox {
                std::min(x1, x2),
                std::min(y1, y2),
                std::max(x1, x2),
                std::max(y1, y2),
                nmsed_scores[detect_pos + t],
                static_cast<int>(nmsed_classes[detect_pos + t]) });
        }

        detect_pos += output_size;  // rectified increment: fixed per-image stride
        box_pos += output_size * 4; // rectified increment: 4 coordinates per slot
    }

    return bboxes;
}
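
To make the stride concrete: with, say, keep_topk = 100 slots per image and num_detections = {5, 3} (illustrative numbers only), image 1's scores start at flat index 100, not at index 5; stepping by num_detections[b] as before lands in the zero-padded tail of image 0's slot, which is exactly the empty-beyond-num_detections[0] symptom. The indexing boils down to:

// Flat-index helpers for the padded NMS outputs (a sketch; output_size is the
// fixed per-image slot count computed below).
inline int score_index(int b, int t, int output_size) { return b * output_size + t; }
inline int class_index(int b, int t, int output_size) { return b * output_size + t; }
inline int box_index(int b, int t, int output_size)   { return (b * output_size + t) * 4; }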

where output_size is obtained as follows (reference: https://github.com/NVIDIA/TensorRT/blob/release/8.6/samples/common/buffers.h#L313-L319):

// fixed per-image element count of the detection_classes output (= top-K slots)
int index = mEngine->getBindingIndex("detection_classes");
output_size = mManagedBuffers[index]->hostBuffer.size() / batch_size;
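
Alternatively, a sketch that derives the same per-image stride from the engine's binding dimensions instead of the managed host-buffer size (TensorRT 8.x API; assumes an explicit-batch engine with static shapes and the batch in dims.d[0]):

#include <NvInfer.h>

// Per-image element count of a named binding (a sketch, not from this thread).
int per_image_stride(const nvinfer1::ICudaEngine &engine, const char *name)
{
    int index = engine.getBindingIndex(name);
    nvinfer1::Dims dims = engine.getBindingDimensions(index);
    int stride = 1;
    for (int d = 1; d < dims.nbDims; ++d) // skip the batch dimension
        stride *= dims.d[d];
    return stride;
}

// e.g. output_size = per_image_stride(*mEngine, "detection_classes");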

Indeed, the new optimized NMS shaves a good chunk of milliseconds off my batch inference (batch_size = 8, 512x512, YOLOv4x). Keep up the good work, thanks! @marcoslucianops

marcoslucianops commented 1 year ago

The output of the YOLO models in this repo is adjusted to get more performance on DeepStream, so it's not equal to other implementations.