onnx / tensorflow-onnx

Convert TensorFlow, Keras, TensorFlow.js and TFLite models to ONNX
Apache License 2.0

CombinedNonMaxSuppression is not supported in ONNX #1337

Closed jan-golda closed 3 years ago

jan-golda commented 3 years ago

Describe the bug I am trying to convert MaskRCNN in TensorFlow 2 to ONNX, but the conversion fails because the CombinedNonMaxSuppression op is not supported in ONNX.

Urgency It is blocking the use of MaskRCNN in ONNX.

System information

To Reproduce Try to convert any model that uses tf.image.combined_non_max_suppression
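
A minimal repro along these lines (a sketch assuming tf2onnx's from_function entry point; shapes and thresholds are illustrative):

import tensorflow as tf
import tf2onnx

@tf.function
def cnms_fn(boxes, scores):
    # boxes: [batch, num_boxes, q, 4], scores: [batch, num_boxes, num_classes]
    return tf.image.combined_non_max_suppression(
        boxes, scores,
        max_output_size_per_class=10,
        max_total_size=20,
        iou_threshold=0.5,
        score_threshold=0.1)

spec = (tf.TensorSpec([1, 100, 3, 4], tf.float32, name="boxes"),
        tf.TensorSpec([1, 100, 3], tf.float32, name="scores"))

# Fails with "CombinedNonMaxSuppression is not supported" at the time of this issue.
tf2onnx.convert.from_function(cnms_fn, input_signature=spec, opset=13,
                              output_path="cnms_repro.onnx")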

Expected behaviour The model should be converted without failing on the CombinedNonMaxSuppression op.

Additional context This was already reported in #847 for YOLO, and I have tried to apply the workaround from there: replace CombinedNonMaxSuppression with NonMaxSuppression accompanied by a set of ops meant to recreate the "Combined" part.

I have tried to get it to work for a few days, but in the case of MaskRCNN this seems to be more complicated than in the case of YOLO. I had to apply NonMaxSuppression for each class in each sample of the batch separately, then pad it, select the top results for each class, then the top results for each box, retrieve the class and score information, pad it again, and finally gather the results across the batch.

There is a reason why this op was added to TF, as recreating it from scratch is quite complicated, so I would like to ask if you could add support for it in ONNX.

Moreover, when I ran the model with the changes partially applied, I observed a significant performance drop under automatic mixed precision. In short, replacing CombinedNonMaxSuppression has a noticeable impact on the original TF model, which is not ideal.

TomWildenhain-Microsoft commented 3 years ago

Hi @jan-golda, to convert this op we can either 1) compose it out of ONNX ops or 2) implement it as a custom op. From your experimenting in TF, it seems you are finding a composition of ops to be expensive. We have previously built relatively complicated ops out of compositions with decent performance, but sometimes it can't be done. In this case, I suspect we can get an efficient composition. Are you using any loops in your implementation, or are you using only batched tensor ops + Gather?

jan-golda commented 3 years ago

Hi @TomWildenhain-Microsoft. Well, I was trying to use NonMaxSuppression, but since it supports neither batching nor class-wise suppression, I had to iterate over both the batch and the classes. I implemented that using two nested map_fn calls accompanied by a lot of padding/stacking/reshaping. I expect I could just use a for loop for that, but I have no idea what impact this would have on performance.
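
A minimal sketch of that nested map_fn approach (illustrative names and parameters; it uses tf.image.non_max_suppression_padded to avoid manual padding, and the final cross-class top-k and valid-count bookkeeping are omitted):

import tensorflow as tf

def per_class_nms(boxes, scores, max_out=100, iou_thresh=0.5):
    # boxes:  [batch, num_boxes, num_classes, 4] (non-shared boxes)
    # scores: [batch, num_boxes, num_classes]
    def one_class(args):
        cls_boxes, cls_scores = args  # [num_boxes, 4], [num_boxes]
        idx, num_valid = tf.image.non_max_suppression_padded(
            cls_boxes, cls_scores, max_out, iou_thresh, pad_to_max_output_size=True)
        sel_boxes = tf.gather(cls_boxes, idx)    # [max_out, 4]
        sel_scores = tf.gather(cls_scores, idx)  # [max_out]
        # zero out the padded entries so they never survive a later top-k
        valid = tf.cast(tf.range(max_out) < num_valid, cls_scores.dtype)
        return sel_boxes, sel_scores * valid

    def one_sample(args):
        sample_boxes, sample_scores = args  # [num_boxes, num_classes, 4], [num_boxes, num_classes]
        # iterate over classes by moving the class axis to the front
        return tf.map_fn(
            one_class,
            (tf.transpose(sample_boxes, [1, 0, 2]), tf.transpose(sample_scores, [1, 0])),
            fn_output_signature=(tf.float32, tf.float32))

    # iterate over the batch
    return tf.map_fn(one_sample, (boxes, scores),
                     fn_output_signature=(tf.float32, tf.float32))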

guschmue commented 3 years ago

It should be possible to add support for CombinedNonMaxSuppression since the ONNX NMS op supports batching. We actually unsqueeze the input to get a batch size of 1.
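
For reference, a standalone sketch of the batched ONNX NonMaxSuppression op being referred to, built with the onnx helper API (illustrative shapes; this is not tf2onnx's actual implementation):

import onnx
from onnx import TensorProto, helper

# ONNX NonMaxSuppression is already batched: boxes are [num_batches, spatial_dimension, 4],
# scores are [num_batches, num_classes, spatial_dimension], and selected_indices is
# [num_selected, 3] = (batch_index, class_index, box_index).
nms_node = helper.make_node(
    "NonMaxSuppression",
    inputs=["boxes", "scores", "max_output_boxes_per_class", "iou_threshold", "score_threshold"],
    outputs=["selected_indices"])

graph = helper.make_graph(
    [nms_node], "batched_nms",
    inputs=[
        helper.make_tensor_value_info("boxes", TensorProto.FLOAT, [4, 1000, 4]),
        helper.make_tensor_value_info("scores", TensorProto.FLOAT, [4, 90, 1000]),
    ],
    outputs=[helper.make_tensor_value_info("selected_indices", TensorProto.INT64, [None, 3])],
    initializer=[
        helper.make_tensor("max_output_boxes_per_class", TensorProto.INT64, [1], [1000]),
        helper.make_tensor("iou_threshold", TensorProto.FLOAT, [1], [0.5]),
        helper.make_tensor("score_threshold", TensorProto.FLOAT, [1], [0.0]),
    ])

model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])
onnx.checker.check_model(model)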

jan-golda commented 3 years ago

@guschmue nice to hear that!

Do you think it would be possible to implement that in the near future?

TomWildenhain-Microsoft commented 3 years ago

Yep, I'm working on it! Hopefully by end of the week.

TomWildenhain-Microsoft commented 3 years ago

What is the dimension of the scores tensor for your model? TF has 2 different behaviors (are you sharing boxes across classes?)

jan-golda commented 3 years ago

Sorry for the late reply!

There are two separate places in the code where this op is used. Below you will find some example shapes for these two places:

place 1:
  boxes:  [4, 1000, 90, 4]
  scores: [4, 1000, 90]
place 2:
  boxes:  [4, 209664, 1, 4]
  scores: [4, 209664, 1]

So the answer is no, I am not sharing the boxes across classes, since the third dimension of boxes always equals the third dimension of scores.

TomWildenhain-Microsoft commented 3 years ago

Well... just finished implementing it for the other version: #1376. Non-sharing is a little harder since ONNX does share boxes across classes for NMS. I could just make the boxes shared and zero out the score for all but one class, but that would be a lot of zeros (90 per box, so 4 * 1000 * 90 * 90 ≈ 32 million score entries, probably too large). Place 2 should work fine with the current implementation since there is only 1 class. For place 1, what are max_total_size and max_output_size_per_class?
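
A toy illustration (not the converter's code) of that "share the boxes and zero out the other classes" idea, and of where the ~32 million figure comes from:

import tensorflow as tf

# With the real place-1 sizes, the masked score tensor would have
# batch * boxes * classes * classes = 4 * 1000 * 90 * 90 ≈ 32.4 million entries.
B, N, C = 2, 8, 3                                  # small illustrative sizes
boxes = tf.random.uniform([B, N, C, 4])            # per-class (non-shared) boxes
scores = tf.random.uniform([B, N, C])

shared_boxes = tf.reshape(boxes, [B, N * C, 4])    # every (box, class) pair becomes its own box
mask = tf.eye(C)                                   # [C, C]: keep only each box's own class
masked = scores[:, :, :, None] * mask[None, None]  # [B, N, C, C], mostly zeros
shared_scores = tf.reshape(tf.transpose(masked, [0, 3, 1, 2]), [B, C, N * C])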

jan-golda commented 3 years ago

max_output_size_per_class=1000,
max_total_size=100

TomWildenhain-Microsoft commented 3 years ago

For experimental purposes, can you try testing the performance of the CombinedNonMaxSuppression implementation I've done so far? Making boxes non-shared will be a decent bit harder but should have similar perf, so it would be nice to know whether the perf is sufficient. Just add a slice before your CombinedNonMaxSuppression to cut the class dim from 90 to 1 and see how the perf of the conversion to ONNX compares to TF.

If the perf is not good, we may have to use a custom op or try a different implementation approach.
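
A sketch of what that slicing experiment could look like in the TF model (hypothetical names; boxes/scores are the place-1 tensors and the parameters are the ones quoted above; the slice is only for measuring perf, not a real fix):

boxes_1cls = boxes[:, :, :1, :]    # [4, 1000, 90, 4] -> [4, 1000, 1, 4]
scores_1cls = scores[:, :, :1]     # [4, 1000, 90]    -> [4, 1000, 1]
detections = tf.image.combined_non_max_suppression(
    boxes_1cls, scores_1cls,
    max_output_size_per_class=1000,
    max_total_size=100)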

guschmue commented 3 years ago

assume this is resolved.

PINTO0309 commented 2 years ago

Pure ONNX Multi-Class NonMaximumSuppression, CombinedNonMaxSuppression: https://github.com/PINTO0309/yolact_edge_onnx_tensorrt_myriad

hwangdeyu commented 2 years ago

Pure ONNX Multi-Class NonMaximumSuppression, CombinedNonMaxSuppression: https://github.com/PINTO0309/yolact_edge_onnx_tensorrt_myriad

What a cool job!☺

Kimyuhwanpeter commented 4 months ago

@hwangdeyu I used cv::dnn::NMSBoxes. I can share my work, but it is onnxruntime C++, not tflite. For TensorFlow YOLOv8 I modified the code as below (for Python); I did not include NMS in the Python export but did it in onnxruntime C++.

python

import tensorflow as tf
import keras_cv
import keras

import model
import config as CONFIG
import loss_v2 as lo
import tf2onnx

from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2
from keras_cv.src.backend import ops

if __name__ == "__main__":
    # h5 to SavedModel
    print(r"h5 -> SavedModel")
    nms = keras_cv.layers.NonMaxSuppression(bounding_box_format=CONFIG.CONFIG.box_format,
                                    from_logits=False,
                                    iou_threshold=CONFIG.CONFIG.iou,
                                    confidence_threshold=CONFIG.CONFIG.conf,
                                    max_detections=CONFIG.CONFIG.max_detect)    
    backbone = keras_cv.models.YOLOV8Backbone.from_preset(  # feel free to swap the backbone here
        "yolo_v8_l_backbone_coco")
    yolo = model.yolov8_model(backbone, CONFIG.CONFIG.nc, CONFIG.CONFIG.box_format)

    yolo.load_weights(CONFIG.CONFIG.save_path + "/yolov8_halmet_color.h5")        
    preds = yolo.outputs[0]
    yolo.outputs[0] = tf.reshape(preds, 
                            [-1, 4, CONFIG.CONFIG.BOX_REGRESSION_CHANNELS // 4])
    yolo.outputs[0] = tf.linalg.matmul(ops.nn.softmax(yolo.outputs[0], axis=-1),
                ops.arange(CONFIG.CONFIG.BOX_REGRESSION_CHANNELS // 4, dtype="float32")[..., None])
    yolo.outputs[0] = tf.squeeze(yolo.outputs[0], -1)

    anchor_points, stride_tensor = lo.get_anchors(image_shape=(CONFIG.CONFIG.img_size, CONFIG.CONFIG.img_size))
    stride_tensor = ops.expand_dims(stride_tensor, axis=-1)

    yolo.outputs[0] = lo.dist2bbox(yolo.outputs[0], anchor_points) * stride_tensor  # box shape is problem?!?!?!?!?

    yolo = tf.keras.Model(inputs=yolo.inputs, outputs=yolo.outputs)
    yolo.summary()
    tf2onnx.convert.from_keras(
        yolo,
        output_path="/yhkim/yhkim/yuhwan_project/CustomNew_tensor_detection/v1/checkpoint/SavedModel/model.onnx",
        opset=13)

    yolo.save(filepath=CONFIG.CONFIG.savedmodel_path, save_format='tf')

onnxruntime C++

std::vector<Detection> LDetector::postprocessing(const cv::Size& resizedImageShape,
    const cv::Size& originalImageShape,
    std::vector<Ort::Value>& outputTensors,
    const float& confThreshold, const float& iouThreshold)
{
    // Get the output tensor data and shape
    auto* rawOutputBoxes = outputTensors[0].GetTensorData<float>(); 
    std::vector<int64_t> outputShapeBoxes = outputTensors[0].GetTensorTypeAndShapeInfo().GetShape();
    size_t countBoxes = outputTensors[0].GetTensorTypeAndShapeInfo().GetElementCount();

    auto* rawOutputcls = outputTensors[1].GetTensorData<float>();
    std::vector<int64_t> outputShapeClasses = outputTensors[1].GetTensorTypeAndShapeInfo().GetShape();
    size_t countClass = outputTensors[1].GetTensorTypeAndShapeInfo().GetElementCount();

    std::vector<float> outputs(rawOutputBoxes, rawOutputBoxes + countBoxes);
    std::vector<float> CalsOutputs(rawOutputcls, rawOutputcls + countClass);

    // reshape the flat outputs into [1, 8400, 4] boxes and [1, 8400, 2] per-class scores
    std::vector<std::vector<std::vector<float>>> outputs_reshaped(1, std::vector<std::vector<float>>(8400, std::vector<float>(4, 0.0f)));
    std::vector<std::vector<std::vector<float>>> cles_reshaped(1, std::vector<std::vector<float>>(8400, std::vector<float>(2, 0.0f)));

    std::vector<BoundingBox> outputs_reshaped_new(8400);
    std::vector<int> predClassIds;
    std::vector<cv::Rect> predBoxes;
    std::vector<float> predConfidences;

    for (int i = 0; i < 8400; i++) {
        for (int j = 0; j < 4; j++) {          
            if (j >= 0 && j < 2) {
                cles_reshaped[0][i][j] = CalsOutputs[i * 2 + j];
                outputs_reshaped[0][i][j] = outputs[i * 4 + j];
            }
            else
                outputs_reshaped[0][i][j] = outputs[i * 4 + j];
        }

        outputs_reshaped_new[i].x1 = outputs_reshaped[0][i][0];
        outputs_reshaped_new[i].y1 = outputs_reshaped[0][i][1];
        outputs_reshaped_new[i].x2 = outputs_reshaped[0][i][2];
        outputs_reshaped_new[i].y2 = outputs_reshaped[0][i][3];

        auto max_value_it = std::max_element(cles_reshaped[0][i].begin(), cles_reshaped[0][i].end());
        outputs_reshaped_new[i].score = *max_value_it;
        outputs_reshaped_new[i].classId = std::distance(cles_reshaped[0][i].begin(), max_value_it);

        float xmin = outputs_reshaped[0][i][0];
        float ymin = outputs_reshaped[0][i][1];
        float xmax = outputs_reshaped[0][i][2];
        float ymax = outputs_reshaped[0][i][3];
        float width = xmax - xmin;
        float height = ymax - ymin;

        // clip the box to the network input size
        float x = std::max(0.0f, std::min(xmin, static_cast<float>(this->inputImageShape.width - 1)));
        float y = std::max(0.0f, std::min(ymin, static_cast<float>(this->inputImageShape.height - 1)));
        width = std::max(0.0f, std::min(width, this->inputImageShape.width - x));
        height = std::max(0.0f, std::min(height, this->inputImageShape.height - y));
        predBoxes.emplace_back(x, y, width, height);
        predClassIds.push_back(std::distance(cles_reshaped[0][i].begin(), max_value_it));
        predConfidences.push_back(static_cast<float>(*max_value_it));

    }
    std::vector<Detection> detections;
    std::vector<int> indices;
    cv::dnn::NMSBoxes(predBoxes, predConfidences, confThreshold, iouThreshold, indices, 1.0);
    for (int idx : indices) {
        if (predBoxes[idx].width > 0. && predBoxes[idx].height > 0.) {
            Detection det;
            det.box = cv::Rect(predBoxes[idx]);
            //utils::scaleCoords(resizedImageShape, det.box, originalImageShape);

            det.conf = predConfidences[idx];
            det.classId = predClassIds[idx];
            detections.emplace_back(det);
        }
    }

    return detections;
}

It works fine. I hope it will be of help.