RetinaNet predict() is too slow

After upgrading keras-cv to v0.6.1 I noticed that predict method of RetinaNet model became really slow comparing with v0.5.1 as result it complicates COCO metrics evaluation.

inputs = [SINGLE_IMAGE]
model = keras_cv.models.RetinaNet.from_preset(...)
model.predict(inputs)

When gettting predictions for a single image in 0.6.1: 1/1 [==============================] - 42s 42s/step

And in 0.5.1: 1/1 [==============================] - 6s 6s/step

Another problem with predictions is that this function throws an exception when passing a generator as an argument. And again, in 0.5.1 it worked perfectly.

y_pred = model.predict(image_generator(...))
TypeError: in user code:

    File "/usr/local/lib/python3.10/dist-packages/keras/engine/training.py", line 2169, in predict_function  *
        return step_function(self, iterator)
    File "/usr/local/lib/python3.10/dist-packages/keras/engine/training.py", line 2155, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/usr/local/lib/python3.10/dist-packages/keras/engine/training.py", line 2143, in run_step  **
        outputs = model.predict_step(data)
    File "/usr/local/lib/python3.10/dist-packages/keras_cv/models/object_detection/retinanet/retinanet.py", line 248, in predict_step
        return self.decode_predictions(outputs, args[-1])
    File "/usr/local/lib/python3.10/dist-packages/keras_cv/models/object_detection/retinanet/retinanet.py", line 289, in decode_predictions
        anchors = self.anchor_generator(image_shape=image_shape)
    File "/usr/local/lib/python3.10/dist-packages/keras_cv/layers/object_detection/anchor_generator.py", line 179, in __call__
        generator(image_shape),
    File "/usr/local/lib/python3.10/dist-packages/keras_cv/layers/object_detection/anchor_generator.py", line 264, in __call__
        0.5 * stride, math.ceil(image_width / stride) * stride, stride

    TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'

I'd stick with the older version until the issue is resolved, but unfortuantely COCO metrics seem to be broken in 0.5.1 as was reported in the following issue https://github.com/keras-team/keras-cv/issues/1994

Colab for reproducing the issue: https://colab.research.google.com/drive/1dzJFiVIxXtJCoj-ShjRyu-ZPkf6ClCdj?usp=sharing

Thanks for reporting the issue. keras-cv was updated to support keras-core along with the existing tf.keras backend and might have had this regression. I will take a look and get back.

On the above error, you could wrap it in tf.data.DataSet.from_generator for your generator function and provide it the proper shape of the image -

from functools import partial

def image_generator(file_path):
    image_batch = preprocess_image(file_path)
    for i in range(3):
        yield image_batch

ds = tf.data.Dataset.from_generator(
    partial(image_generator, image_path),
    output_types=tf.float32,
    output_shapes=(None, 640, 640, 3),
)

y_pred = model.predict(ds)

This will resolve the error.

Thanks for help, the trick with dataset worked fine.

With regards to the slow prediction. It becomes much faster after assigning a custom NMS with any thresholds:

model.prediction_decoder = keras_cv.layers.MultiClassNonMaxSuppression(
    bounding_box_format="xywh",
    from_logits=True,
    iou_threshold=1.0,
    confidence_threshold=0.0
)

However, it cannot be used for training, because COCO metrics validation callback raises an exception. I'm new to the object detection so not sure if it's expected behavior or not.

Traceback (most recent call last):
  File ".../object_detection/model.py", line 570, in <module>
    run_training()
  File ".../object_detection/model.py", line 334, in run_training
    model.fit(
  File "/Users/axel/miniforge3/envs/ml/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/Users/axel/miniforge3/envs/ml/lib/python3.11/site-packages/keras_cv/callbacks/pycoco_callback.py", line 132, in on_epoch_end
    metrics = compute_pycoco_metrics(ground_truth, predictions)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/axel/miniforge3/envs/ml/lib/python3.11/site-packages/keras_cv/metrics/coco/pycoco_wrapper.py", line 219, in compute_pycoco_metrics
    coco_predictions = _convert_predictions_to_coco_annotations(predictions)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/axel/miniforge3/envs/ml/lib/python3.11/site-packages/keras_cv/metrics/coco/pycoco_wrapper.py", line 131, in _convert_predictions_to_coco_annotations
    predictions["detection_boxes"][i][j]
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: too many indices for array: array is 0-dimensional, but 1 were indexed

I have this validation callback in my code:

keras_cv.callbacks.PyCOCOCallback(validation_data=eval_ds, bounding_box_format="xywh", cache=False)

@axelusarov is the slower prediction just for the first step (e.g. due to graph tracing), or is it also that every step is slower?

I suppose that the single-class NMS is the likely issue here in terms of performance.

I don't think this error is expected behavior -- this looks like a bug. I believe that #2030 should fix it. Sorry for the regression and thanks for the issue report!

@ianstenbit prediction is slow for each step. Here is predict() example for 5 images before/after assigning new MultiClassNonMaxSuppression with default parameters to the model.

(ml) % python test.py
Using TensorFlow backend
5/5 [==============================] - 121s 23s/step
(ml) % python test.py
Using TensorFlow backend
5/5 [==============================] - 2s 119ms/step

After using the code from master, validation step is much faster now when using custom NMS and I didn't notice errors this time. I didn't check training quality though.

Yeah this looks like there's some significant slowdown with the single-class NMS for TensorFlow, and it's probably just just due to graph tracing. This is something we should look into further, but I can't prioritize it right now

Was about to write a GitHub issue about this, but yes single-class NMS is much slower then mutli-class. Not just prediction, training is 10x slower. I haven't spent much time looking at the source code, graph tracing, etc., but I can take a look at this to see what is going on.

After looking into this, it seems that NonMaxSuppression calls tf.image.non_max_suppression_padded() in image_ops_impl.py. This calls non_max_suppression_padded_v2 in the same file, which proceeds to conduct NMS in pure Python. This is in contrast to MultiClassNonMaxSuppression, which calls tf.image.combined_non_max_suppression () in image_ops_impl.py. This proceeds to call gen_image_ops.combined_non_max_suppression(), which I believe runs the NMS in C++. This makes makes the 10x speed up possible. Anyone have any idea why this was done? For reference here is tf.image.non_max_suppression_padded():

@tf_export('image.non_max_suppression_padded')
@dispatch.add_dispatch_support
def non_max_suppression_padded(boxes,
                               scores,
                               max_output_size,
                               iou_threshold=0.5,
                               score_threshold=float('-inf'),
                               pad_to_max_output_size=False,
                               name=None,
                               sorted_input=False,
                               canonicalized_coordinates=False,
                               tile_size=512):
  with ops.name_scope(name, 'non_max_suppression_padded'):
    if not pad_to_max_output_size:
      # pad_to_max_output_size may be set to False only when the shape of
      # boxes is [num_boxes, 4], i.e., a single image. We make best effort to
      # detect violations at compile time. If `boxes` does not have a static
      # rank, the check allows computation to proceed.
      if boxes.get_shape().rank is not None and boxes.get_shape().rank > 2:
        raise ValueError("'pad_to_max_output_size' (value {}) must be True for "
                         'batched input'.format(pad_to_max_output_size))
    if name is None:
      name = ''
    # idx, num_valid = non_max_suppression_padded_v2(
    #     boxes, scores, max_output_size, iou_threshold, score_threshold,
    #     sorted_input, canonicalized_coordinates, tile_size)

    idx, num_valid = non_max_suppression_padded_v1(
      boxes, scores, max_output_size, iou_threshold, score_threshold,
      pad_to_max_output_size, name)

    # def_function.function seems to lose shape information, so set it here.
    if not pad_to_max_output_size:
      idx = idx[0, :num_valid]
    else:
      batch_dims = array_ops.concat([
          array_ops.shape(boxes)[:-2],
          array_ops.expand_dims(max_output_size, 0)
      ], 0)
      idx = array_ops.reshape(idx, batch_dims)
    return idx, num_valid

# TODO(b/158709815): Improve performance regression due to
# def_function.function.
# @def_function.function(
#     experimental_implements='non_max_suppression_padded_v2')
def non_max_suppression_padded_v2(boxes,
                                  scores,
                                  max_output_size,
                                  iou_threshold=0.5,
                                  score_threshold=float('-inf'),
                                  sorted_input=False,
                                  canonicalized_coordinates=False,
                                  tile_size=512):

    with ops.name_scope('sort_scores_and_boxes'):
      sorted_scores_indices = sort_ops.argsort(
          scores, axis=1, direction='DESCENDING')
      sorted_scores = array_ops.gather(
          scores, sorted_scores_indices, axis=1, batch_dims=1
      )
      sorted_boxes = array_ops.gather(
          boxes, sorted_scores_indices, axis=1, batch_dims=1
      )
    return sorted_scores, sorted_boxes, sorted_scores_indices

  batch_dims = array_ops.shape(boxes)[:-2]
  num_boxes = array_ops.shape(boxes)[-2]
  boxes = array_ops.reshape(boxes, [-1, num_boxes, 4])
  scores = array_ops.reshape(scores, [-1, num_boxes])
  batch_size = array_ops.shape(boxes)[0]
  if score_threshold != float('-inf'):
    with ops.name_scope('filter_by_score'):
      score_mask = math_ops.cast(scores > score_threshold, scores.dtype)
      scores *= score_mask
      box_mask = array_ops.expand_dims(
          math_ops.cast(score_mask, boxes.dtype), 2)
      boxes *= box_mask
  if not canonicalized_coordinates:
    with ops.name_scope('canonicalize_coordinates'):
      y_1, x_1, y_2, x_2 = array_ops.split(
          value=boxes, num_or_size_splits=4, axis=2)
      y_1_is_min = math_ops.reduce_all(
          math_ops.less_equal(y_1[0, 0, 0], y_2[0, 0, 0]))
      y_min, y_max = tf_cond.cond(
          y_1_is_min, lambda: (y_1, y_2), lambda: (y_2, y_1))
      x_1_is_min = math_ops.reduce_all(
          math_ops.less_equal(x_1[0, 0, 0], x_2[0, 0, 0]))
      x_min, x_max = tf_cond.cond(
          x_1_is_min, lambda: (x_1, x_2), lambda: (x_2, x_1))
      boxes = array_ops.concat([y_min, x_min, y_max, x_max], axis=2)
  # TODO(@bhack): https://github.com/tensorflow/tensorflow/issues/56089
  # this will be required after deprecation
  #else:
  #  y_1, x_1, y_2, x_2 = array_ops.split(
  #      value=boxes, num_or_size_splits=4, axis=2)

  if not sorted_input:
    scores, boxes, sorted_indices = _sort_scores_and_boxes(scores, boxes)
  else:
    # Default value required for Autograph.
    sorted_indices = array_ops.zeros_like(scores, dtype=dtypes.int32)

  pad = math_ops.cast(
      math_ops.ceil(
          math_ops.cast(
              math_ops.maximum(num_boxes, max_output_size), dtypes.float32) /
          math_ops.cast(tile_size, dtypes.float32)),
      dtypes.int32) * tile_size - num_boxes
  boxes = array_ops.pad(
      math_ops.cast(boxes, dtypes.float32), [[0, 0], [0, pad], [0, 0]])
  scores = array_ops.pad(
      math_ops.cast(scores, dtypes.float32), [[0, 0], [0, pad]])
  num_boxes_after_padding = num_boxes + pad
  num_iterations = num_boxes_after_padding // tile_size
  def _loop_cond(unused_boxes, unused_threshold, output_size, idx):
    return math_ops.logical_and(
        math_ops.reduce_min(output_size) < max_output_size,
        idx < num_iterations)

  def suppression_loop_body(boxes, iou_threshold, output_size, idx):
    return _suppression_loop_body(
        boxes, iou_threshold, output_size, idx, tile_size)

  selected_boxes, _, output_size, _ = while_loop.while_loop(
      _loop_cond,
      suppression_loop_body,
      [
          boxes, iou_threshold,
          array_ops.zeros([batch_size], dtypes.int32),
          constant_op.constant(0)
      ],
      shape_invariants=[
          tensor_shape.TensorShape([None, None, 4]),
          tensor_shape.TensorShape([]),
          tensor_shape.TensorShape([None]),
          tensor_shape.TensorShape([]),
      ],
  )

  num_valid = math_ops.minimum(output_size, max_output_size)
  idx = num_boxes_after_padding - math_ops.cast(
      nn_ops.top_k(
          math_ops.cast(math_ops.reduce_any(
              selected_boxes > 0, [2]), dtypes.int32) *
          array_ops.expand_dims(
              math_ops.range(num_boxes_after_padding, 0, -1), 0),
          max_output_size)[0], dtypes.int32)
  idx = math_ops.minimum(idx, num_boxes - 1)

  if not sorted_input:
    index_offsets = math_ops.range(batch_size) * num_boxes
    gather_idx = array_ops.reshape(
        idx + array_ops.expand_dims(index_offsets, 1), [-1])
    idx = array_ops.reshape(
        array_ops.gather(array_ops.reshape(sorted_indices, [-1]),
                         gather_idx),
        [batch_size, -1])
  invalid_index = array_ops.zeros([batch_size, max_output_size],
                                  dtype=dtypes.int32)
  idx_index = array_ops.expand_dims(math_ops.range(max_output_size), 0)
  num_valid_expanded = array_ops.expand_dims(num_valid, 1)
  idx = array_ops.where(idx_index < num_valid_expanded,
                        idx, invalid_index)

  num_valid = array_ops.reshape(num_valid, batch_dims)
  return idx, num_valid

Note that non-max suppression is done totally manually. Interestingly enough, non_max_suppression_v1 did refer to a C++ implementation, only v2 does it in Python. And here is tf.image.combined_non_max_suppression():

@tf_export('image.combined_non_max_suppression')
@dispatch.add_dispatch_support
def combined_non_max_suppression(boxes,
                                 scores,
                                 max_output_size_per_class,
                                 max_total_size,
                                 iou_threshold=0.5,
                                 score_threshold=float('-inf'),
                                 pad_per_class=False,
                                 clip_boxes=True,
                                 name=None):
  with ops.name_scope(name, 'combined_non_max_suppression'):
    iou_threshold = ops.convert_to_tensor(
        iou_threshold, dtype=dtypes.float32, name='iou_threshold')
    score_threshold = ops.convert_to_tensor(
        score_threshold, dtype=dtypes.float32, name='score_threshold')

    # Convert `max_total_size` to tensor *without* setting the `dtype` param.
    # This allows us to catch `int32` overflow case with `max_total_size`
    # whose expected dtype is `int32` by the op registration. Any number within
    # `int32` will get converted to `int32` tensor. Anything larger will get
    # converted to `int64`. Passing in `int64` for `max_total_size` to the op
    # will throw dtype mismatch exception.
    # TODO(b/173251596): Once there is a more general solution to warn against
    # int overflow conversions, revisit this check.
    max_total_size = ops.convert_to_tensor(max_total_size)

    return gen_image_ops.combined_non_max_suppression(
        boxes, scores, max_output_size_per_class, max_total_size, iou_threshold,
        score_threshold, pad_per_class, clip_boxes)

Anyone know why regular non max suppression is being done in python? This issue seems to imply that the solution will have to come from updating tensorflow, not keras-cv, let me know if I am wrong on this.

And that is still the issue. :)

keras-team / keras-cv

RetinaNet predict() is too slow #2014