
Large Scale Non Maximum Suppression. This project implements an O(n log(n)) approach to non-maximum suppression, useful for object detection run on very large images such as satellite or histology images, where the number of instances to prune becomes very large.

LSNMS

Speeding up Non Maximum Suppression on very large images by several folds, with multiclass support, using a sparse implementation of NMS. This project becomes useful for very high dimensional image data, when the number of predicted instances to prune becomes considerable (> 10,000 objects).

Run times (on a virtual image of 10kx10k pixels)

Installation

This project is fully installable with pip:

pip install lsnms --upgrade

or by cloning this repo and installing with poetry:

git clone https://github.com/remydubois/lsnms
cd lsnms/
poetry install

The only dependencies are numpy and numba.

Example usage:

import numpy as np
from lsnms import nms, wbc

# Create boxes: approx 30 pixels wide / high in Pascal VOC format:
# bbox = (x0, y0, x1, y1) with x1 > x0, y1 > y0
image_size = 10_000
n_predictions = 10_000
topleft = np.random.uniform(0.0, high=image_size, size=(n_predictions, 2))
wh = np.random.uniform(15, 45, size=topleft.shape)
boxes = np.concatenate([topleft, topleft + wh], axis=1).astype(np.float64)
# Create scores
scores = np.random.uniform(0., 1., size=(len(boxes), ))

# Create class_ids if multiclass, 3 classes here
class_ids = np.random.randint(0, 3, size=(len(boxes), ))

# Apply NMS
# During the process, overlapping boxes are queried using an R-Tree, ensuring a log-time search
keep = nms(boxes, scores, iou_threshold=0.5)
# Or, in a multiclass setting
keep = nms(boxes, scores, iou_threshold=0.5, class_ids=class_ids)
boxes = boxes[keep]
scores = scores[keep]

# Apply WBC
pooled_boxes, pooled_scores, cluster_indices = wbc(boxes, scores, iou_threshold=0.5)

Description

Non Maximum Suppression

A nice introduction to the non maximum suppression algorithm can be found here: https://www.coursera.org/lecture/convolutional-neural-networks/non-max-suppression-dvrjH. Basically, NMS discards redundant boxes in a set of predicted instances: among boxes that overlap beyond a given IoU threshold, only the highest scoring one is kept. It is an essential - and often unavoidable - step of object detection pipelines.
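
To make the greedy procedure concrete, here is a minimal numpy sketch of naive NMS (for illustration only; this repo's implementation is jitted and tree-based):

import numpy as np

def naive_nms(boxes, scores, iou_threshold=0.5):
    # Greedy O(n^2) NMS on (x0, y0, x1, y1) boxes: keep the highest scoring
    # box, suppress everything overlapping it beyond the threshold, repeat.
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]  # visit boxes by decreasing score
    suppressed = np.zeros(len(boxes), dtype=bool)
    keep = []
    for i in order:
        if suppressed[i]:
            continue
        keep.append(i)
        # IoU of box i against all boxes: this scan is the quadratic bottleneck
        x0 = np.maximum(boxes[i, 0], boxes[:, 0])
        y0 = np.maximum(boxes[i, 1], boxes[:, 1])
        x1 = np.minimum(boxes[i, 2], boxes[:, 2])
        y1 = np.minimum(boxes[i, 3], boxes[:, 3])
        inter = np.clip(x1 - x0, 0, None) * np.clip(y1 - y0, 0, None)
        iou = inter / (areas[i] + areas - inter)
        suppressed |= iou > iou_threshold
    return np.array(keep)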

Scaling up the Non Maximum Suppression process

Complexity

Note that the timings reported below are all-inclusive: they notably include the tree building time, since the comparison would otherwise not be fair.

Run times (on a virtual image of 10kx10k pixels)

For the sake of speed, this repo is entirely (including the binary tree) built using Numba's just-in-time compilation.

Concrete example: some tests were run on ~40k x 40k pixel images, with detection inference run on 512 x 512 overlapping patches (256-strided). Approximately 300,000 bounding boxes remained after patchwise NMS. Naive NMS ran in approximately 5 minutes on a modern CPU, while this implementation ran in 5 seconds, offering a close to 60-fold speed-up.
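
Schematically, the speed-up comes from replacing the all-pairs IoU scan with a rectangle-intersection query on the tree. In the sketch below, tree.intersect is a hypothetical method name (not this repo's public API) standing in for the internal R-Tree query, which returns only the indices of boxes whose rectangles overlap the query box:

import numpy as np

def sparse_nms(boxes, scores, tree, iou_threshold=0.5):
    # Same greedy loop as naive NMS, but each step only inspects the few
    # candidates intersecting box i, returned in ~log time by the tree.
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]
    suppressed = np.zeros(len(boxes), dtype=bool)
    keep = []
    for i in order:
        if suppressed[i]:
            continue
        keep.append(i)
        for j in tree.intersect(boxes[i]):  # hypothetical query method
            x0 = max(boxes[i, 0], boxes[j, 0])
            y0 = max(boxes[i, 1], boxes[j, 1])
            x1 = min(boxes[i, 2], boxes[j, 2])
            y1 = min(boxes[i, 3], boxes[j, 3])
            inter = max(x1 - x0, 0.0) * max(y1 - y0, 0.0)
            if inter / (areas[i] + areas[j] - inter) > iou_threshold:
                suppressed[j] = True
        suppressed[i] = False  # the box just kept must not suppress itself
    return np.array(keep)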

Going further: weighted box clustering

For the sake of completeness, this repo also implements a variant of the Weighted Box Clustering algorithm (from https://arxiv.org/pdf/1811.08661.pdf). Since NMS can artificially push up confidence scores (by selecting only the highest scoring box per instance), WBC overcomes this by averaging the box coordinates and scores of all the overlapping boxes (instead of discarding all the non-maximally scored overlapping boxes).
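
As a rough illustration of the pooling idea (the paper's exact formulation also weights by overlap and patch-center proximity, and this repo's version is tree-accelerated rather than quadratic):

import numpy as np

def naive_wbc(boxes, scores, iou_threshold=0.5):
    # Each unassigned box, visited by decreasing score, seeds a cluster of
    # its overlapping neighbours; coordinates are score-weight-averaged.
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(scores)[::-1]
    cluster_of = np.full(len(boxes), -1)  # cluster index of each input box
    pooled_boxes, pooled_scores = [], []
    for i in order:
        if cluster_of[i] >= 0:
            continue
        x0 = np.maximum(boxes[i, 0], boxes[:, 0])
        y0 = np.maximum(boxes[i, 1], boxes[:, 1])
        x1 = np.minimum(boxes[i, 2], boxes[:, 2])
        y1 = np.minimum(boxes[i, 3], boxes[:, 3])
        inter = np.clip(x1 - x0, 0, None) * np.clip(y1 - y0, 0, None)
        iou = inter / (areas[i] + areas - inter)
        members = np.flatnonzero((iou > iou_threshold) & (cluster_of < 0))
        cluster_of[members] = len(pooled_boxes)
        weights = scores[members]
        pooled_boxes.append((boxes[members] * weights[:, None]).sum(0) / weights.sum())
        pooled_scores.append(scores[members].mean())
    return np.array(pooled_boxes), np.array(pooled_scores), cluster_of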

Disclaimer:

  1. The tree implementation could probably be further optimized, see the implementation notes below.
  2. A much simpler implementation could rely on existing KD-Tree implementations (such as sklearn's): query the tree before NMS, and tweak the NMS process to accept the tree query's result. This repo implements it from scratch in full numba for the sake of completeness and elegance.
  3. The main parameter deciding the speed-up brought by this method is (along with the number of instances) the density of boxes over the image: in other words, the number of overlapping boxes trimmed at each step of the NMS process. The lower the density of boxes, the higher the speed-up factor.
  4. Due to numba's compiling process, the first call to each jitted function might lag a bit; second and subsequent calls (per python session) do not suffer this overhead.

-> Will I benefit from this sparse NMS implementation? As said above, the main parameter guiding the speed-up over naive NMS is instance (or box) density, rather than image size or the number of instances.
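
A quick way to check on your own data regime is to keep the number of boxes fixed and vary their density through the image extent, timing the actual nms call (sizes below are arbitrary):

import time
import numpy as np
from lsnms import nms

n_predictions = 10_000
for image_size in (2_000, 10_000, 50_000):  # smaller extent = denser boxes
    topleft = np.random.uniform(0.0, high=image_size, size=(n_predictions, 2))
    wh = np.random.uniform(15, 45, size=topleft.shape)
    boxes = np.concatenate([topleft, topleft + wh], axis=1).astype(np.float64)
    scores = np.random.uniform(0.0, 1.0, size=(n_predictions,))
    nms(boxes, scores, iou_threshold=0.5)  # warm-up call pays the jit cost
    start = time.perf_counter()
    nms(boxes, scores, iou_threshold=0.5)
    print(image_size, time.perf_counter() - start)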


Implementation notes

Tree implementation

Due to the Numba compiler's limitations, the tree implementation has some specificities: because jitclass methods cannot be recursive, the tree building process (node splitting + children instantiation) cannot be done entirely inside the Node.__init__ method.
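
A common workaround, sketched below in plain Python with hypothetical names (the actual jitted code differs), is to move the recursion out of the constructor into an external build function driven by an explicit stack:

import numpy as np

class Node:
    # minimal stand-in for the jitted node: a set of points plus children
    def __init__(self, indices):
        self.indices = indices
        self.left = None
        self.right = None

def split_widest(data, indices):
    # split the node's points at the median of their widest dimension
    spread = data[indices].max(axis=0) - data[indices].min(axis=0)
    dim = int(np.argmax(spread))
    order = indices[np.argsort(data[indices, dim])]
    mid = len(order) // 2
    return order[:mid], order[mid:]

def build_tree(data, leaf_size=16):
    # no method ever calls itself: the explicit stack replaces the recursion
    root = Node(np.arange(len(data)))
    stack = [root]
    while stack:
        node = stack.pop()
        if len(node.indices) <= leaf_size:
            continue  # small enough: this node stays a leaf
        left_idx, right_idx = split_widest(data, node.indices)
        node.left, node.right = Node(left_idx), Node(right_idx)
        stack.extend((node.left, node.right))
    return root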

Multiclass support

A classical trick to support several classes (used for instance in torchvision's batched_nms) is to offset all boxes diagonally, by class id times the image extent, so that boxes of different classes can never overlap. Note that doing so here would hurt performances, because the underlying RTree built on top of these diagonally spread boxes would be suboptimal: many regions would actually be empty (because the RTree builds rectangular regions) and the query time would be impacted.

Instead, the boxes are offset here to form a "mosaic" of class-wise regions, see the figure and the sketch below.

Illustration of the bbox offsetting process for multiclass support
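
A hypothetical sketch of this offsetting (names are illustrative; the library applies the trick internally when class_ids is passed to nms): each class's boxes are shifted into their own tile of a near-square grid, so that a single class-agnostic NMS can never suppress across classes.

import numpy as np

def offset_boxes(boxes, class_ids):
    # tile class-wise regions on a near-square grid (the "mosaic")
    extent = boxes.max() + 1.0  # spatial extent of the box set, with margin
    n_classes = int(class_ids.max()) + 1
    side = int(np.ceil(np.sqrt(n_classes)))  # grid side length, in tiles
    dx = (class_ids % side) * extent   # column of each class's tile
    dy = (class_ids // side) * extent  # row of each class's tile
    return boxes + np.stack([dx, dy, dx, dy], axis=1)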

Performances

The RTree implemented in this repo was timed against scikit-learn's neighbors trees. Note that a direct comparison is not entirely fair, since sklearn's implementation allows nodes to contain between leaf_size and 2 * leaf_size data points. To account for this, I timed my implementation against a sklearn tree built with int(0.67 * leaf_size) as leaf_size.
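
For reference, the sklearn side of this protocol looks like the sketch below (the repo's tree would be built on the same data and timed identically; sizes are arbitrary):

import time
import numpy as np
from sklearn.neighbors import KDTree

data = np.random.uniform(0, 1_000, size=(100_000, 2))
leaf_size = 128

start = time.perf_counter()
# sklearn nodes hold between leaf_size and 2 * leaf_size points,
# hence the 0.67 factor to emulate an exact-leaf_size tree
KDTree(data, leaf_size=int(0.67 * leaf_size))
print("sklearn build time:", time.perf_counter() - start)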

Tree building time

Tree building times comparison

Tree query time

Tree query times comparison (single query) in a 1000x1000 space