nutonomy / nuscenes-devkit

The devkit of the nuScenes dataset.
https://www.nuScenes.org

Calculation of mAP #992

Closed Jayden9912 closed 1 year ago

Jayden9912 commented 1 year ago

Hi.

I noticed that in the COCO evaluation, if two predicted bboxes are associated with the same ground-truth bbox, one is considered a true positive and the other is considered a false positive if it does not match any other GT bbox.

But in the nuScenes evaluation, if two predicted bboxes are associated with the same ground-truth bbox, both will be considered true positives.

Is there a reason it is calculated this way in the nuScenes evaluation?

whyekit-motional commented 1 year ago

@Jayden9912 for nuScenes detection evaluation, I don't think it's the case that "if two predicted bboxes are associated with the same ground-truth bbox, both will be considered true positives"

If you take a look at this line of code: https://github.com/nutonomy/nuscenes-devkit/blob/a27342283ec865f83424e936f2ee09494b591ec4/python-sdk/nuscenes/eval/detection/algo.py#L81

For a given sample, once a predicted box has been matched with a ground-truth (GT) box, the GT box is recorded in taken and will no longer be used to find a match among other predicted boxes
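
Here is a minimal, self-contained sketch of that greedy matching (the variable names and toy boxes are illustrative, not the devkit's exact code): predictions are visited in descending confidence, and once a GT index enters taken it can no longer be matched, so a second prediction on the same GT falls through to a FP:

import numpy as np

def center_distance(gt_box, pred_box):
    # Euclidean distance between box centers in the xy-plane
    return float(np.linalg.norm(np.asarray(pred_box[:2]) - np.asarray(gt_box[:2])))

gt_boxes = [(0.0, 0.0), (10.0, 10.0)]                # GT centers (x, y)
pred_boxes = [(0.1, 0.0), (0.2, 0.0), (10.1, 10.0)]  # two predictions near the first GT
confidences = [0.9, 0.8, 0.7]
dist_th = 1.0

taken = set()  # GT indices that have already been matched
for pred_idx in np.argsort(confidences)[::-1]:  # descending confidence
    min_dist, match_gt_idx = np.inf, None
    for gt_idx, gt_box in enumerate(gt_boxes):
        if gt_idx in taken:
            continue  # this GT was already claimed by a higher-confidence prediction
        d = center_distance(gt_box, pred_boxes[pred_idx])
        if d < min_dist:
            min_dist, match_gt_idx = d, gt_idx
    if min_dist < dist_th:
        taken.add(match_gt_idx)
        print(f"pred {pred_idx}: TP (matched GT {match_gt_idx})")
    else:
        print(f"pred {pred_idx}: FP")  # pred 1 lands here: GT 0 is taken, GT 1 is too far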

Jayden9912 commented 1 year ago

My mistake. Thanks for the help

Jayden9912 commented 1 year ago

hi @whyekit-motional, I am trying to understand the mAP calculation as implemented. Given this scenario (2 predictions, 3 ground-truth boxes):

| prediction bbox (by rank) | conf score | match with GT? | TP (cum.) | FP (cum.) | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- |
| bbox1 | 0.9 | 0 | 0 | 1 | 0 | 0 |
| bbox2 | 0.8 | 1 | 1 | 1 | 0.5 | 0.33 |

The graph shape should look like this: (image of the precision-recall curve)

According to the scikit-learn documentation for precision_recall_curve, they always make sure that when recall is 0, precision is 1: "The last precision and recall values are 1. and 0. respectively and do not have a corresponding threshold. This ensures that the graph starts on the y axis."
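
For example, treating the two detections from my table above as a binary ranking problem (note this ignores the third, unmatched GT box, which scikit-learn has no way of knowing about):

from sklearn.metrics import precision_recall_curve

y_true = [0, 1]      # bbox1 is a FP, bbox2 is a TP
scores = [0.9, 0.8]  # confidence scores
prec, rec, thresholds = precision_recall_curve(y_true, scores)
print(prec, rec)  # prec -> [0.5 0.  1. ], rec -> [1. 0. 0.]; the final (1., 0.) pair is appended by convention

Note that scikit-learn's recall here is computed over the labels it sees (one positive), not over all 3 GT boxes, which is one reason detection mAP can't be computed directly with precision_recall_curve.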

I see that there is some filtering according to min_recall and min_precision here. But it is still quite different from the scikit-learn version.

From my understanding, the nuScenes implementation heavily penalizes the model when its highest-confidence detection doesn't match any GT. Is my understanding correct? Or did I miss anything?

Here is the code:

# writing scripts for mAP evaluation (nuscenes)
from typing import Callable
import numpy as np

MIN_PRECISION = 0.1
MIN_RECALL = 0.1

def center_distance(gt_box, pred_box):
    return np.linalg.norm(np.array(pred_box[:2]) - np.array(gt_box[:2]))

def accumulate(gt_bboxes: np.ndarray, pred_bboxes: np.ndarray, dist_th: float, dist_func: Callable) -> float:
    """
    Accumulate TP/FP statistics and compute AP for a single class.

    gt_bboxes and pred_bboxes are numpy arrays, e.g.
    gt_boxes = [(1, 2, 3, 4, 5, 6), (2, 3, 4, 5, 6, 7), (1, 2, 3, 4, 5, 6)]  # (x, y, z, l, w, h) per GT box

    # should the bbox center be at the base???
    pred_boxes = [(1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 0.9), (2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 0.8)]  # (x, y, z, l, w, h, conf)
    """
    pred_confs = pred_bboxes[:, -1]

    num_gt = len(gt_bboxes)

    # get the sorting index
    sortind = np.argsort(pred_confs)[::-1] # in descending order

    tp = [] # accumulator for true positives
    fp = [] # accumulator for false positives
    conf = [] # accumulator for confidences

    taken = set()
    for ind in sortind:
        pred_box = pred_bboxes[ind]
        this_conf = pred_confs[ind]
        min_dist = np.inf
        match_gt_idx = None

        for gt_idx, gt_box in enumerate(gt_bboxes):

            # Skip GT boxes already matched to a higher-confidence prediction
            if gt_idx in taken:
                continue

            # Find the closest match among the remaining ground truth boxes
            this_distance = dist_func(gt_box, pred_box)
            if this_distance < min_dist:
                min_dist = this_distance
                match_gt_idx = gt_idx

        is_match = min_dist < dist_th

        if is_match:
            taken.add(match_gt_idx)

            # Update tp, fp and confs
            tp.append(1)
            fp.append(0)
            conf.append(this_conf)

        else:
            tp.append(0)
            fp.append(1)
            conf.append(this_conf)

    # No predictions for this class
    if len(tp) == 0:
        return 0.0

    # Accumulate
    tp = np.cumsum(tp).astype(float)
    fp = np.cumsum(fp).astype(float)
    conf = np.array(conf)

    # Calculate precision and recall
    prec = tp / (fp + tp)
    rec = tp / float(num_gt)
    print("prec:", prec)
    print("tp:", tp)
    print("rec:", rec)

    # 101-point interpolation, from 0% to 100% recall
    rec_interp = np.linspace(0, 1, 101)
    prec = np.interp(rec_interp, rec, prec, right=0)
    prec = prec[round(100 * MIN_RECALL) + 1:]  # clip low recalls (+1 excludes the min-recall bin itself)
    prec -= MIN_PRECISION  # clip low precision
    prec[prec < 0] = 0
    ap = float(np.mean(prec)) / (1 - MIN_PRECISION)  # normalize so the maximum achievable AP is 1
    print("AP:", ap)
    return ap

if __name__ == "__main__":
    gt_boxes = np.array([(1, 2, 3, 4, 5, 6), (2, 3, 4, 5, 6, 7), (11, 21, 31, 41, 51, 61)]) #(x,y,z,l,w,h)
    pred_boxes = np.array([(-3, -3, 3.5, 4.5, 5.5, 6.5, 0.9), (2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 0.8)]) #(x,y,z,l,w,h,yaw)
    accumulate(gt_boxes, pred_boxes, 1, center_distance)

whyekit-motional commented 1 year ago

@Jayden9912 in response to your question:

the nuScenes implementation heavily penalizes the model when its highest-confidence detection doesn't match any GT. Is my understanding correct? Or did I miss anything?

mAP itself - not just specifically the implementation of mAP in nuScenes - penalizes the performance of a model a lot when its highest-confidence detection doesn't match any GT (i.e. it is a FP)

And this is rightly so - a model which is extremely confident but produces a FP detection should logically be worse than a model which is slightly less confident but produces a TP detection
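
As a quick illustration (a toy sketch using the same 101-point interpolation as your script above, not the devkit's exact code), swapping which of the two detections is ranked first changes the AP substantially:

import numpy as np

def toy_ap(tp_flags, num_gt, min_recall=0.1, min_precision=0.1):
    # tp_flags: 1/0 per prediction, already sorted by descending confidence
    tp = np.cumsum(tp_flags).astype(float)
    fp = np.cumsum(1 - np.asarray(tp_flags)).astype(float)
    prec, rec = tp / (tp + fp), tp / float(num_gt)
    prec = np.interp(np.linspace(0, 1, 101), rec, prec, right=0)
    prec = prec[round(100 * min_recall) + 1:]  # clip low recalls
    prec = np.clip(prec - min_precision, 0, None)  # clip low precision
    return float(np.mean(prec)) / (1.0 - min_precision)

# Highest-confidence detection is a FP: precision never exceeds 0.5
print(toy_ap([0, 1], num_gt=3))
# Same detections, but the TP is ranked first: precision reaches 1.0 at low recall
print(toy_ap([1, 0], num_gt=3))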

Jayden9912 commented 1 year ago

I see. Will study more.

Thanks again for the clarification!