kaist-avelab / K-Radar

4D Radar Object Detection for Autonomous Driving in Various Weather Conditions

Evaluation code #28

Open nightrome opened 8 months ago

nightrome commented 8 months ago

Hi. Thank you for the nice dataset. I am trying to find out how your evaluation code works in detail.

To evaluate on the test set, we have N=17536 samples/frames and C=7 conditions. So my assumption would be that the total AP is either an average performance over samples or over conditions. However, when running your evaluation code on our results, neither seems to be the case.
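
For reference, this is the kind of comparison I am making. The snippet below is only an illustration of the two aggregations; the per-condition AP values and frame counts are placeholders, not our actual results:

```python
# Illustration of the two candidate aggregations for the "Total" AP.
# Condition names follow K-Radar; AP values and frame counts are placeholders.
ap_per_condition = {
    "normal": 50.0, "overcast": 55.0, "fog": 52.0, "rain": 42.0,
    "sleet": 41.0, "lightsnow": 50.0, "heavysnow": 44.0,
}
frames_per_condition = {
    "normal": 8000, "overcast": 700, "fog": 2000, "rain": 2500,
    "sleet": 1800, "lightsnow": 1300, "heavysnow": 1200,
}

# Average over conditions (each condition weighted equally)
avg_over_conditions = sum(ap_per_condition.values()) / len(ap_per_condition)

# Average over samples/frames (each condition weighted by its frame count)
total_frames = sum(frames_per_condition.values())
avg_over_samples = sum(
    ap_per_condition[c] * frames_per_condition[c] for c in ap_per_condition
) / total_frames

print(f"average over conditions: {avg_over_conditions:.3f}")
print(f"average over samples:    {avg_over_samples:.3f}")
```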

[Figure: table comparing the Total AP from the evaluation code with the average over samples and the average over conditions]

Note how the Total (from your evaluation code) differs from both the average over samples and the average over conditions. Furthermore, our method outperforms the other method on only 2 of the 7 conditions (and only marginally), yet still scores a higher Total. How is that possible, if not due to a large imbalance in the frequency of the conditions? We already ruled that out by computing the average over samples.

I also validated this against some of the results in https://arxiv.org/pdf/2303.06342.pdf, e.g. the strongest row in Table 2. The Total result there is neither the average over samples nor the average over conditions.

Any feedback would be welcome.

DongHeePaek commented 8 months ago

Hi, @nightrome

Thank you for taking an interest in our K-Radar dataset and incorporating it into your research. It's gratifying to see our work being utilized in meaningful ways, especially in projects as significant as yours.

I'd like to clarify that the differences in results you're observing are likely due to discrepancies in the training and evaluation settings.

Our evaluation code for the K-Radar dataset primarily adapts the official KITTI evaluation code by traveller59, with modifications to suit the evaluation of multimodal data, including K-Radar.

Key adjustments include:

  1. Coordinate System Adaptation: To accommodate the different coordinate systems of K-Radar and KITTI, we have implemented format conversion code. A related concern was raised and the conversion was confirmed to be error-free, as detailed in Issue #23 on our GitHub.

  2. Region of Interest (RoI) Filtering: Our code filters RoIs frame by frame and evaluates only frames that contain detectable objects. Since the 4D radar's lateral field of view is limited to -53 to +53 degrees (with 0 degrees pointing straight ahead), objects outside this range are excluded from evaluation. This decision, reflected in our code (e.g., line 363 of kradar_detection_v2_1.py), ensures fairness by not penalizing the radar for missing objects beyond its physical measurement range. Frames without any detectable objects are therefore omitted to keep the comparison across modalities equitable; a small sketch of this kind of filtering is shown below.

This approach acknowledges that physical constraints (like RoI) can affect the number of frames suitable for evaluation. Thus, averaging results across different conditions without accounting for these factors may not provide a fair assessment.
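
For illustration, here is a minimal sketch of this kind of azimuth-based RoI filtering. It is not the actual implementation in kradar_detection_v2_1.py; the array layout and function names are assumptions made for the example:

```python
import numpy as np

def filter_frame_by_azimuth(gt_boxes, fov_deg=53.0):
    """
    Minimal sketch of azimuth-based RoI filtering (assumed layout:
    gt_boxes is an (N, 3+) array with x = forward, y = left in the sensor frame).
    Keeps only boxes whose lateral angle from the forward axis is within +/- fov_deg.
    """
    azimuth = np.degrees(np.arctan2(gt_boxes[:, 1], gt_boxes[:, 0]))
    keep = np.abs(azimuth) <= fov_deg
    return gt_boxes[keep]

def frames_for_evaluation(frames):
    """Drop frames that have no ground-truth objects left after RoI filtering."""
    kept = []
    for gt_boxes in frames:
        filtered = filter_frame_by_azimuth(gt_boxes)
        if len(filtered) > 0:
            kept.append(filtered)
    return kept
```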

We encourage experimentation with the publicly released model under a unified specification for all experiments and will gladly assist with debugging if you share your training specifications with us.

Thank you.

ffent commented 1 day ago

Hi @DongHeePaek ,

after a careful assessment, we believe the K-Radar evaluation script is subject to two errors that were originally identified by Zhang et al. [1] and that also affect frameworks like mmdetection3d. These errors are the mixed use of 11- and 41-point interpolation schemes and the so-called 'average precision distortion' phenomenon. Because of these errors, it is not possible to reconstruct the 'all' mAP value from the individual mAP values per weather category.

The affected lines of code can be found here:

- https://github.com/kaist-avelab/K-Radar/blob/main/utils/kitti_eval/eval.py#L15
- https://github.com/kaist-avelab/K-Radar/blob/main/utils/kitti_eval/eval.py#L22
- https://github.com/kaist-avelab/K-Radar/blob/main/utils/kitti_eval/eval.py#L522
- https://github.com/kaist-avelab/K-Radar/blob/main/utils/kitti_eval/eval.py#L606
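
To illustrate why mixing sample counts matters, here is a minimal sketch of N-point interpolated AP (a generic implementation, not the code linked above). Evaluating the same precision-recall curve with different numbers of sample points yields different AP values, so mixing 11- and 41-point schemes makes the resulting numbers incomparable; the PR curve in the example is made up for illustration:

```python
import numpy as np

def interpolated_ap(recalls, precisions, num_points):
    """
    N-point interpolated average precision: sample recall at evenly spaced
    levels and average the maximum precision reachable at or beyond each level.
    """
    ap = 0.0
    for r in np.linspace(0.0, 1.0, num_points):
        mask = recalls >= r
        p = precisions[mask].max() if mask.any() else 0.0
        ap += p / num_points
    return ap

# Example PR curve (illustrative values only)
recalls = np.linspace(0.0, 1.0, 200)
precisions = np.clip(1.0 - 0.6 * recalls**2, 0.0, 1.0)

for n in (11, 41, 101, 1001):
    print(f"{n:>5}-point AP: {interpolated_ap(recalls, precisions, n):.4f}")
```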

If these errors were resolved, the 'all' mAP value could be calculated as the weighted average of the individual per-category mAP values, weighted by the number of valid ground-truth bounding boxes per weather category. To verify this assumption, we used the NeurIPS 2022 results from the provided RTNH logs together with the official evaluation script. First, we reproduced the RTNH results from the K-Radar paper with the mixed use of an 11-point interpolation scheme and 41 sample points. We then ran the evaluation again with a uniform 101-point interpolation. Finally, we also ran an evaluation with a 1001-point interpolation to investigate the effect of a more fine-grained interpolation.

Reproduced results with 11- and 41-point interpolation:

|           | Samples | Valid GT Boxes | Valid DT Boxes | mAP    |
|-----------|---------|----------------|----------------|--------|
| all       | 9951    | 18232          | 66843          | 47.440 |
| normal    | 4266    | 8760           | 29489          | 49.943 |
| overcast  | 383     | 914            | 2365           | 56.673 |
| fog       | 1046    | 1380           | 6083           | 52.806 |
| rain      | 1309    | 3635           | 9944           | 41.981 |
| sleet     | 1088    | 1229           | 7144           | 41.452 |
| lightsnow | 776     | 1023           | 5274           | 50.569 |
| heavysnow | 1083    | 1291           | 6544           | 44.510 |
| sum       | 9951    | 18232          | 66843          |        |

```
GT Boxes weighted mAP: 47.988 (delta +0.547)
```
Results with 101-point interpolation:

|           | Samples | Valid GT Boxes | Valid DT Boxes | mAP    |
|-----------|---------|----------------|----------------|--------|
| all       | 9951    | 18232          | 66843          | 48.015 |
| normal    | 4266    | 8760           | 29489          | 51.884 |
| overcast  | 383     | 914            | 2365           | 58.089 |
| fog       | 1046    | 1380           | 6083           | 53.229 |
| rain      | 1309    | 3635           | 9944           | 40.212 |
| sleet     | 1088    | 1229           | 7144           | 40.073 |
| lightsnow | 776     | 1023           | 5274           | 49.657 |
| heavysnow | 1083    | 1291           | 6544           | 44.216 |
| sum       | 9951    | 18232          | 66843          |        |

```
GT Boxes weighted mAP: 48.506 (delta +0.490)
```
Results with 1001-point interpolation:

|           | Samples | Valid GT Boxes | Valid DT Boxes | mAP    |
|-----------|---------|----------------|----------------|--------|
| all       | 9951    | 18232          | 66843          | 47.879 |
| normal    | 4266    | 8760           | 29489          | 51.806 |
| overcast  | 383     | 914            | 2365           | 52.978 |
| fog       | 1046    | 1380           | 6083           | 53.177 |
| rain      | 1309    | 3635           | 9944           | 39.935 |
| sleet     | 1088    | 1229           | 7144           | 39.986 |
| lightsnow | 776     | 1023           | 5274           | 49.422 |
| heavysnow | 1083    | 1291           | 6544           | 44.247 |
| sum       | 9951    | 18232          | 66843          |        |

```
GT Boxes weighted mAP: 48.136 (delta +0.256)
```
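
As a sanity check, the GT-box-weighted value can be recomputed directly from the per-condition numbers in the first table above (a minimal sketch using the reproduced 11- and 41-point results):

```python
# Recompute the GT-box-weighted mAP from the reproduced 11/41-point table.
gt_boxes = {
    "normal": 8760, "overcast": 914, "fog": 1380, "rain": 3635,
    "sleet": 1229, "lightsnow": 1023, "heavysnow": 1291,
}
map_per_condition = {
    "normal": 49.943, "overcast": 56.673, "fog": 52.806, "rain": 41.981,
    "sleet": 41.452, "lightsnow": 50.569, "heavysnow": 44.510,
}

total_gt = sum(gt_boxes.values())  # 18232
weighted_map = sum(
    map_per_condition[c] * gt_boxes[c] for c in gt_boxes
) / total_gt

# ~47.988, versus the reported 'all' value of 47.440
print(f"GT-box-weighted mAP: {weighted_map:.3f}")
```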

Our results show that a finer interpolation minimizes the average precision distortion and the deviation in the calculation of the 'all' mAP value. We would therefore recommend fixing the interpolation error in a future release.

References:

[1] Zhang, Haodi, Alexandrina Rogozan, and Abdelaziz Bensrhair. "An enhanced N-point interpolation method to eliminate average precision distortion." Pattern Recognition Letters 158 (2022): 111-116.