
Discrepancies between Command Line Tool and TensorFlow Op for 3D Detection Evaluations #360

Closed kts707 closed 3 years ago

kts707 commented 3 years ago

Hi all,

Thanks for this great dataset!

I am running 3D detection evaluations on the validation set, and I am seeing consistent discrepancies across all my models between the results from the command line tool and the results from the TensorFlow metrics op, as shown below.

For example, this is the result from the command line tool:

OBJECT_TYPE_TYPE_VEHICLE_LEVEL_1: [mAP 0.732837] [mAPH 0.726763]
OBJECT_TYPE_TYPE_VEHICLE_LEVEL_2: [mAP 0.645357] [mAPH 0.639864]
OBJECT_TYPE_TYPE_PEDESTRIAN_LEVEL_1: [mAP 0.632034] [mAPH 0.544628]
OBJECT_TYPE_TYPE_PEDESTRIAN_LEVEL_2: [mAP 0.544763] [mAPH 0.46832]
OBJECT_TYPE_TYPE_SIGN_LEVEL_1: [mAP 0] [mAPH 0]
OBJECT_TYPE_TYPE_SIGN_LEVEL_2: [mAP 0] [mAPH 0]
OBJECT_TYPE_TYPE_CYCLIST_LEVEL_1: [mAP 0.645921] [mAPH 0.631791]
OBJECT_TYPE_TYPE_CYCLIST_LEVEL_2: [mAP 0.621488] [mAPH 0.607886]

And this is the result from the TensorFlow op for the same model:

OBJECT_TYPE_TYPE_VEHICLE_LEVEL_1/AP: 0.7450
OBJECT_TYPE_TYPE_VEHICLE_LEVEL_1/APH: 0.7385
OBJECT_TYPE_TYPE_VEHICLE_LEVEL_2/AP: 0.6542
OBJECT_TYPE_TYPE_VEHICLE_LEVEL_2/APH: 0.6484
OBJECT_TYPE_TYPE_PEDESTRIAN_LEVEL_1/AP: 0.6364
OBJECT_TYPE_TYPE_PEDESTRIAN_LEVEL_1/APH: 0.5480
OBJECT_TYPE_TYPE_PEDESTRIAN_LEVEL_2/AP: 0.5471
OBJECT_TYPE_TYPE_PEDESTRIAN_LEVEL_2/APH: 0.4701
OBJECT_TYPE_TYPE_SIGN_LEVEL_1/AP: 0.0000
OBJECT_TYPE_TYPE_SIGN_LEVEL_1/APH: 0.0000
OBJECT_TYPE_TYPE_SIGN_LEVEL_2/AP: 0.0000
OBJECT_TYPE_TYPE_SIGN_LEVEL_2/APH: 0.0000
OBJECT_TYPE_TYPE_CYCLIST_LEVEL_1/AP: 0.6481
OBJECT_TYPE_TYPE_CYCLIST_LEVEL_1/APH: 0.6339
OBJECT_TYPE_TYPE_CYCLIST_LEVEL_2/AP: 0.6216
OBJECT_TYPE_TYPE_CYCLIST_LEVEL_2/APH: 0.6080

The gaps are large for vehicles and near-range objects, and I am wondering where these consistent discrepancies come from. Is there any difference between the two tools?
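
For reference, here is a minimal sketch of how I invoke the op (not my exact code: the config is abbreviated to a placeholder OBJECT_TYPE breakdown, the tensors are toy single-box stand-ins, and the wrapper signature and dtypes may differ slightly across package versions):

import tensorflow as tf
from google.protobuf import text_format
from waymo_open_dataset.metrics.ops import py_metrics_ops
from waymo_open_dataset.protos import metrics_pb2

# Abbreviated evaluation config; the real run uses the standard
# OBJECT_TYPE breakdown configuration.
config = text_format.Parse(
    """
    breakdown_generator_ids: OBJECT_TYPE
    difficulties { levels: LEVEL_1 levels: LEVEL_2 }
    matcher_type: TYPE_HUNGARIAN
    iou_thresholds: 0.0  # unknown
    iou_thresholds: 0.7  # vehicle
    iou_thresholds: 0.5  # pedestrian
    iou_thresholds: 0.5  # sign
    iou_thresholds: 0.5  # cyclist
    box_type: TYPE_3D
    """, metrics_pb2.Config())

# Toy stand-ins for real predictions and ground truth. Boxes are
# [center_x, center_y, center_z, length, width, height, heading].
pd_bbox = tf.constant([[10.0, 0.0, 1.0, 4.5, 2.0, 1.6, 0.0]], tf.float32)
pd_type = tf.constant([1], tf.uint8)         # 1 = TYPE_VEHICLE
pd_score = tf.constant([0.9], tf.float32)
pd_frame_id = tf.constant([0], tf.int64)
pd_overlap_nlz = tf.constant([False], tf.bool)
gt_bbox = tf.constant([[10.2, 0.1, 1.0, 4.4, 2.1, 1.6, 0.0]], tf.float32)
gt_type = tf.constant([1], tf.uint8)
gt_frame_id = tf.constant([0], tf.int64)
gt_difficulty = tf.constant([1], tf.uint8)   # LEVEL_1

# Returns the AP/APH (and supporting precision/recall) tensors per breakdown.
metrics = py_metrics_ops.detection_metrics(
    prediction_bbox=pd_bbox,
    prediction_type=pd_type,
    prediction_score=pd_score,
    prediction_frame_id=pd_frame_id,
    prediction_overlap_nlz=pd_overlap_nlz,
    ground_truth_bbox=gt_bbox,
    ground_truth_type=gt_type,
    ground_truth_frame_id=gt_frame_id,
    ground_truth_difficulty=gt_difficulty,
    config=config.SerializeToString())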

Thanks a lot!

peisun1115 commented 3 years ago

You can set desired_recall_delta to 0.0001 when calling the TF op. I will rebuild the pip package to fix this.
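
Concretely, assuming you build the config as a metrics_pb2.Config proto (the field name follows the metrics config proto; a minimal sketch, not a full config):

from waymo_open_dataset.protos import metrics_pb2

config = metrics_pb2.Config()
# ... populate breakdowns, IoU thresholds, matcher and box type as usual ...

# A smaller recall delta samples the precision/recall curve more finely
# when the score cutoffs are derived automatically.
config.desired_recall_delta = 0.0001

If you assemble the config from a text proto instead, the equivalent line is desired_recall_delta: 0.0001.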

Alternatively, you can submit to the leaderboard (validation set) to get accurate numbers.

kts707 commented 3 years ago

@peisun1115 Thanks for the quick response!

I also set desired_recall_delta to 0.0001 and re-built the command line tool. With desired_recall_delta = 0.0001 for both the TensorFlow op and the command line tool, the results now match. 👍

However, the results from the leaderboard (the online validation set evaluation server) are still identical to the command line tool output with desired_recall_delta = 0.05 (exactly the numbers shown above), and they do not match the desired_recall_delta = 0.0001 results. I guess the online evaluation server's desired_recall_delta also needs to be updated.

Thanks for your help again!