ego_dist Detection Evaluation

valGavin commented 3 years ago

Hello, nuScenes contributors. Thank you for providing the nuscenes-devkit; it's been a great help for my research. However, there's something I need to ask about the 3D Detection Evaluation code.

What I've done:

I converted the datasets to KITTI format using the nuscenes_gt_to_kitti presents in the export_kitti.py
I run the detection and dump the result into JSON file using kitti_res_to_nuscenes
I run the evaluation using the detection/evaluate.py script
I tested the evaluation using mapillary_val.json

The problem:

I found that all my boxes are filtered out due to the high ego_dist value (larger than the default max_dist)
I went deeper to the code and found that the ego_dist values are calculated from the box translation (x, y, z); correct me if I'm wrong
I compared my JSON file with the one from mapillary and found that the translation difference is significant. My x and y values are 100 times smaller than the one I observed in mapillary's. I don't know how mapillary can get such values

Note:

I re-run the kitti_res_to_nuscenes using the validation groundtruth (nuScenes -> KITTI -> JSON), and re-run the detection/evaluate.py.
I got the similar ego_dist like mine and everything is filtered out; logically, it shouldn't be like that, right?

I'm sorry for the long question, and I hope you can provide some answer and solution. Thank you in advance.

holger-motional commented 3 years ago

Hi. I believe the difference is that KITTI stores the annotations/predictions in local coordinates, whereas nuScenes stores them in global coordinates, e.g. [1500.37409205876, 2970.935516773256, 0.8198208637558824], which is 1500m East, 2970m North (?), and 0.8m up, relative to the map coordinate frame. The function kitti_res_to_nuscenes does not know what map and sample you are operating on, hence it cannot map to global coordinates. You'd probably need to add that conversion.

valGavin commented 3 years ago

Thank you for your response.

kaushik333 commented 3 years ago

Hi @holger-nutonomy @valGavin

I think my question falls under this bin. If not I can, repost a separate question:

So I am making use of a 3D object detection framework which has been written for kitti dataset i.e. their code database shows results on kitti. Now I am trying to perform inference on nuScenes data (model still trained on kitti) and I am using v1.0_mini data to do so. After I perform prediction, I try to visualize the 3D bounding boxes. However, I find that the boxes are off by some transformation matrix. I find this strange, because export_kitti.py does exactly this: convert nuscenes format+coordinates to that of kitti. Here is a visualization of how the prediction boxes look like. Can you please provide some insight into how I can go about debugging this? Thank you

As you can see its predicting a total of 4 things, but its out of frame.

Why are the converted point-clouds so sparse? Is the point-cloud by LiDAR in nuScenes dataset already quite sparse since u have other sources to get point-cloud such as RADAR?

kaushik333 commented 3 years ago

I also noticed that the predicted bbox values re in some normalized coordinate system whereas nuscenes2kitti conversion does it in some global coordinate system as @holger-nutonomy mentioned above. Can you please help me on how to perform this conversion ? The predictions from my model (trained on kitti) are in normalized coordinates. Is there a way to convert it to global coordinates? Any help you can provide is much appreciated. Thank you very much.

valGavin commented 3 years ago

Hey @kaushik333 . As mentioned by @holger-nutonomy , this is a coordinate issue. I did a modification in the python-sdk/nuscenes/utils/kitti.py script.

Under get_boxes function, there are four steps provided by the original code. I added another step between 4: Transform to nuScenes LIDAR coord system and Set score or Nan:

# 5. Translate the box center point to follow the nuScenes global coordinates
if (pose_record is not None) and (cs_record is not None):
    box.rotate(Quaternion(cs_record['rotation']))
    box.translate(np.array(cs_record['translation']))
    box.rotate(Quaternion(pose_record['rotation']))
    box.translate(np.array(pose_record['translation']))

Which require you to pass the cs_record and pose_record value to this function. You can get those values using:

if coord_transform:
    sample = self.nusc.get('sample', sample_token)
    sample_data = self.nusc.get('sample_data', sample['data']['LIDAR_TOP'])
    pose_record = self.nusc.get('ego_pose', sample_data['ego_pose_token'])
    cs_record = self.nusc.get('calibrated_sensor', sample_data['calibrated_sensor_token'])

by providing the sample_token of the corresponding image.

It worked well with me. I hope this helps your issue.

kaushik333 commented 3 years ago

Hey @kaushik333 . As mentioned by @holger-nutonomy , this is a coordinate issue. I did a modification in the python-sdk/nuscenes/utils/kitti.py script.

Under get_boxes function, there are four steps provided by the original code. I added another step between 4: Transform to nuScenes LIDAR coord system and Set score or Nan:
# 5. Translate the box center point to follow the nuScenes global coordinates
if (pose_record is not None) and (cs_record is not None):
    box.rotate(Quaternion(cs_record['rotation']))
    box.translate(np.array(cs_record['translation']))
    box.rotate(Quaternion(pose_record['rotation']))
    box.translate(np.array(pose_record['translation']))
Which require you to pass the cs_record and pose_record value to this function. You can get those values using:
if coord_transform:
    sample = self.nusc.get('sample', sample_token)
    sample_data = self.nusc.get('sample_data', sample['data']['LIDAR_TOP'])
    pose_record = self.nusc.get('ego_pose', sample_data['ego_pose_token'])
    cs_record = self.nusc.get('calibrated_sensor', sample_data['calibrated_sensor_token'])
by providing the sample_token of the corresponding image.

It worked well with me. I hope this helps your issue.

Thanks for this response. Sorry for the delayed reply. Well turns out that because I was using a model pretrained on kitti to perform inference on NuScenes, I was facing this issue. I just retrained the network on NuScenes and now it seems to perform as expected.

Although I'm extremely surprised by the outcome because, I make use of only the lidar sensor values and not the radar. This is a very sparse point cloud. Inspite of this I achieve reasonable bbox deductions (visually). Although numbers wise, my performance is crap. Any thoughts towards this? @holger-nutonomy @valGavin ?

abhi1kumar commented 2 years ago

Hi @holger-motional I too encountered the same problem, and therefore, arrived at this thread.

KITTI stores the annotations/predictions in local coordinates, whereas nuScenes stores them in global coordinates,

Thank you for this insightful reply. I followed up with @valGavin fix

Under get_boxes function, there are four steps provided by the original code. I added another step between 4: Transform to nuScenes LIDAR coord system and Set score or Nan:
# 5. Translate the box center point to follow the nuScenes global coordinates
if (pose_record is not None) and (cs_record is not None):
box.rotate(Quaternion(cs_record['rotation']))
box.translate(np.array(cs_record['translation']))
box.rotate(Quaternion(pose_record['rotation']))
box.translate(np.array(pose_record['translation']))
Which require you to pass the cs_record and pose_record value to this function in kitti_res_to_nuscenes() in export_kitti.py
if coord_transform:
sample = self.nusc.get('sample', sample_token)
sample_data = self.nusc.get('sample_data', sample['data']['LIDAR_TOP'])
pose_record = self.nusc.get('ego_pose', sample_data['ego_pose_token'])
cs_record = self.nusc.get('calibrated_sensor', sample_data['calibrated_sensor_token'])

After applying this fix, I expected AP3D to be insanely high (close to 1.00) for all classes, and ATE, ASE and AOE to go down to zero. AVE and AAE at minimum are OK because KITTI format does not have any attribute or velocity labels as mentioned here.

However, even after applying his fix, I see the following outputs

mAP: 0.1610
mATE: 1.0000
mASE: 1.0000
mAOE: 1.0000
mAVE: 1.0000
mAAE: 1.0000
NDS: 0.0805
Eval time: 17.5s

Per-class results:
Object Class    AP  ATE ASE AOE AVE AAE
         car    0.160   1.000   1.000   1.000   1.000   1.000
       truck    0.196   1.000   1.000   1.000   1.000   1.000
         bus    0.322   1.000   1.000   1.000   1.000   1.000
     trailer    0.209   1.000   1.000   1.000   1.000   1.000
construction    0.115   1.000   1.000   1.000   1.000   1.000
  pedestrian    0.102   1.000   1.000   1.000   1.000   1.000
  motorcycle    0.214   1.000   1.000   1.000   1.000   1.000
     bicycle    0.084   1.000   1.000   1.000   1.000   1.000
traffic_cone    0.073   1.000   1.000   nan nan nan
     barrier    0.137   1.000   1.000   1.000   nan nan

Is it normal to get such low AP values for car and such high translational errors and scale errors even after using val ground truth for evaluation? It would be great if you have any insights in this regard.

I am also posting the screenshot of entire run for your reference output

holger-motional commented 2 years ago

@abhi1kumar There is definitely a bug in your code. Other submissions have far better results: https://www.nuscenes.org/object-detection. With ground-truth all numbers should be 1.

abhi1kumar commented 2 years ago

@abhi1kumar There is definitely a bug in your code. Other submissions have far better results: https://www.nuscenes.org/object-detection. With ground-truth all numbers should be 1.

Thank you @holger-motional for your quick reply. Yes, I completely agree with you. Others have obtained much higher numbers while testing. And, so using val ground truth (oracle) should definitely have much higher numbers.

I had another related question. Does nuScenes consider the outputs of all six cameras for evaluation or if a single camera data is given, it only considers single camera data for evaluation? If your answer is the former, I know the reason behind this 16% for cars. I am testing the front cameras. So, this 16% AP on cars is because one of the cameras is fully correct while other five cameras are completely wrong.

holger-motional commented 2 years ago

I guess that's your problem :-). nuScenes has annotations from 360 degrees and uses all of them for evaluation. If you want to only evaluate on the front camera, you would have to drop the ground-truth boxes that fall into all other cameras.

abhi1kumar commented 2 years ago

Thank you once again @holger-motional for your quick reply on my issue.

The official evaluation uses six cameras, and therefore, I too have to use all the six cameras for proper benchmarking of our method.

holger-motional commented 2 years ago

@abhi1kumar I guess you can just run it on each camera, combine the boxes, run non-maximum suppression and get the final set of results.

abhi1kumar commented 2 years ago

@holger-motional Thank you once again for helping me out. This is what I have been trying to achieve. I am doing monocular 3D object detection, and my KITTI training pipeline is set. Hence, all I wanted was to convert nuScenes images to the KITTI format, train with the nuScenes images with the KITTI pipeline, get the results in the KITTI format, convert to nuScenes format, and finally upload to the nuScenes server.

The CAM_FRONT camera for nuscenes gets converted to KITTI format without errors. However, other cameras throw the error as mentioned here Going by your answer, other cameras are ambigious, and they throw an assertion error (since the rotations are no longer identities).

I did try commenting out the assertion https://github.com/nutonomy/nuscenes-devkit/blob/864d0a207539e5383cd3eb26ebb1d7a44622f09d/python-sdk/nuscenes/scripts/export_kitti.py#L151 to get kitti style calib and label files.

However, when I try to convert these label files back to nuScenes format, the following error pops up.

ValueError("Matrix must be orthogonal, i.e. its transpose should be its inverse")

at this line https://github.com/nutonomy/nuscenes-devkit/blob/864d0a207539e5383cd3eb26ebb1d7a44622f09d/python-sdk/nuscenes/utils/kitti.py#L326 because the matrix can not be inverted.

Therefore, do you know of any public repo which output the ground truths in the local camera coordinates (for each of the cameras) in the nuscenes. I am neither using LiDAR nor radar data. So, anything which brings objects in the local camera coordinates and converts back to global coordinates for nuscenes images should be fine.

PS - This code looks to do the same, but I have not tested it out.

holger-motional commented 2 years ago

@abhi1kumar Unfortunately I am not aware of any such code :-(.

nutonomy / nuscenes-devkit