Objectron is a dataset of short, object-centric video clips. In addition, the videos contain AR session metadata including camera poses, sparse point clouds, and planes. In each video, the camera moves around and above the object and captures it from different views. Each object is annotated with a 3D bounding box, which describes the object's position, orientation, and dimensions. The dataset contains about 15K annotated video clips and 4M annotated images in the following categories: bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes.
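For reference, the 3D bounding box described above is just a center, an orientation, and per-axis dimensions. A minimal sketch of recovering its 8 corners from that parameterization (the function name and argument conventions are my own, not the Objectron API) could look like this:

```python
import numpy as np
from itertools import product

def box_corners(center, rotation, dimensions):
    """Corners of an oriented 3D bounding box.

    center:     (3,) box center.
    rotation:   (3, 3) rotation matrix giving the box orientation.
    dimensions: (3,) full extents along the box's local axes.
    """
    half = np.asarray(dimensions, dtype=float) / 2.0
    # All sign combinations of the half-extents give the 8 local-frame corners.
    signs = np.array(list(product([-1.0, 1.0], repeat=3)))   # (8, 3)
    local = signs * half                                      # (8, 3)
    return (np.asarray(rotation) @ local.T).T + np.asarray(center)
```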
As mentioned in #46, directly fitting a 2D bbox to the projected vertices of the 3D bbox can be very inaccurate. For example, the actual 2D bbox is shown in green and the fitted one in red.
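For context, the "fitting" in question is just projecting the 3D box vertices into the image and taking the axis-aligned min/max. A minimal sketch of that (assuming the vertices are already in camera coordinates and you have a 3x3 intrinsic matrix; names are mine):

```python
import numpy as np

def fit_2d_bbox_from_3d(vertices_cam, intrinsics):
    """Naive 2D bbox: project the 3D box vertices and take the min/max.

    vertices_cam: (N, 3) box vertices in camera coordinates.
    intrinsics:   (3, 3) camera intrinsic matrix.
    """
    proj = (intrinsics @ vertices_cam.T).T        # (N, 3) homogeneous pixels
    pixels = proj[:, :2] / proj[:, 2:3]           # divide by depth
    x_min, y_min = pixels.min(axis=0)
    x_max, y_max = pixels.max(axis=0)
    return x_min, y_min, x_max, y_max
```

Because the 3D box is usually larger than the object itself, this rectangle over-estimates the true 2D extent, which is the inaccuracy being discussed.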
My $0.02: you can always run an off-the-shelf 2D object detector (like EfficientDet or YOLO), get the 2D bounding box, and check whether it falls within the 'over-sized' crop we get from the 3D bounding box.
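A rough illustration of that suggestion (using torchvision's pre-trained Faster R-CNN as the off-the-shelf detector instead of EfficientDet/YOLO, purely because it ships with torchvision; the containment check and score threshold are my own assumptions):

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

def detections_inside_crop(image_path, crop_box, score_thresh=0.5):
    """Run an off-the-shelf 2D detector and keep boxes that fall inside
    the over-sized crop derived from the projected 3D bounding box.

    crop_box: (x_min, y_min, x_max, y_max) of the over-sized crop.
    """
    # On newer torchvision, use weights="DEFAULT" instead of pretrained=True.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    model.eval()

    image = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        output = model([image])[0]

    cx_min, cy_min, cx_max, cy_max = crop_box
    kept = []
    for box, score in zip(output["boxes"], output["scores"]):
        if score < score_thresh:
            continue
        x_min, y_min, x_max, y_max = box.tolist()
        # Keep only detections fully contained in the over-sized crop.
        if x_min >= cx_min and y_min >= cy_min and x_max <= cx_max and y_max <= cy_max:
            kept.append((x_min, y_min, x_max, y_max, float(score)))
    return kept
```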
Is there a better way to get the 2D bbox?