google-research-datasets / Objectron

Objectron is a dataset of short, object-centric video clips. The videos also contain AR session metadata, including camera poses, sparse point clouds, and planes. In each video, the camera moves around and above the object, capturing it from different views. Each object is annotated with a 3D bounding box that describes its position, orientation, and dimensions. The dataset contains about 15K annotated video clips and 4M annotated images in the following categories: bikes, books, bottles, cameras, cereal boxes, chairs, cups, laptops, and shoes.

Projecting detected planes into image coordinates #11

Open swtyree opened 4 years ago

swtyree commented 4 years ago

Hi, thanks for the cool dataset!

I have been tinkering with objectron-geometry-tutorial.ipynb, exploring the available metadata. I haven't been able to transform the extracted planes into image space for visualization. I tried the same procedure used to project the bounding box coordinates into image pixels, but it doesn't seem to work: I get many unreasonable values, e.g. coordinates that are negative or far larger than the image bounds.

Here's the code that I used:

import cv2
import numpy as np

# Plane vertices in homogeneous coordinates, transformed from the plane's
# local frame into world space.
plane_points = np.array([[v.x, v.y, v.z, 1] for v in plane.geometry.vertices])
plane_points_3d_world = transform @ plane_points.T

# World -> camera -> clip space.
plane_points_3d_cam = frame_view_matrix @ plane_points_3d_world
plane_points_2d_proj = frame_projection_matrix @ plane_points_3d_cam

# Perspective divide to normalized device coordinates.
plane_points2d_ndc = plane_points_2d_proj[:-1, :] / plane_points_2d_proj[-1, :]
plane_points2d_ndc = plane_points2d_ndc.T

# NDC -> pixel coordinates (axes swapped for the portrait-oriented frames,
# as in the bounding-box projection).
x = plane_points2d_ndc[:, 1]
y = plane_points2d_ndc[:, 0]
plane_points2d = np.copy(plane_points2d_ndc)
plane_points2d[:, 0] = ((1 + x) * 0.5) * width
plane_points2d[:, 1] = ((1 + y) * 0.5) * height

plane_points2d = np.round(plane_points2d).astype(np.int32)
for point_id in range(plane_points2d.shape[0]):
    cv2.circle(image, (plane_points2d[point_id, 0], plane_points2d[point_id, 1]),
               25, (0, 255, 255), -1)

Also, there's a small bug in the notebook in the definition of grab_frame. The line

current_frame = np.frombuffer(
        pipe.stdout.read(frame_size), dtype='uint8').reshape(width, height, 3)

has width and height transposed.
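I believe it should instead read (the raw frame buffer is row-major, so height comes first):

current_frame = np.frombuffer(
        pipe.stdout.read(frame_size), dtype='uint8').reshape(height, width, 3)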

Thanks for any help you can provide!

ahmadyan commented 4 years ago

Your code looks correct to me. The planes are estimated in 3D by the AR tracking system across multiple previous frames, so they are not limited to the current frame: they may extend beyond the image boundaries or even lie behind the camera (hence the negative values). I would also trust the planes in the later frames of a video more, since the tracking system has had more time to refine them. You can get more information about plane geometry from this reference.
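For example, you can flag those vertices before the perspective divide (a small sketch against your plane_points_2d_proj, assuming the usual OpenGL-style projection matrix, where w is positive for points in front of the camera):

w = plane_points_2d_proj[-1, :]
# w <= 0 means the vertex is at or behind the camera; dividing by a
# negative w also mirrors the NDC values, hence the implausible pixels.
behind_camera = w <= 0
ndc = plane_points_2d_proj[:2, :] / w
outside_image = np.any(np.abs(ndc) > 1.0, axis=0)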


It is easier to visualize in 3D:

[image: 3D visualization of the estimated plane]

If you want the plane points that are visible in the camera, you need to create a grid from the plane polygon, project each grid point, and check whether it falls inside the image.
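Something along these lines (just a sketch, reusing transform, frame_view_matrix, frame_projection_matrix, width, and height from your snippet; plane_grid_pixels is a hypothetical helper, and a proper version would also clip the grid to the polygon interior with a point-in-polygon test):

import numpy as np

def plane_grid_pixels(plane, transform, view, proj, width, height, n=20):
    """Sample an n x n grid over the plane polygon's local bounding
    rectangle, project it, and return the samples that land in the image."""
    local = np.array([[v.x, v.y, v.z] for v in plane.geometry.vertices])
    # AR planes are flat in their local frame, so a grid over local x/z
    # (at the polygon's local y) covers the plane surface.
    xs = np.linspace(local[:, 0].min(), local[:, 0].max(), n)
    zs = np.linspace(local[:, 2].min(), local[:, 2].max(), n)
    gx, gz = np.meshgrid(xs, zs)
    pts = np.stack([gx.ravel(),
                    np.full(gx.size, local[:, 1].mean()),
                    gz.ravel(),
                    np.ones(gx.size)])            # 4 x n^2, homogeneous
    clip = proj @ (view @ (transform @ pts))      # local -> world -> camera -> clip
    w = clip[-1]
    in_front = w > 0                              # drop points behind the camera
    ndc = clip[:2, in_front] / w[in_front]
    # Same NDC -> pixel convention as in your snippet (x/y swapped for the
    # portrait-oriented frames).
    px = (1 + ndc[1]) * 0.5 * width
    py = (1 + ndc[0]) * 0.5 * height
    inside = (px >= 0) & (px < width) & (py >= 0) & (py < height)
    return np.stack([px[inside], py[inside]], axis=1).astype(np.int32)

Each returned (x, y) pair can then be drawn with cv2.circle exactly as in your loop above.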

Also, thanks for the bug report. Currently the bike videos have an issue where ffmpeg does not properly detect the portrait orientation of the video. I will fix it in the next update.