Hello!
I'm currently trying to cast a 2D bounding box from the front camera feed into the vehicle frame. My approach is similar to the one in issue #66. I've also tried using the camera_model that has been released, through waymo_open_dataset.camera.ops.py_camera_model_ops.image_to_world.
However, I keep hitting a wall: the points that the pixels are backcast to are superimposed, as seen in the following render, which should show the lines of a pyramid going from the camera center to the projected points in the vehicle frame.
It looks as if there were just one line in the pyramid, when in fact there are 4. The camera is represented by the large axis object, and the small ones are the projected points I find for the 4 corners of a bounding box in the image.
Is py_camera_model_ops.image_to_world behaving as expected? When I add depth information to the pixel coordinates I use a value of 20, as in the tests written here. It seems strange to me that all the pixels are projected to the same point. Am I doing something wrong here?
Here is the code I'm using:
import numpy as np
import tensorflow as tf
from waymo_open_dataset.camera.ops import py_camera_model_ops
def project_image_to_vehicle(image, calibration, image_coordinates):
    """Projects from image coordinates to the global (world) frame.

    Arguments:
      image: CameraImage proto from dataset_pb2 for the chosen camera.
      calibration: CameraCalibration proto (including intrinsics/extrinsics).
      image_coordinates: Array of shape [N, 3]; each row is (u, v, depth).

    Returns:
      Array of shape [N, 3], with the latter dimension composed of (x, y, z).
    """
    extrinsic = tf.reshape(
        tf.constant(list(calibration.extrinsic.transform), dtype=tf.float32),
        [4, 4])
    intrinsic = tf.constant(list(calibration.intrinsic), dtype=tf.float32)
    metadata = tf.constant([
        calibration.width,
        calibration.height,
        calibration.rolling_shutter_direction,
    ], dtype=tf.int32)

    # Camera image metadata: vehicle pose at capture time (4x4, row-major),
    # followed by velocity, timestamp and rolling-shutter timing information.
    camera_image_metadata = list(image.pose.transform)
    camera_image_metadata.append(image.velocity.v_x)
    camera_image_metadata.append(image.velocity.v_y)
    camera_image_metadata.append(image.velocity.v_z)
    camera_image_metadata.append(image.velocity.w_x)
    camera_image_metadata.append(image.velocity.w_y)
    camera_image_metadata.append(image.velocity.w_z)
    camera_image_metadata.append(image.pose_timestamp)
    camera_image_metadata.append(image.shutter)
    camera_image_metadata.append(image.camera_trigger_time)
    camera_image_metadata.append(image.camera_readout_done_time)
    # camera_image_metadata = list(vehicle_pose) + [0.0] * 10

    # Perform the back-projection and return the (x, y, z) points.
    return py_camera_model_ops.image_to_world(extrinsic, intrinsic, metadata,
                                              camera_image_metadata,
                                              image_coordinates).numpy()

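# Example of the input layout I pass to the helper above (a sketch only; the
# pixel values are placeholders): each row of image_coordinates is (u, v, depth).
#
#   image_coordinates = np.array([[100.0, 100.0, 20.0],
#                                 [600.0, 100.0, 20.0],
#                                 [600.0, 400.0, 20.0],
#                                 [100.0, 400.0, 20.0]])
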
def cast_pixels_to_ray(bbox: np.ndarray, calibration, image):
    """Casts the 2D bbox corners to points along rays in the vehicle frame.

    Arguments:
      bbox: Array of shape (4, 2) with the 4 pixel coordinates of the box corners.
      calibration: Calibration proto for the given camera.
      image: Image proto for the given camera image.

    Returns:
      C: Position of the camera center in the vehicle frame.
      projected_pts: Array of shape (4, 3); one point along the ray each pixel
        is backcast to.
    """
    # calibration.intrinsic is [f_u, f_v, c_u, c_v, k1, k2, p1, p2, k3];
    # build the 3x3 pinhole matrix K from the first four entries.
    f_u, f_v, c_u, c_v = calibration.intrinsic[:4]
    K = np.array([
        [f_u, 0.0, c_u],
        [0.0, f_v, c_v],
        [0.0, 0.0, 1.0],
    ])
    # calibration.extrinsic.transform maps camera coordinates to vehicle coordinates.
    extrinsic = np.array(calibration.extrinsic.transform).reshape(4, 4)
    vehicle_to_camera = np.linalg.inv(extrinsic)
    # Homogeneous transform from vehicle to camera coords:
    # [R|-R@C] where R is the rotation matrix and C is the center of the camera.
    R_h = vehicle_to_camera[:3, :4]
    # Axes transform: the camera frame has x forward, y left, z up, whereas the
    # image frame has x right (u), y down (v), z forward (depth).
    tr = np.array([
        [0.0, -1.0, 0.0],
        [0.0, 0.0, -1.0],
        [1.0, 0.0, 0.0],
    ])
    # The projection from vehicle frame to image frame:
    # P = K@[R|-R@C] = [M|-M@C]
    P = K @ tr @ R_h
    # Center of the camera:
    # C = -M^-1 @ p_4, where M = K@R and p_4 is the 4th column of P.
    # (This should match extrinsic[:3, 3], the camera position in the vehicle frame.)
    C = -np.linalg.inv(P[:3, :3]) @ P[:, 3]
    # This is what I would use to transform pixels to the vehicle frame.
    # Not using it because the camera model has further considerations
    # (e.g. rolling shutter).
    backproject = np.linalg.pinv(P)
    # Is it safe to assume that any depth value can be used?
    # Append a depth of 20 to every corner, as in the camera model tests.
    bbox = np.insert(bbox, 2, values=np.full(4, 20.0), axis=1)
    projected_pts = project_image_to_vehicle(bbox, calibration, image)
    return C, projected_pts
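For reference, this is the kind of check I would run to see whether the four corners really collapse to a single point. It is only a sketch: frame is assumed to be an already-loaded dataset_pb2.Frame, and the corner pixels below are arbitrary but well separated in the 1920x1280 front image.

from waymo_open_dataset import dataset_pb2

def check_distinct_projections(frame):
    """Backcasts four well-separated pixels and prints their pairwise distances."""
    calibration = next(c for c in frame.context.camera_calibrations
                       if c.name == dataset_pb2.CameraName.FRONT)
    image = next(im for im in frame.images
                 if im.name == dataset_pb2.CameraName.FRONT)
    # Four widely separated pixels; cast_pixels_to_ray appends a depth of 20 to each.
    corners = np.array([
        [100.0, 100.0],
        [1800.0, 100.0],
        [1800.0, 1200.0],
        [100.0, 1200.0],
    ])
    C, pts = cast_pixels_to_ray(corners, calibration, image)
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    print("camera center:", C)
    print("projected points:\n", pts)
    # If image_to_world behaves as expected, these distances should be clearly
    # non-zero; if they are all ~0, the corners really are being superimposed.
    print("pairwise distances:\n", dists)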