waymo-research / waymo-open-dataset
Waymo Open Dataset -- https://www.waymo.com/open

Thanks for the excellent dataset. A few brief questions #192

johnwlambert opened this issue 4 years ago

johnwlambert commented 4 years ago

Hi Waymo Research, thank you for providing this excellent dataset to the community. A few brief questions:

  1. Would it be possible to add a Python equivalent of create_submission for 3D tracking, to avoid the large Bazel build dependency?

  2. How is the "no-label zone" defined?

  3. What is the true positive threshold (in meters) for MOTA?

peisun1115 commented 4 years ago
  1. I found it quite slow to iterate through all objects natively in Python, but you can write one of your own very quickly -- it is almost the same as this, just with a few more submission fields populated (see the sketch below).
  2. It is an area that we don't want to label: https://github.com/waymo-research/waymo-open-dataset/blob/master/docs/labeling_specifications.md#no-label-zones
  3. We use an IoU threshold of 0.7 for vehicles and 0.5 for pedestrians and cyclists.
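
For reference, a minimal Python-only sketch of writing 3D tracking predictions into a metrics_pb2.Objects proto (field names follow waymo_open_dataset.protos.metrics_pb2, as in the repo's create_pd_file_example.py; wrapping the result into a full Submission proto for upload is omitted here and should be checked against submission.proto):

from waymo_open_dataset import label_pb2
from waymo_open_dataset.protos import metrics_pb2

def write_tracking_predictions(tracks, output_path):
    """tracks: iterable of dicts with context_name, timestamp, box parameters, track id (string), class, score."""
    objects = metrics_pb2.Objects()
    for t in tracks:
        o = metrics_pb2.Object()
        o.context_name = t['context_name']
        o.frame_timestamp_micros = t['timestamp']
        o.score = t['score']
        o.object.id = t['track_id']                   # tracking submissions need a persistent track id
        o.object.type = label_pb2.Label.TYPE_VEHICLE  # map your own class strings to label_pb2 types
        box = o.object.box
        box.center_x, box.center_y, box.center_z = t['center_x'], t['center_y'], t['center_z']
        box.length, box.width, box.height = t['length'], t['width'], t['height']
        box.heading = t['heading']
        objects.objects.append(o)
    with open(output_path, 'wb') as f:
        f.write(objects.SerializeToString())
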
johnwlambert commented 4 years ago

Thanks very much @peisun1115. Regarding (3): a question about MOTA/L2 -- I believe I misunderstood this part. "L2" has nothing to do with the "L2" distance (a TP threshold in meters, as in Argoverse), but rather refers to the "difficult" Level 2 category?

A few other brief questions, if you have a moment:

  1. Track Death process (Sec 5.2 of Waymo Open Dataset Paper) -- I see how track scores are updated, but what are the criteria for birth and death? ("For our birth and death process, we simply increment the score of the track with the associated detection score if seen, decrement by a fixed cost (0.3) if the track is unmatched, and provide a floor and ceiling of the score [0, 3].")

  2. Are images in the TFRecords currently undistorted? If not, is there a particular reason? For example, Argoverse releases undistorted images in order to simplify the camera model.

  3. Question about the egovehicle-to-camera projection with the Waymo dataset calibration. I'm trying to implement a simple projection that ignores rolling shutter, but the uv values come out completely wrong for a detection that should be right in front of the egovehicle. Any thoughts on what I might be doing wrong? (Code in NumPy, since I use PyTorch.)

    [attached: ring_front_center_1557962316360932352 -- screenshot of the incorrect projection]
import numpy as np
from scipy.spatial.transform import Rotation

RING_IMAGE_SIZES = {
    # width x height
    'ring_front_center': (1920, 1280),
    'ring_front_left':  (1920, 1280),
    'ring_side_left': (1920, 886),
    'ring_side_right': (1920,886)
}

def SE3_from_R_t(rotation, translation):
    """Build a 4x4 SE(3) matrix that maps points from the src frame to the dst frame."""
    dst_SE3_src = np.eye(4)
    dst_SE3_src[:3,:3] = rotation
    dst_SE3_src[:3,3] = translation
    return dst_SE3_src

def apply_SE3(dst_SE3_src, pts_src):
    """Transform an (N, 3) array of points from the src frame to the dst frame."""
    num_pts = pts_src.shape[0]
    homogeneous_pts = np.hstack([pts_src, np.ones((num_pts, 1))])
    dst_pts_h = homogeneous_pts.dot(dst_SE3_src.T)
    return dst_pts_h[:, :3] # remove homogeneous

def proj_egovehicle_to_image(egovehicle_points_h, camera_calib):
    """Project (N, 3) egovehicle-frame points into the image, ignoring rolling shutter and lens distortion."""
    f_u, f_v = camera_calib["f_u"], camera_calib["f_v"]
    c_u, c_v = camera_calib["c_u"], camera_calib["c_v"]
    K = np.array(
        [
            [f_u, 0, c_u],
            [ 0 , f_v, c_v],
            [ 0, 0, 1]
        ])
    egovehicle_SE3_camera = SE3_from_R_t(
        rotation=np.array(camera_calib["rotation"]).reshape(3,3),
        translation=camera_calib["translation"]
    )
    waymocam_SE3_egovehicle = np.linalg.inv(egovehicle_SE3_camera)
    standardcam_SE3_waymocam = SE3_from_R_t(
        rotation=rotY(-90).dot(rotX(90)),
        translation=np.zeros(3)
    )
    standardcam_SE3_egovehicle = standardcam_SE3_waymocam.dot(waymocam_SE3_egovehicle)
    uv_cam = apply_SE3(standardcam_SE3_egovehicle, egovehicle_points_h)
    uv = K.dot(uv_cam.T).T

    uv[0:2, :] /= uv[2, :]  # bug (see the fix in the reply below): uv is (N, 3) here, so this divides the wrong slices
    uv = uv.T
    uv = uv[:, :2]

    x_valid = np.logical_and(0 <= uv[:, 0], uv[:, 0] < img_width)
    y_valid = np.logical_and(0 <= uv[:, 1], uv[:, 1] < img_height)
    z_valid = uv_cam[2, :] > 0  # bug (see the fix in the reply below): uv_cam is (N, 3), so column 2 should be indexed
    valid_pts_bool = np.logical_and(np.logical_and(x_valid, y_valid), z_valid)
    return uv, uv_cam, valid_pts_bool

def rotX(deg: float):
    """
    Compute rotation matrix about the X-axis.
    Args:
    -   deg: in degrees
    """
    t = np.deg2rad(deg)
    return Rotation.from_euler('x', t).as_dcm()

def rotY(deg: float):
    """
    Compute rotation matrix about the Y-axis.
    Args
    -   deg: in degrees
    """
    t = np.deg2rad(deg)
    return Rotation.from_euler('y', t).as_dcm()

def heading_to_rotmat(yaw):
    """Rotation about the z-axis by the given yaw (in radians)."""
    return Rotation.from_euler('z', yaw).as_dcm()

def get_3d_bbox_egovehicle_frame(label):
    """Return the 8 corners of the label's 3D box in the egovehicle frame as an (8, 3) array."""
    # 3D bounding box corners. (Convention: x points forward, y to the left, z up.)
    x_corners = label["length"] / 2 * np.array([1, 1, 1, 1, -1, -1, -1, -1])
    y_corners = label["width"] / 2 * np.array([1, -1, -1, 1, 1, -1, -1, 1])
    z_corners = label["height"] / 2 * np.array([1, 1, -1, -1, 1, 1, -1, -1])
    corners_obj_frame = np.vstack((x_corners, y_corners, z_corners)).T

    ego_T_obj = np.array([ label["center_x"], label["center_y"], label["center_z"] ])
    egovehicle_SE3_object = SE3_from_R_t(
        rotation=heading_to_rotmat(label["heading"]),
        translation=ego_T_obj,
    )
    egovehicle_pts = apply_SE3(egovehicle_SE3_object, corners_obj_frame)
    return egovehicle_pts

if __name__ == '__main__':
    """ """
    log_id = '10868756386479184868_3000_000_3020_000'
    camera_name = 'ring_front_center'
    img_width, img_height = RING_IMAGE_SIZES[camera_name]
    camera_calib = {
        'f_u': 2073.4584367052157,
        'f_v': 2073.4584367052157,
        'c_u': 966.2333094058336,
        'c_v': 649.3695587122872,
        "rotation": [
             9.99950565e-01,  3.26395653e-04, -9.93782217e-03,
            -3.60193593e-04,  9.99994157e-01, -3.39934050e-03,
             9.93665458e-03,  3.40275199e-03,  9.99944841e-01
            ],
        "translation": [1.5442434431318834, -0.02298913550094409, 2.1157042633666543],
        "img_width": img_width,
        "img_height": img_height
    }
    det = { 
        'center_x': 11.750835418701172, 
        'center_y': -0.10202184319496155, 
        'center_z': 1.1061757802963257,
        'heading': -0.014,
        'length': 5.668055534362793, 
        'width': 2.4795494079589844, 
        'height': 2.12661075592041,
        'timestamp': 1557962317112437, 
        'label_class': 'VEHICLE', 
        'score': 0.9673179388046265, 
        'context_name': '10868756386479184868_3000_000_3020_000'
    }
    egovehicle_points_h = get_3d_bbox_egovehicle_frame(det)
    uv, uv_cam, valid_pts_bool = proj_egovehicle_to_image(egovehicle_points_h, camera_calib)
    print(uv)
xmyqsh commented 4 years ago
(The script is identical to the one above, except for two lines in proj_egovehicle_to_image, marked below.)

    uv_cam = apply_SE3(standardcam_SE3_egovehicle, egovehicle_points_h)
    uv = K.dot(uv_cam.T)  # fixed: drop the trailing .T so uv stays (3, N) for the normalization below

    uv[0:2, :] /= uv[2, :]
    uv = uv.T
    uv = uv[:, :2]

    x_valid = np.logical_and(0 <= uv[:, 0], uv[:, 0] < img_width)
    y_valid = np.logical_and(0 <= uv[:, 1], uv[:, 1] < img_height)
    z_valid = uv_cam[:, 2] > 0  # fixed: uv_cam is (N, 3), so index column 2
    valid_pts_bool = np.logical_and(np.logical_and(x_valid, y_valid), z_valid)
    return uv, uv_cam, valid_pts_bool

Running the full script now prints:
[[ 787.52427901  662.04192137]
 [1181.79169642  660.67807689]
 [1183.29599767  999.80708269]
 [ 788.38557168 1000.29337767]
 [ 628.72974776  656.03295542]
 [1326.01590599  653.59076274]
 [1329.09797554 1254.72144892]
 [ 629.79807667 1254.4118628 ]]

@johnwlambert For 6, it is correct now. But this is just a camera projection and does not solve the rolling shutter problem. For 5, the TFRecords store the undistorted image so that the rolling shutter problem can still be resolved; an image left with its original lens distortion could not be used to resolve rolling shutter later. By the way, the TFRecords also contain a field with the precomputed lidar-point-to-pixel projection, which already accounts for rolling shutter (see the sketch at the end of this comment).

The undistorted image, together with the more complicated rolling-shutter-aware camera model, could be used for feature-level lidar-point-to-camera-pixel projection, which is what I'm considering doing.

Please correct me if there is any misunderstanding.
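
As a pointer for the precomputed projection field mentioned above, a minimal sketch of reading it with the official utilities (function names follow waymo_open_dataset.utils.frame_utils from the v1.x tutorial; newer releases may return extra values from parse_range_image_and_camera_projection, so check your installed version):

import tensorflow as tf
from waymo_open_dataset import dataset_pb2
from waymo_open_dataset.utils import frame_utils

dataset = tf.data.TFRecordDataset('segment.tfrecord', compression_type='')
for data in dataset:
    frame = dataset_pb2.Frame()
    frame.ParseFromString(bytearray(data.numpy()))
    # camera_projections holds, per lidar return, the (camera_name, x, y) pixel it lands on,
    # already accounting for rolling shutter.
    range_images, camera_projections, range_image_top_pose = (
        frame_utils.parse_range_image_and_camera_projection(frame))
    points, cp_points = frame_utils.convert_range_image_to_point_cloud(
        frame, range_images, camera_projections, range_image_top_pose)
    break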

xmyqsh commented 4 years ago

@johnwlambert For 4, are the birth-and-death criteria you described above intended for the tracker or for the tracker evaluator? AB3DMOT and Apollo simply count 3 matches for birth and 2 misses for death. What does your argoverse_cbgs_kf_tracker do? And what about introducing the birth-and-death criteria you described above into AB3DMOT?
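
For concreteness, a sketch of the two schemes side by side (the 0.3 decrement and [0, 3] clamp are taken from the paper excerpt quoted above; min_hits = 3 and max_age = 2 mirror the AB3DMOT-style counts mentioned here; any actual birth/death threshold on the score is an assumption, since that is exactly what the question asks about):

def update_track_score(score, matched, det_score=0.0,
                       unmatched_cost=0.3, floor=0.0, ceiling=3.0):
    """Score update from Sec 5.2 of the paper: add the matched detection score,
    subtract a fixed cost when unmatched, clamp to [floor, ceiling]."""
    score = score + det_score if matched else score - unmatched_cost
    return min(max(score, floor), ceiling)

class CountBasedTrack:
    """AB3DMOT-style rule: confirmed ("born") after min_hits matches, killed after max_age misses."""
    def __init__(self, min_hits=3, max_age=2):
        self.min_hits, self.max_age = min_hits, max_age
        self.hits = 0      # matched updates so far
        self.misses = 0    # consecutive frames without a match

    def update(self, matched):
        if matched:
            self.hits += 1
            self.misses = 0
        else:
            self.misses += 1

    def confirmed(self):
        return self.hits >= self.min_hits

    def dead(self):
        return self.misses > self.max_age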

xmyqsh commented 4 years ago
    # first option (what the code above uses):
    standardcam_SE3_waymocam = SE3_from_R_t(
        rotation=rotY(-90).dot(rotX(90)),
        translation=np.zeros(3)
    )
    # second option:
    standardcam_SE3_waymocam = SE3_from_R_t(
        rotation=rotY(90).dot(rotX(-90)),
        translation=np.zeros(3)
    )

@johnwlambert For converting the vehicle FLU frame to the camera RDF frame, which one is correct? I feel the second one is correct -- correct me if I am wrong.

peisun1115 commented 4 years ago

@johnwlambert

Thanks a lot to @xmyqsh for answering 4, 5, and 6. To clarify (3): yes, L2 refers to difficulty Level 2; we do not use a distance threshold to determine true positives. For (4), we use the standard MOTA metric; track death/birth is captured by the metric itself.
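
For reference, the standard CLEAR-MOT accumulation this refers to (a sketch; in the Waymo setting a "match" is an IoU hit at 0.7 / 0.5 rather than a distance threshold):

def mota(num_misses, num_false_positives, num_id_switches, num_gt):
    """MOTA = 1 - (FN + FP + ID switches) / number of ground-truth objects, summed over frames."""
    return 1.0 - (num_misses + num_false_positives + num_id_switches) / num_gt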

johnwlambert commented 4 years ago

@xmyqsh Thanks very much for catching that bug. Things are looking much better now:

[attached GIF: 1105338229944737854_1280_000_1300_000_ring_front_center_10fps]

@peisun1115 Thanks for those answers. Regarding (4): I've always understood Bernardin's MOT metrics as evaluation metrics, and thus they don't define protocols for how a tracker should be implemented (e.g. how it should allow tracks to be born/die). Could you clarify that portion of your paper a bit more?

7) Was there a particular reason LiDAR returns provided in the dataset are clipped to 75m? And was the PPBA baseline detector (detections provided on dataset website) also trained on such truncated LiDAR returns?

8) Any possibility of releasing a NumPy version of the devkit, instead of TensorFlow, for those of us committed to PyTorch to the end? :-)

johnwlambert commented 4 years ago

@xmyqsh Didn't catch what you meant by "FLU" or "RDF" -- what do those abbreviations stand for? Indeed, from the following figures, it would appear as if your Euler angles were correct:

    standardcam_SE3_waymocam = SE3_from_R_t(
        rotation=rotY(90).dot(rotX(-90)),
        translation=np.zeros(3)
    )

Standard camera coordinate frame:

[figure: Screen Shot 2020-06-04 at 2 30 40 PM]

Waymo camera coordinate frame (bottom):

[figure: Screen Shot 2020-06-04 at 2 29 38 PM]

However, in practice, I find those Euler angles give the wrong results, and the following is required:

    standardcam_SE3_waymocam = SE3_from_R_t(
        rotation=rotY(-90).dot(rotX(90)),
        translation=np.zeros(3)
    )

@peisun1115 any idea why that might be?
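
A quick numeric check of the two candidate rotations (a standalone sketch; it uses as_matrix(), which replaces the as_dcm() call deprecated in newer SciPy):

import numpy as np
from scipy.spatial.transform import Rotation

def rot(axis, deg):
    return Rotation.from_euler(axis, deg, degrees=True).as_matrix()

R1 = rot('y', -90).dot(rot('x', 90))   # rotY(-90).dot(rotX(90))
R2 = rot('y', 90).dot(rot('x', -90))   # rotY(90).dot(rotX(-90))

for name, v in [('forward', np.array([1., 0., 0.])),
                ('left',    np.array([0., 1., 0.])),
                ('up',      np.array([0., 0., 1.]))]:
    print(name, 'R1:', np.round(R1.dot(v)), 'R2:', np.round(R2.dot(v)))

# R1 maps forward -> +z, left -> -x (so right -> +x), and up -> -y (so down -> +y),
# i.e. FLU components go to RDF components, matching the form found to work empirically.
# R2 maps forward -> -z, which puts points behind the camera.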

xmyqsh commented 4 years ago

@johnwlambert "FLU" means front-left-up, which aligns to the x-y-z coors, which aligns to red-green-blue(dot) as the Laser: TOP shows, or the Vehicle coors. "RDF" means right-down-front, which aligns to the x-y-z coors, which should also align to red-green(cross)-blue, which should be the camera front's. Because in the image coors, the width aligns x and the length aligns y and the origin aligns at the left-top corner of the image. So x-y should be aligned to "RD"(right-down). And then, from the right-hand-side rule, we can get z axis should be aligned to the front, not back.

    standardcam_SE3_waymocam = SE3_from_R_t(
        rotation=rotY(90).dot(rotX(-90)),
        translation=np.zeros(3)
    )

So my Euler-angle conversion corresponds to the standard camera coordinate frame shown in the first image, which is also what KITTI and many others adopt.

For the Euler angles you provided, which are the ones that work for the Waymo setup:

    standardcam_SE3_waymocam = SE3_from_R_t(
        rotation=rotY(-90).dot(rotX(90)),
        translation=np.zeros(3)
    )

Let us derive it step by step and see what we get. We start in "FLU", the LIDAR TOP / vehicle frame, and we stay in a right-handed system throughout. Following rotation=rotY(-90).dot(rotX(90)): first, rotX(90) rotates about the x-axis by 90 degrees (counter-clockwise in a right-handed system), taking "FLU" (front-left-up) to "FUR" (front-up-right). Then rotY(-90) rotates about the y-axis by -90 degrees, i.e. clockwise 90, taking "FUR" to "RUB" (right-up-back). That is consistent neither with the Waymo front camera's red-green-blue(dot) axes, i.e. "FLU", shown in the last Waymo camera image, nor with the standard camera coordinate system "RDF".

Hmm, as we can see, Waymo may use a non-standard camera coordinate system, and we can conclude that there must be some "ghost" hidden in the rotation of the camera extrinsic parameters.

In my opinion, two things would have to change to arrive at the regular camera coordinate system. First, remove the ghost from the rotation of the camera extrinsics; then the standard camera Euler conversion I provided can be used:

    standardcam_SE3_waymocam = SE3_from_R_t(
        rotation=rotY(90).dot(rotX(-90)),
        translation=np.zeros(3)
    )

Second, change the camera frame in the last image from "FLU" (red/front, green/left, blue(dot)/up) to "RDF" (red/right, green(cross)/down, blue/front), so that it aligns with the standard camera coordinate system.

@peisun1115 Can you clarify the reasoning behind Waymo's camera coordinate system?

One thing we can infer, though, is that Waymo does not plan to use the camera for 3D detection, which is time-consuming -- that is also my choice. For far-away regions of interest, I mean ranges of 100+ m (SIGN could also be detected from the lidar point cloud combined with features extracted from the image), 2D detection is enough, and far-away 3D detection cannot be reliable.

xmyqsh commented 4 years ago

@johnwlambert For 8, it is interesting that you always tie NumPy to PyTorch -- TensorFlow uses NumPy as well. Most NumPy expressions can be replaced by PyTorch or TensorFlow ops when designing a deep learning system, as detectron2 and waymo-open-dataset do. That lets more tensor ops run in parallel on the GPU/CPU and be optimized through the computational graph. For example, if you implement the box encoder/decoder in NumPy, you block on those ops until they finish; if you implement them in PyTorch or TensorFlow, they run in parallel with the other tensor ops.
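
To make the point concrete, a small sketch of keeping a box-residual encoding inside the framework instead of NumPy (PyTorch here; the residual formula is only an illustrative simplification, not the encoding any particular codebase uses):

import torch

def encode_boxes(boxes: torch.Tensor, anchors: torch.Tensor) -> torch.Tensor:
    """boxes, anchors: (N, 7) tensors of (x, y, z, l, w, h, heading).
    Runs on the GPU and stays inside autograd, instead of a blocking NumPy round-trip."""
    center_res = (boxes[:, :3] - anchors[:, :3]) / anchors[:, 3:6]
    size_res = torch.log(boxes[:, 3:6] / anchors[:, 3:6])
    heading_res = boxes[:, 6:7] - anchors[:, 6:7]
    return torch.cat([center_res, size_res, heading_res], dim=1)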

I have to say the waymo-open-dataset repo does not aim to be a devkit; it is just a data accessor. lingvo wrote its own devkit by duplicating and reformatting the Waymo TFRecords. I also wrote a converter from Waymo TFRecords to the nuScenes dataset format and use the nuScenes devkit for development.

@peisun1115 What do you think is the best way to consume the Waymo Open Dataset? I'm planning to open-source my new method FSN, and I'd like to follow your advice, since it may be the first third-party repo built on the Waymo Open Dataset.

peisun1115 commented 4 years ago

A lot of information here. Let me know if anything else needs my input.

  1. @johnwlambert Waymo's camera frame (NOT the image frame) is defined in the paper. For example code that projects from the global frame to the camera sensor frame and then to the image frame without rolling shutter, you can refer to this (a simplified pinhole sketch follows below).

  2. @xmyqsh I think you can define your own converter if that is what you are asking. Let me know if you need anything else from me.
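
Given the camera frame definition in the paper (x pointing out of the lens, y left, z up), the ideal pinhole projection, ignoring distortion and rolling shutter, reduces to the following sketch (the referenced code covers the full model; this is only the simplified geometry):

import numpy as np

def camera_frame_to_image(pts_cam, f_u, f_v, c_u, c_v):
    """pts_cam: (N, 3) points in the Waymo camera frame (x forward, y left, z up).
    Returns (N, 2) pixel coordinates and a mask of points in front of the camera."""
    x, y, z = pts_cam[:, 0], pts_cam[:, 1], pts_cam[:, 2]
    u = c_u - f_u * y / x
    v = c_v - f_v * z / x
    return np.stack([u, v], axis=1), x > 0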

xmyqsh commented 4 years ago

Aha, so the "ghost" sits between the camera frame and the image frame. No one is incorrect; the ghost wins.

johnwlambert commented 4 years ago

@peisun1115 Thanks for your response and for looking through this issue. I consider the image frame to be 2D, and I call the 3D frame whose z-axis points along the camera's optical axis the "camera coordinate frame". Are you referring to a 3D frame as the "image frame"?

Any clarification on (4) or (7)? :) (The track birth/death criteria in the paper, and the rationale for truncating LiDAR returns to 75 m.)