beunguk closed this issue 5 years ago.
Hi, how are you going to use that 3D box on the camera image? Is it just for visualization? If yes, one way I can think of is to use the projected_lidar_labels to create the drawing you want by treating the top and bottom edges of the rectangle as the diagonal lines of the top/bottom surfaces.
Pei
Hi, Thank you for your reply.
However, this is not how I would like to draw cuboids. Treating the top and bottom lines as diagonals of the top/bottom surfaces could make the visualization wrong; projected cuboids on the camera images should match the actual corners of the cuboids.
In other words, I would like to use center_x, center_y, center_z, length, width, height as the cuboid representation and project them into the camera images using the extrinsic and intrinsic matrices. I tried that, but the projected cuboids seem slightly shifted.
I attached the images I have now (red boxes are projected_lidar_labels; the rest are the cuboids I projected onto the images). It works well on the FRONT camera image, presumably because its extrinsic matrix is almost the identity, but in the other images the cuboids are slightly shifted.
Beunguk
It looks like this is because your projection algorithm is not quite accurate? Your projection should cover the same area as projected_lidar_labels.
Note that our cameras are rolling shutter cameras. This has a non-trivial effect on the side cameras when the SDC is moving at high speed (say > 30 MPH). But that does not seem to be the case in your example: the SDC seems to be moving very slowly or even static, judging by the scene. Let me know if this is not the case.
If this is not caused by the rolling shutter effect, I think something is wrong in your projection algorithm.
Note: we are planning to release a projection lib that takes the rolling shutter effect into account, depending on community interest. No ETA yet.
One more note: the existing dataset provides all parameters needed for a user to implement their own projection algorithm that takes the rolling shutter effect into account.
Yes, the projections should cover the same area as the projected lidar labels.
I tried to figure out the problem, but I couldn't... In the documentation, center_x, center_y, center_z are in the vehicle frame, and the calibration (CameraCalibration) is described as 'vehicle frame to camera frame'. So it should just be a matrix multiplication using the extrinsic (rotation, translation) and then the intrinsic matrix.
What am I missing? What else should I consider to get the right results? Is there any sample code or pseudocode for this?
Thanks, Beunguk
As I mentioned above, one possibility is the rolling shutter effect, which might not be the problem in your case, as the SDC seems to be moving slowly based on the camera images.
One thing to note is the camera frame definition.
The camera frame is placed in the center of the camera lens. The x-axis points down the lens barrel out of the lens. The z-axis points up. The y/z plane is parallel to the sensor plane. The coordinate system is right handed.
So the y/z plane of the camera sensor frame is parallel to the image. When you do the intrinsic transform, do something like: u = -y / x (width direction), v = -z / x (height direction).
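To make that convention concrete, here is a minimal sketch (my own illustration, not official code) of projecting a point that is already in the camera sensor frame, assuming f_u, f_v, c_u, c_v are the first four entries of CameraCalibration.intrinsic and ignoring distortion and rolling shutter:

```python
import numpy as np

def camera_frame_to_pixel(p_cam, f_u, f_v, c_u, c_v):
    # Waymo camera sensor frame: +x out of the lens, +z up, right-handed
    # (so +y points left). Distortion and rolling shutter are ignored here.
    x, y, z = p_cam
    u = -y / x * f_u + c_u  # image width direction
    v = -z / x * f_v + c_v  # image height direction
    return np.array([u, v])
```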
@beunguk I am working on drawing the 3D projections. Could you please share some details?
Cheers, Shuo
@peisun1115 I tried using the coordinates from the laser_labels, multiplying by the extrinsic matrix from camera_calibrations and then by the intrinsic matrix, the same as @beunguk.
However, I get extremely large x and y values. Is there anything wrong with my method?
Cheers
Perhaps the cuboids (Frame.laser_labels) are implicitly stamped at Frame.timestamp_micros, and the discrepancy in the images above is due not to the camera shutter but rather to the camera-lidar sync. @beunguk what differences do you see between, say, CameraImage.pose_timestamp and Frame.timestamp_micros? It's sad that these stamps are not both nanosecond timestamps :P but perhaps this difference might illuminate the problem.
That said, the differences in the images above look to be a bit more than 100ms or so. It looks more like the camera-lidar matching in the actual Frame is wrong, as if the camera images got put in the wrong Frame object. I haven't looked at how the lidar scans project onto the side cameras, though... that might disprove this hypothesis.
@peisun1115 What would honestly be really helpful would be nanosecond timestamps for the lidar scans, camera images, and all labels (e.g. perhaps just add a timestamp member to Label: https://github.com/waymo-research/waymo-open-dataset/blob/master/waymo_open_dataset/label.proto#L20 ). In comparison, Argoverse and nuScenes / Lyft Level 5 all have this timestamp info, so things like projecting cuboids, interpolating labels, etc. are all very easy. Without those timestamps, I'm not sure it's even possible to compare, say, a tracking algorithm between Waymo Open and these other datasets, because one would not be able to use the Waymo side cameras & cuboids with the error shown in @beunguk's images.
It would also be nice if motorcycles were broken out of the vehicle class, which would be on par with other datasets.
@pwais Our camera and lidar are well synchronized. We have statistics for all the data released. The maximum error, calculated from the timestamps at which the lidar/camera scan the same physical point, is bounded at [-6ms, 7ms] with >99.999% confidence, which is super good (I doubt nuScenes or Lyft have anything close). This is calculated from real data in this dataset (you can calculate it too).
Nanosecond timestamps are not the problem here. Our projected_lidar_labels are computed from the data available in this dataset; we did not use any other information. So it is likely that something is wrong in the projection code used. We are planning to release rolling shutter projection code, but unfortunately we are still going through legal.
If you have the projection code, I can help take a look.
@YanShuo1992
Can you copy-paste your code? I can help take a look. Very likely, you did not use the coordinates in the camera frame correctly. When you multiply by the intrinsic matrix, it needs to be something like the following, given a point (x, y, z) in the camera frame:
u_d = -y / x
v_d = -z / x
// apply the distortion model on u_d, v_d ... (code to apply distortion omitted) ...
u_d = u_d * f_u + c_u
v_d = v_d * f_v + c_v
Wow, the synchronization is really good! Is the camera/lidar delta time ([-6ms, 7ms]) recorded directly on the car, or is it somehow processed (e.g. manually aligned after the raw data is recorded)?
@peisun1115 I have no doubt that your lidar-camera sync could be the best on the planet, but then what happened in @beunguk's examples? I'm not hypothesizing a problem with lidar-camera sync, but rather that the Frames themselves might have the wrong content, which would materialize as an error that looks similar to bad lidar-camera sync. It's really troubling to see this error, because one does not get results like this doing straightforward cuboid-to-image projections in other datasets.
@peisun1115 Today, how does one recover the timestamp of a camera image? Is it CameraImage.pose_timestamp + CameraImage.camera_readout_done_time? It would be helpful to have this documented somewhere, especially because these fields are not read anywhere in the code in this repo. Having examples of data usage in the repo is critical to communicating the semantics of the data.
In order to avoid simple projection problems and other errors as demonstrated in this GitHub issue, it sure would be helpful to have a means for exporting the Waymo data to a more well-established format like KITTI (see e.g. the nuScenes exporter at https://github.com/nutonomy/nuscenes-devkit/blob/master/python-sdk/nuscenes/scripts/export_kitti.py ). There's probably no expectation that this might be made available any time soon, but even the tensorflow/models and TPU teams have made an effort to support the MSCOCO format (despite certain drawbacks of that format).
```python
FILENAME = '/content/waymo-od/tutorial/frames'
FILENAME = 'segment-933621182106051783_4160_000_4180_000.tfrecord'
dataset = tf.data.TFRecordDataset(FILENAME, compression_type='')
for data in dataset:
    frame = open_dataset.Frame()
    frame.ParseFromString(bytearray(data.numpy()))
    break

calibrations = sorted(frame.context.camera_calibrations, key=lambda c: c.name)
c = calibrations[0]  # only need the front camera
extrinsic = np.reshape(np.array(c.extrinsic.transform), [4, 4])
extrinsic = extrinsic[0:3]
intrinsic = np.reshape(np.array(c.intrinsic), [3, 3])

laser_labels = frame.laser_labels
for l in laser_labels:
    k_mat = np.reshape(np.array([l.box.center_x, l.box.center_y, l.box.center_z, 1]), [4, 1])
    p = np.dot(extrinsic, k_mat)
    p = np.dot(intrinsic, p)
    x = p[0] / p[2]
    y = p[1] / p[2]
```
@peisun1115 This is my code. I found some slides about 3D coordinate projection, but I still have many questions. Could you please help me check the code and give me a hint?
Cheers
@YanShuo1992 I think the problem is that your code interprets the camera intrinsics incorrectly. The documentation is unfortunately confusing here. CameraCalibration.intrinsic is NOT (despite its name) the camera's intrinsic matrix (or camera matrix K), but rather a list containing the parameters of K as well as the distortion model coefficients. See: https://github.com/waymo-research/waymo-open-dataset/blob/master/waymo_open_dataset/dataset.proto#L91 You might want something like:
```python
f_u, f_v, c_u, c_v, k_1, k_2, p_1, p_2, k_3 = c.intrinsic
K = np.array([
    [f_u,   0, c_u],
    [  0, f_v, c_v],
    [  0,   0,   1]])
```
For the demo ( https://colab.research.google.com/github/waymo-research/waymo-open-dataset/blob/master/tutorial/tutorial.ipynb ), I see a K of:

```
[[2.05555615e+03 0.00000000e+00 9.39657470e+02]
 [0.00000000e+00 2.05555615e+03 6.41072182e+02]
 [0.00000000e+00 0.00000000e+00 1.00000000e+00]]
```

Be mindful of their comment: "Note that this intrinsic corresponds to the images after scaling" -- it appears the units of the parameters they provide are not in pixels. I'm not sure where to look up the image size... CameraImage ironically has no size parameters, just the jpeg image. (Most good data schemas embed the image dimensions because it's cheap to do so and saves the user the cost of having to decode the image to get them.)
While the Waymo authors don't specify the distortion model exactly, I guess we're supposed to assume the one documented at OpenCV's website: https://docs.opencv.org/2.4/modules/calib3d/doc/camera_calibration_and_3d_reconstruction.html Be careful though, some of the OpenCV code in the calibration module has slight differences between versions.
Perhaps we'll get some unambiguous symbol grounding for Waymo's data format when they provide example code for projecting cuboid labels into the camera frame. I might be wrong in my own interpretation, but it's clear that CameraCalibration.intrinsic is not K.
Here is an example code snippet that projects a point to image without taking rolling shutter and distortion into account:
```python
import tensorflow as tf

def project_point(point, camera_calibration):
    """Projects a 3D point in the vehicle frame to image coordinates.

    Ignores rolling shutter and lens distortion.
    """
    # Vehicle frame to camera sensor frame.
    extrinsic = tf.reshape(camera_calibration.extrinsic.transform, [4, 4])
    vehicle_to_sensor = tf.matrix_inverse(extrinsic)  # tf.linalg.inv in TF 2.x
    point1 = list(point) + [1.0]  # homogeneous coordinates
    point_camera_frame = tf.einsum(
        'ij,j->i', vehicle_to_sensor, tf.constant(point1, dtype=tf.float32))
    u_d = -point_camera_frame[1] / point_camera_frame[0]
    v_d = -point_camera_frame[2] / point_camera_frame[0]
    # Add distortion model here if you'd like.
    f_u = camera_calibration.intrinsic[0]
    f_v = camera_calibration.intrinsic[1]
    c_u = camera_calibration.intrinsic[2]
    c_v = camera_calibration.intrinsic[3]
    u_d = u_d * f_u + c_u
    v_d = v_d * f_v + c_v
    return [u_d.numpy(), v_d.numpy()]
```
I have tested this code on the Waymo Open Dataset and it worked for the example flagged by @beunguk.
We have documented the data format (including distortion model) here: https://github.com/waymo-research/waymo-open-dataset/blob/master/waymo_open_dataset/dataset.proto#L91-L100
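For anyone following along, here is a minimal usage sketch of the snippet above (my own, assuming `frame` has been parsed as in the earlier snippet in this thread and that eager execution is enabled):

```python
from waymo_open_dataset import dataset_pb2 as open_dataset

# Map camera name -> calibration for convenient lookup.
calib_by_name = {c.name: c for c in frame.context.camera_calibrations}
front_calib = calib_by_name[open_dataset.CameraName.FRONT]

# Project each 3D label's box center into the FRONT image.
for label in frame.laser_labels:
    center = [label.box.center_x, label.box.center_y, label.box.center_z]
    u, v = project_point(center, front_calib)
    print(label.id, u, v)
```

Each projected center should land inside the corresponding red projected_lidar_labels box for the FRONT camera (modulo distortion and rolling shutter).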
Thank you @peisun1115. A few questions:
1) Will there be more details and/or code on what actual distortion model is assumed? The citation in dataset.proto that you highlight is only marginally helpful, and to be honest actual code would help dispel all confusion (see #3 below).
2) Will there be any answer about recovering the camera image timestamp from the given protobuf data, or will we have to wait for the Waymo-legal-approved rolling shutter code example release to see that?
3) Your code sample implies that CameraCalibration.extrinsic is actually camera-to-vehicle, while the comment in dataset.proto clearly claims CameraCalibration.extrinsic is "Vehicle frame to camera frame": https://github.com/waymo-research/waymo-open-dataset/blob/master/waymo_open_dataset/dataset.proto#L101 . If one were simply reading the protobuf comment, then the tf.matrix_inverse() should not be necessary; perhaps that's why @beunguk's examples were off (by approximately 2m of translation for the side cameras). Perhaps what happened is that somebody read "vehicle from camera frame" and translated the comment to "vehicle to camera frame" in the open source code.
Do you know if this is the only extrinsic transform in the dataset where the value embedded in the protobuf record does not match the semantics of the documentation? (FWIW, this example is a key reason why other frameworks like ROS encapsulate the source and destination frames along with the serialized transformation matrix... it's too easy to get confused.)
4) Is there any reason (in this code sample as well as the larger repo) that you guys insist on using Tensorflow (versus numpy) for all matrix algebra, as well as Einstein notation? The Tensorflow teams have shown a commitment to making things accessible, and these choices make the sample code relatively more esoteric. While I appreciate that TPUs require the use of Tensorflow operators, TPUs are not (yet?) ubiquitous in the research community. Moreover, as this repo demonstrates, there is interest in shedding the very heavy Bazel dependency, if not TFRecords and Tensorflow as well: https://github.com/gdlg/simple-waymo-open-dataset-reader
@peisun1115 Thanks for the code. I tested it and it works well; I get similar boxes to @beunguk's. I find the 3D boxes of objects in front are slightly large but generally correct. However, the 3D boxes of objects to the side are incorrect even in the front camera images. I don't think rolling shutter and lens distortion would cause such differences. I am also wondering about a parameter named 'heading' in the laser_labels: how do I rotate +x to the surface normal of the SDC front face? Could you please give me some details about it?
@pwais Thank you for the comments. I also found the "Vehicle frame to camera frame" comment, which confused me. LOL
@YanShuo1992 Probably because of the way you compute the box corners? heading is just the yaw; I was trying to be precise when describing it. I can check in a simple util function to compute box corners, which should resolve the confusion.
@pwais
```c++
k1 = calibration_.intrinsic(4);
k2 = calibration_.intrinsic(5);
k3 = calibration_.intrinsic(6);  // same as p1 in OpenCV.
k4 = calibration_.intrinsic(7);  // same as p2 in OpenCV.
k5 = calibration_.intrinsic(8);  // same as k3 in OpenCV.

r2 = u_n * u_n + v_n * v_n;
r4 = r2 * r2;
r6 = r4 * r2;
r_d = 1.0 + k1 * r2 + k2 * r4 + k5 * r6;

// If the radial distortion is too large, the computed coordinates will
// be unreasonable (might even flip signs).
if (r_d < kMinRadialDistortion || r_d > kMaxRadialDistortion) {
  return false;
}

u_nd = u_n * r_d + 2.0 * k3 * u_n * v_n + k4 * (r2 + 2.0 * u_n * u_n);
v_nd = v_n * r_d + k3 * (r2 + 2.0 * v_n * v_n) + 2.0 * k4 * u_n * v_n;
```
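If it helps anyone, here is my attempt at a direct Python translation of the snippet above, using the dataset.proto naming (f_u, f_v, c_u, c_v, k1, k2, p1, p2, k3). Treat it as a sketch: the radial-distortion bounds kMinRadialDistortion/kMaxRadialDistortion are not public, so the values below are placeholders.

```python
def distort_to_pixel(u_n, v_n, intrinsic):
    # intrinsic = [f_u, f_v, c_u, c_v, k1, k2, p1, p2, k3] per dataset.proto.
    f_u, f_v, c_u, c_v = intrinsic[0:4]
    k1, k2 = intrinsic[4], intrinsic[5]
    p1, p2 = intrinsic[6], intrinsic[7]  # tangential terms (k3/k4 above)
    k3 = intrinsic[8]                    # k5 above, same as OpenCV's k3

    r2 = u_n * u_n + v_n * v_n
    r4 = r2 * r2
    r6 = r4 * r2
    r_d = 1.0 + k1 * r2 + k2 * r4 + k3 * r6
    # Reject points with an unreasonable radial distortion factor.
    # Placeholder bounds; the original constants are not published here.
    if not (0.8 < r_d < 1.2):
        return None

    u_nd = u_n * r_d + 2.0 * p1 * u_n * v_n + p2 * (r2 + 2.0 * u_n * u_n)
    v_nd = v_n * r_d + p1 * (r2 + 2.0 * v_n * v_n) + 2.0 * p2 * u_n * v_n
    # Finally scale/shift the distorted normalized coordinates into pixels.
    return u_nd * f_u + c_u, v_nd * f_v + c_v
```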
2. There is no notion of a single timestamp for an image in the context of a rolling shutter camera; each pixel has its own timestamp. camera_image.pose_timestamp is the timestamp of the image center.
3. I have fixed the comment in the codebase. We have consistent definitions of pose transforms and extrinsics; if you see any comment that differs from the others, then the comment is wrong, not the code. I've scanned through the codebase and did not find more.
4. There is no particular reason. I am more familiar with the tensorflow libs and find einsum very clean to use :)
@YanShuo1992 Without including distortion, projecting points outside of the camera FOV is very likely to work very poorly. That might be the reason. It would be helpful if you could copy/paste your projection results (only for objects inside the camera image's FOV).
@peisun1115
I only draw three points instead of the boxes.
```python
p1 = project_point([l.box.center_x - 0.5 * length,
                    l.box.center_y - 0.5 * width,
                    l.box.center_z - 0.5 * height], calibrations[index])  # blue squares
p2 = project_point([l.box.center_x + 0.5 * length,
                    l.box.center_y + 0.5 * width,
                    l.box.center_z + 0.5 * height], calibrations[index])  # red squares
pc = project_point([l.box.center_x, l.box.center_y, l.box.center_z],
                   calibrations[index])  # green squares
```
The red boxes are the projected_lidar_labels in the frame. I see a margin between the objects and the projected_lidar_labels.
You don't use the heading when computing points from the box? Is it 0? How fast is the SDC moving in the scene you selected (check the pose difference)? If you do not want to worry about rolling shutter, focus on the front camera first, then worry about the side cameras.
@peisun1115 The heading is not 0; I am just not sure how to use it. I think the projected 3D points would land on the corners of the projected_lidar_labels if I used the heading information, is that correct?
Sorry, I don't know what SDC means either. I read the comments in dataset.proto and assume it is a parameter about velocity. Do larger values cause more rolling shutter? How does the SDC affect the projection?
You can try this function to get box corners.
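For reference, here is a rough sketch of one common way to compute the eight corners from center/size/heading (my own interpretation, not the official util mentioned above, so please double-check against it once released):

```python
import numpy as np

def box_corners_vehicle_frame(box):
    # box: waymo_open_dataset label Box (center_x/y/z, length/width/height, heading).
    # Returns an (8, 3) array of corners in the vehicle frame, assuming length
    # is along the heading direction and heading is a yaw about +z.
    c, s = np.cos(box.heading), np.sin(box.heading)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])  # yaw about +z
    l, w, h = box.length, box.width, box.height
    # Axis-aligned corners centered at the origin, then rotate and translate.
    x = np.array([ 1,  1,  1,  1, -1, -1, -1, -1]) * (l / 2.0)
    y = np.array([ 1, -1, -1,  1,  1, -1, -1,  1]) * (w / 2.0)
    z = np.array([ 1,  1, -1, -1,  1,  1, -1, -1]) * (h / 2.0)
    corners = np.stack([x, y, z], axis=0)  # (3, 8)
    center = np.array([[box.center_x], [box.center_y], [box.center_z]])
    return (rot @ corners + center).T      # (8, 3)
```

Each row can then be passed to project_point above to draw the cuboid edges.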
@peisun1115 Do only the files named xxx_with_camera_labels.tfrecord contain the corresponding image frames?
All files contain images (if that is what you meant by 'image frame'). Files with the suffix '_with_camera_labels.tfrecord' also contain 2D image labels labeled by humans. All files contain 2D labels projected from lidar (see projected_lidar_labels).
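A quick way to check this on a given frame (a sketch, assuming the field names from dataset.proto):

```python
# `frame` is a parsed open_dataset.Frame, as in the earlier snippets.
has_human_camera_labels = any(len(cl.labels) > 0 for cl in frame.camera_labels)
has_projected_labels = any(len(cl.labels) > 0 for cl in frame.projected_lidar_labels)
print('human-labeled 2D camera labels:', has_human_camera_labels)
print('2D labels projected from lidar:', has_projected_labels)
```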
@peisun1115 Thanks for your quick reply.
@peisun1115 I think there might be an axis transform / extrinsic rotation missing from your demo code (or implicit and opaque), and perhaps that led to some of the confusion in @YanShuo1992's images. In particular, it appears that the camera extrinsics do not account for e.g. the x-z axis swap and the y-axis inversion that are rolled into the extrinsics published in at least three other open datasets.
If one wants to compute a pixel-frame 2D point p from an ego-frame 3D point P using the standard method p = K [R | T] P (ignoring distortion for now...), then roughly:
```python
n = 0  # Front camera?
camera_calibration = frame.context.camera_calibrations[n]
extrinsic = tf.reshape(camera_calibration.extrinsic.transform, [4, 4])
RT = tf.matrix_inverse(extrinsic).numpy()

f_u = camera_calibration.intrinsic[0]
f_v = camera_calibration.intrinsic[1]
c_u = camera_calibration.intrinsic[2]
c_v = camera_calibration.intrinsic[3]
K = np.array([
    [f_u,   0, c_u],
    [  0, f_v, c_v],
    [  0,   0,   1],
])

p = K * RT * P  # NOPE!
```
BUT the above won't work, because the extrinsic transform RT in the TFRecord files appears to maintain the same axes as the ego frame (e.g. +z up), while the camera frame is traditionally +z depth (+x in the ego frame). I had to do something like this, at least for the front camera:
```python
p_cam = RT.dot(P.T)
# Move into camera sensor frame
p_cam = p_cam[(2, 1, 0), :]
p_cam = p_cam[(1, 0, 2), :]  # ??
p_cam[1, :] *= -1  # ??
p_cam[0, :] *= -1  # ??
p = K.dot(p_cam)
p[:2, :] /= p[2, :]
```
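For what it's worth, under the x-forward / z-up sensor-frame definition quoted earlier, the "??" swaps above should collapse into a single fixed axis permutation. A sketch (my own derivation, not official code):

```python
import numpy as np

# Maps a point from the Waymo camera sensor frame (x forward out of the lens,
# y left, z up) into a conventional OpenCV-style camera frame
# (x right, y down, z forward).
WAYMO_CAM_TO_OPENCV_CAM = np.array([
    [0, -1,  0],   # x_cv = -y_waymo
    [0,  0, -1],   # y_cv = -z_waymo
    [1,  0,  0],   # z_cv =  x_waymo
], dtype=np.float64)
```

With that, K @ (WAYMO_CAM_TO_OPENCV_CAM @ p_cam[:3]) followed by the usual divide-by-z reproduces u = -y/x * f_u + c_u and v = -z/x * f_v + c_v.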
@peisun1115 It would really be helpful if the documentation of the calibration data in dataset.proto were improved, e.g. explaining exactly what CameraCalibration.extrinsic is intended to be, and whether the data really follows the same convention in all the TFRecord files. The Waymo dataset here and elsewhere strays from convention for no apparent reason. While I appreciate that there are unfortunately different conventions at odds for basic things (e.g. encoding of quaternions, Euler angles, basic matrix multiplication versus Einstein notation, numpy vs Tensorflow), it would be helpful if Waymo provided data and code on par with that provided in other datasets like nuScenes, Lyft Level 5, and Argoverse. Waymo has already elided key classes like motorcycles and road obstacles (e.g. cones), as well as chosen a particularly disconcerting and unfriendly legal position towards the sharing of model weights. It's frustrating to then have problems with basic things like trying to use the camera calibration parameters correctly.
@pwais
I can copy this to our code (dataset.proto) to clarify:
u_d = -y / x
v_d = -z / x
u_d = u_d * f_u + c_u
v_d = v_d * f_v + c_v
3. I think the way our camera sensor frame is defined (x-forward) is a little non-conventional. We tried to clarify that on our [website](https://waymo.com/open/data/). Other than that, I think the way the camera model is defined is pretty standard; it is pretty much the same as the CMU slides mentioned in point 2. What is the 'convention' you are talking about?
I missed this earlier, but the repo I previously linked to has solid support for projecting 3D cuboids to images, and it handles the label box properly. The code is very concise, close to one half or one third the amount of code versus Waymo's equivalent. Here's the repo: https://github.com/gdlg/simple-waymo-open-dataset-reader
Additional notable features:
Examples using the code in that repo below. Note a couple of things:
The projection lib they have is a good way to demo the data format. Note that they don't take distortion and rolling shutter into account.
Note that TFRecord is officially supported by tensorflow. The repo you linked re-implements some of the logic in the official tensorflow TFRecord reader, but it misses features such as the CRC check, and there may be compatibility issues in the future. Feel free to use it if it meets your needs; we prefer to stay with the officially supported reader in tensorflow for now.
We have lots of dependencies because we have other code in this repo; for example, we provide libs to build models, tf ops to do eval, and C++ code to do eval. We try to keep our code quality high. Users are welcome to write their own code (e.g. the repo you linked) if they only need part of the functionality; the repo you linked is a good example of that.
Regarding the labels: please refer to the labeling policy we publish. Also keep in mind that projection has errors (especially if you don't take distortion and rolling shutter into account); check the labels in the lidar 3D view.
Yes, we only label objects within 75m for 3D labels. Again, please check the lidar 3D view for 3D labels.
I am closing this issue as I think we have clarified the lidar->camera projection and you guys are able to make it roughly work. As we mentioned in the thread, we are planning to release a projection lib but we don't have an ETA yet. Please stay tuned.
Note: the Simple Waymo Open Dataset Reader doesn't check CRC codes (though that might be irrelevant given noise in the Waymo labels). However, if you do need a TFRecord reader that checks CRC codes, you might check out Apache Beam's reader here: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/tfrecordio.py#L67 The cited Beam code was authored by a Google engineer and Google Cloud Sales Engineers are pushing Beam pretty hard onto customers (e.g. Google Dataflow), so it's likely to stay updated.
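For example, a minimal Beam pipeline that reads the records with CRC checking might look like this (a sketch of my own; the segment path is a placeholder):

```python
import apache_beam as beam
from waymo_open_dataset import dataset_pb2

def context_name(raw_record):
    # Each record is the serialized bytes of one Frame proto.
    frame = dataset_pb2.Frame()
    frame.ParseFromString(raw_record)
    return frame.context.name

with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | beam.io.ReadFromTFRecord('segment-xxxx.tfrecord')  # placeholder path
        | beam.Map(context_name)
        | beam.Map(print)
    )
```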
The Tensorflow-free options cited above allow:
@beunguk Sorry, I have a question: when you visualize the projected_lidar_labels, how do you confirm which camera image they belong to?
Use CameraName.Name to confirm.
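Concretely, something like this (a sketch, assuming the Frame fields from dataset.proto):

```python
from waymo_open_dataset import dataset_pb2 as open_dataset

# Group images and projected labels by camera name so they can be overlaid.
images_by_camera = {image.name: image for image in frame.images}
labels_by_camera = {cl.name: cl.labels for cl in frame.projected_lidar_labels}

front_image = images_by_camera[open_dataset.CameraName.FRONT]
front_labels = labels_by_camera.get(open_dataset.CameraName.FRONT, [])
```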
Hi,
I'm trying to draw 3D cuboids on the 2D camera images, so that all the corners appear in the camera images. I can see there is projected_lidar_labels, which gives 2D bounding boxes for the cuboids, but this is not what I want to draw. For example, https://www.nuscenes.org/public/images/road.jpg is the kind of projected image I would like to make.
I tried to use CameraCalibration and laser_labels in the Context to draw the cuboids, but I still couldn't figure it out. It seems like the cuboids don't align well with the objects in the camera images.
Thanks.