chensong1995 opened this issue 4 years ago
Dear Song,
Thank you for your information and suggestion about our code!
We actually visualized the flipped boxes and reprojected them onto the 2D images when we developed the code, and they looked correct at that time.
Could you show an example of how the generated boxes come out wrong, so that I can get a more intuitive understanding of the problem?
As for flipping in the camera coordinate system versus the world coordinate system, I think the two are equivalent, because we also change the camera parameters accordingly when flipping the objects in the world coordinate system.
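For concreteness, here is a quick numpy sketch of that equivalence (the specific mirror matrices and the random test data below are purely illustrative, not taken from our code):

```python
import numpy as np

rng = np.random.default_rng(0)

# some valid world -> camera rotation (orthonormal by construction)
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))

F = np.diag([-1.0, 1.0, 1.0])  # mirror across the camera x axis
M = np.diag([-1.0, 1.0, 1.0])  # mirror across the world x axis

P_w = rng.normal(size=(3, 5))  # five 3D points in world coordinates

# option 1: transform to camera coordinates, then flip there
flipped_in_camera = F @ R @ P_w

# option 2: flip in world coordinates and adjust the extrinsics;
# R_adj is chosen so that R_adj @ M == F @ R (M is its own inverse)
R_adj = F @ R @ M
flipped_in_world = R_adj @ (M @ P_w)

assert np.allclose(flipped_in_camera, flipped_in_world)
```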
Best, Siyuan
On Thu, Jun 25, 2020 at 4:15 PM SONG, Chen notifications@github.com wrote:
Hello Siyuan,
First of all, thanks so much for your work. I learned a lot from reading your paper and code.
My understanding is that each 3D bounding box is parameterized by 3 basis vectors, 3 coefficients, and a 3D centroid. These parameters define the 3D bounding box in the world coordinate system. The extrinsic camera matrix R is the transformation from the world coordinate system to the camera coordinate system, so from p_homo = K * R * P we can recover the 2D image coordinates p_homo of a bounding box corner P in world space.
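To make the convention concrete, here is a minimal sketch of that parameterization and projection (the corner ordering and helper names below are my illustration, not the repo's exact implementation):

```python
import numpy as np

def corners_from_box(basis, coeffs, centroid):
    # All eight corners: centroid plus every signed combination of the
    # scaled basis vectors. This ordering is illustrative only; the repo
    # uses its own convention in get_corners_of_bb3d_no_index.
    corners = [centroid + sx * coeffs[0] * basis[0]
                        + sy * coeffs[1] * basis[1]
                        + sz * coeffs[2] * basis[2]
               for sx in (1, -1) for sy in (1, -1) for sz in (1, -1)]
    return np.array(corners)                 # (8, 3), world coordinates

def project(K, R, P_world):
    # p_homo = K * R * P: world -> camera -> homogeneous image coordinates
    p_homo = (K @ R @ P_world.T).T           # (8, 3)
    return p_homo[:, :2] / p_homo[:, 2:3]    # divide out depth to get pixels
```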
If my understanding is correct, when we perform image flipping in dataset preprocessing, we have to flip the 3D bounding box labels in the camera coordinate system, not in the world coordinate system. However, at this line https://github.com/thusiyuan/cooperative_scene_parsing/blob/master/preprocess/sunrgbd/sunrgbd_process.py#L363 and this line https://github.com/thusiyuan/cooperative_scene_parsing/blob/master/preprocess/sunrgbd/sunrgbd_process.py#L428, it appears to me that you are doing it in the world coordinate system directly.
This sometimes leads to errors. From my observation, changing the logic to the following reduces them:
```python
import numpy as np

# read camera parameters
K = self.meta['K'][idx]
R = self.meta['R'][idx]
yaw, pitch, roll = yaw_pitch_row_from_r(R)
if flip:
    R_old = R
    R = get_rotation_matrix_from_yaw_pitch_roll(-yaw, pitch, roll)
else:
    R = get_rotation_matrix_from_yaw_pitch_roll(yaw, pitch, roll)

# read 3D bounding boxes
num_boxes = len(self.meta['boxes'][idx])
raw_basis = np.array([self.meta['boxes'][idx][i]['basis'] for i in range(num_boxes)])
raw_coeffs = np.array([self.meta['boxes'][idx][i]['coeffs'] for i in range(num_boxes)])
raw_centroid = np.array([self.meta['boxes'][idx][i]['centroid'] for i in range(num_boxes)])

if flip:
    for i in range(num_boxes):
        # get 3D corners in the world space
        corners3d = get_corners_of_bb3d_no_index(raw_basis[i], raw_coeffs[i], raw_centroid[i])
        # get 3D corners in the camera space
        corners3d = np.matmul(R_old, corners3d.transpose()).transpose()
        # flip x axis
        corners3d[:, 0] = -corners3d[:, 0]
        # get 3D corners back in world space
        corners3d = np.matmul(R.transpose(), corners3d.transpose()).transpose()
        # extract centroid, basis, and coeffs from 3D corners
        raw_centroid[i] = corners3d.mean(axis=0)
        b0_with_scale = (corners3d[1] - corners3d[0]) / 2
        c0 = np.linalg.norm(b0_with_scale)
        b0 = b0_with_scale / c0
        b1_with_scale = (corners3d[1] - corners3d[2]) / 2
        c1 = np.linalg.norm(b1_with_scale)
        b1 = b1_with_scale / c1
        b2_with_scale = (corners3d[1] - corners3d[5]) / 2
        c2 = np.linalg.norm(b2_with_scale)
        b2 = b2_with_scale / c2
        raw_basis[i, 0] = -b0  # flip basis 0
        raw_basis[i, 1] = b1
        # keep b2 as [0, -1, 0] to avoid numerical issues
        raw_coeffs[i] = [-c0, c1, c2]
```
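As a sanity check, the flipped boxes should reproject onto the mirror image of the original projections. A sketch of that check, reusing the illustrative project and corners_from_box helpers from above (here orig_basis/orig_coeffs/orig_centroid are assumed copies of the labels taken before the flip, and W is the image width):

```python
# Illustrative check: orig_* are pre-flip copies of the labels, W is the
# image width; neither comes from the repo's code.
for i in range(num_boxes):
    uv_flip = project(K, R, corners_from_box(raw_basis[i], raw_coeffs[i], raw_centroid[i]))
    uv_orig = project(K, R_old, corners_from_box(orig_basis[i], orig_coeffs[i], orig_centroid[i]))
    uv_orig[:, 0] = W - 1 - uv_orig[:, 0]  # mirror u about the vertical axis
    # up to the ordering of the corners, uv_flip should now match uv_orig
```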
Looking forward to discussing this with you!