thusiyuan / cooperative_scene_parsing

Code for NeurIPS 2018: Cooperative Holistic Scene Understanding: Unifying 3D Object, Layout, and Camera Pose Estimation
http://siyuanhuang.com/cooperative_parsing/main.html
MIT License

Image flipping #14

Open chensong1995 opened 4 years ago

chensong1995 commented 4 years ago

Hello Siyuan,

First of all, thanks so much for your work. I learned a lot from reading your paper and code.

My understanding is that each 3D bounding box is parameterized by 3 basis vectors, 3 coefficients, and a 3D centroid. These parameters define the 3D bounding box in the world coordinate system. The extrinsic camera matrix R is the transformation from the world coordinate system to the camera coordinate system; therefore, from p_homo = K * R * P, we can recover the 2D image coordinates p_homo of a bounding box corner P in world space.
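For concreteness, that projection step p_homo = K * R * P can be sketched as follows; the intrinsics and extrinsics here are made-up values for illustration only:

```python
import numpy as np

# Hypothetical camera parameters, not values from the dataset.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)  # world -> camera rotation

def project(K, R, P):
    """Project a 3D world point P to 2D pixel coordinates via p_homo = K @ R @ P."""
    p_homo = K @ R @ P
    return p_homo[:2] / p_homo[2]  # perspective divide

P = np.array([0.5, 0.2, 2.0])  # a bounding-box corner in world space
u, v = project(K, R, P)
```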

If my understanding is correct, when we perform image flipping in dataset preprocessing, we have to flip the 3D bounding box labels in the camera coordinate system, not in the world coordinate system. However, at this line (https://github.com/thusiyuan/cooperative_scene_parsing/blob/master/preprocess/sunrgbd/sunrgbd_process.py#L363) and this line (https://github.com/thusiyuan/cooperative_scene_parsing/blob/master/preprocess/sunrgbd/sunrgbd_process.py#L428), it appears to me that you are doing it in the world coordinate system directly.

This sometimes leads to errors. From my observation, changing the logic to the following reduces them:

        # read camera parameters
        K = self.meta['K'][idx]
        R = self.meta['R'][idx]
        yaw, pitch, roll = yaw_pitch_row_from_r(R)
        if flip:
            R_old = R
            R = get_rotation_matrix_from_yaw_pitch_roll(-yaw, pitch, roll)
        else:
            R = get_rotation_matrix_from_yaw_pitch_roll(yaw, pitch, roll)
        # read 3D bounding boxes
        num_boxes = len(self.meta['boxes'][idx])
        raw_basis = np.array([self.meta['boxes'][idx][i]['basis'] for i in range(num_boxes)])
        raw_coeffs = np.array([self.meta['boxes'][idx][i]['coeffs'] for i in range(num_boxes)])
        raw_centroid = np.array([self.meta['boxes'][idx][i]['centroid'] for i in range(num_boxes)])
        if flip:
            for i in range(num_boxes):
                # get 3D corners in the world space
                corners3d = get_corners_of_bb3d_no_index(raw_basis[i],
                                                         raw_coeffs[i],
                                                         raw_centroid[i])
                # get 3D corners in the camera space
                corners3d = np.matmul(R_old, corners3d.transpose()).transpose()
                # flip x axis
                corners3d[:, 0] = -corners3d[:, 0]
                # get 3D corners back in world space
                corners3d = np.matmul(R.transpose(), corners3d.transpose()).transpose()
                # extract centroid, basis, and coeffs from 3D corners
                raw_centroid[i] = corners3d.mean(axis=0)
                b0_with_scale = (corners3d[1] - corners3d[0]) / 2
                c0 = np.linalg.norm(b0_with_scale)
                b0 = b0_with_scale / c0
                b1_with_scale = (corners3d[1] - corners3d[2]) / 2
                c1 = np.linalg.norm(b1_with_scale)
                b1 = b1_with_scale / c1
                b2_with_scale = (corners3d[1] - corners3d[5]) / 2
                c2 = np.linalg.norm(b2_with_scale)
                b2 = b2_with_scale / c2
                raw_basis[i, 0] = -b0  # flip basis 0
                raw_basis[i, 1] = b1
                # keep b2 as [0, -1, 0] to avoid numerical issues
                raw_coeffs[i] = [-c0, c1, c2]
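The extraction above assumes a particular corner ordering from get_corners_of_bb3d_no_index: corners 1 and 0 differ along basis 0, corners 1 and 2 along basis 1, and corners 1 and 5 along basis 2. One hypothetical implementation consistent with that indexing is sketched below; the actual helper in the repository may order corners differently:

```python
import numpy as np

def get_corners_of_bb3d_no_index(basis, coeffs, centroid):
    """Hypothetical sketch of the corner layout assumed above.

    Corner 1 is centroid + c0*b0 + c1*b1 + c2*b2; corner 0 flips the
    b0 term, corner 2 flips the b1 term, and corner 5 flips the b2 term,
    so (corners[1] - corners[k]) / 2 recovers each scaled basis vector.
    """
    b0 = coeffs[0] * basis[0]
    b1 = coeffs[1] * basis[1]
    b2 = coeffs[2] * basis[2]
    c = centroid
    return np.array([
        c - b0 + b1 + b2,  # 0
        c + b0 + b1 + b2,  # 1
        c + b0 - b1 + b2,  # 2
        c - b0 - b1 + b2,  # 3
        c - b0 + b1 - b2,  # 4
        c + b0 + b1 - b2,  # 5
        c + b0 - b1 - b2,  # 6
        c - b0 - b1 - b2,  # 7
    ])
```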

Looking forward to discussing this with you!

thusiyuan commented 4 years ago

Dear Song,

Thank you for your information and suggestion about our code!

We actually visualized the flipped boxes and reprojected them onto the 2D images when we developed the code, and they looked correct at that time.

Can you show an example of how the generated boxes become erroneous, so I can have a more intuitive understanding of what is wrong?

As for flipping in camera coordinates versus world coordinates, I think they are equivalent, because we also change the camera parameters accordingly when flipping the objects in world coordinates.
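The equivalence claim can be checked mechanically: a camera-space x-flip F = diag(-1, 1, 1) applied after R is, by associativity, the same as projecting the unflipped world points with adjusted extrinsics F @ R. This sketch only demonstrates that identity; whether negating the yaw under a particular Euler-angle convention reproduces F @ R exactly is a separate question:

```python
import numpy as np

# F flips the x axis in camera space.
F = np.diag([-1.0, 1.0, 1.0])

rng = np.random.default_rng(0)
R = np.linalg.qr(rng.standard_normal((3, 3)))[0]  # an arbitrary orthogonal matrix
P = rng.standard_normal(3)                        # an arbitrary world point

lhs = F @ (R @ P)  # flip the point in camera space
rhs = (F @ R) @ P  # absorb the flip into the extrinsics instead
assert np.allclose(lhs, rhs)
```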

Best, Siyuan
