theEricMa / OTAvatar

This is the official repository for OTAvatar: One-shot Talking Face Avatar with Controllable Tri-plane Rendering [CVPR2023].

question about data.utils.process_camera_inv #22

Closed: Orig1n closed this issue 6 months ago

Orig1n commented 11 months ago

I'm confused by this function. Do operations like `trans[2] += -10`, `c *= 0.27`, `c[1] += 0.015`, `c[2] += 0.161`, `K[0,0] = 2985.29/700 * focal / 1050`, `K[1,1] = 2985.29/700 * focal / 1050`, and `pose[:3, 3] = pose[:3, 3]/4.0 * 2.7` have any special meaning?

import numpy as np

def process_camera_inv(translation, Rs, focals): #crop_params):

    c_list = []

    N = len(translation)
    # for trans, R, crop_param in zip(translation,Rs, crop_params):
    for idx, (trans, R, focal) in enumerate(zip(translation, Rs, focals)):

        # average the pose parameters over a short sliding window
        # (frames idx-1 .. idx+1 in the interior) for temporal smoothing
        idx_prev = max(idx - 1, 0)
        idx_last = min(idx + 2, N - 1)

        trans = np.mean(translation[idx_prev: idx_last], axis=0)
        R = np.mean(Rs[idx_prev: idx_last], axis=0)

        # why: offset the translation by -10 along z, then convert the
        # world-to-camera translation into a camera center c = -R @ t
        trans[2] += -10
        c = -np.dot(R, trans)

        # # no why
        # c = trans

        pose = np.eye(4)
        pose[:3, :3] = R

        # why: hand-tuned scale and offsets that map the camera center
        # into the world space the generator expects
        c *= 0.27
        c[1] += 0.015
        c[2] += 0.161
        # c[2] += 0.050  # 0.160

        pose[0, 3] = c[0]
        pose[1, 3] = c[1]
        pose[2, 3] = c[2]

        # focal = 2985.29
        w = 1024  # 224
        h = 1024  # 224

        # pixel-space intrinsics with the principal point at the image center
        K = np.eye(3)
        K[0, 0] = focal
        K[1, 1] = focal
        K[0, 2] = w / 2.0
        K[1, 2] = h / 2.0

        # flip the y and z axes of the camera frame: right-multiplying by
        # diag(1, -1, -1) converts between y-up/z-back and y-down/z-forward
        # camera conventions without moving the camera
        Rot = np.eye(3)
        Rot[0, 0] = 1
        Rot[1, 1] = -1
        Rot[2, 2] = -1
        pose[:3, :3] = np.dot(pose[:3, :3], Rot)

        # fix intrinsics: switch to EG3D-style normalized intrinsics, i.e.
        # focal length relative to image size (2985.29/700 ≈ 4.2647 is the
        # value EG3D uses for FFHQ) and the principal point at 0.5 in
        # [0, 1] image coordinates
        K[0, 0] = 2985.29 / 700 * focal / 1050
        K[1, 1] = 2985.29 / 700 * focal / 1050
        K[0, 2] = 1 / 2
        K[1, 2] = 1 / 2
        assert K[0, 1] == 0
        assert K[2, 2] == 1
        assert K[1, 0] == 0
        assert K[2, 0] == 0
        assert K[2, 1] == 0

        # fix_pose_orig
        pose = np.array(pose).copy()

        # why: hand-tuned rescaling of the camera distance toward the
        # radius EG3D was trained with
        pose[:3, 3] = pose[:3, 3] / 4.0 * 2.7
        # # no why
        # t_1 = np.array([-1.3651,  4.5466,  6.2646])
        # s_1 = np.array([-2.3178, -2.3715, -1.9653]) + 1
        # t_2 = np.array([-2.0536,  6.4069,  4.2269])
        # pose[:3, 3] = (pose[:3, 3] + t_1) * s_1 + t_2

        # EG3D-style 25-dim camera label: flattened 4x4 pose + 3x3 intrinsics
        c = np.concatenate([pose.reshape(-1), K.reshape(-1)])
        c_list.append(c.astype(np.float32))

    return c_list
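
For context, here is a minimal usage sketch of the function above with dummy inputs (the shapes and values are my assumptions, not from the repo). Each returned entry is a 25-dim vector: the flattened 4x4 camera-to-world pose followed by the flattened 3x3 normalized intrinsics.

    import numpy as np

    # dummy inputs (illustrative only): 5 frames with identity rotations,
    # a camera one unit along z, and a fixed pixel focal length
    N = 5
    Rs = [np.eye(3) for _ in range(N)]
    translation = [np.array([0.0, 0.0, 1.0]) for _ in range(N)]
    focals = [1050.0] * N

    c_list = process_camera_inv(translation, Rs, focals)

    c = c_list[0]
    pose = c[:16].reshape(4, 4)  # camera-to-world extrinsics
    K = c[16:].reshape(3, 3)     # normalized intrinsics
    print(pose.shape, K.shape)   # (4, 4) (3, 3)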
theEricMa commented 10 months ago

Great question! This was indeed a challenge during our research. Although the EG3D authors released their code, they did not explain the manual adjustments they made to the camera poses. This was particularly problematic because our face alignment process is completely different: for the talking-face task we crop each video using only the bounding box computed on the first frame, so subsequent frames are not aligned. Given that, working out how to model the rotation and translation was complex, especially since the EG3D camera convention can be misleading. We had no option but to tune these parameters manually for the talking-face setting.
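
To make the convention point concrete, here is a small illustration (my own sketch, not code from the repository) of the axis flip that `process_camera_inv` applies: right-multiplying a rotation by diag(1, -1, -1) flips the camera's y and z axes, converting between a y-up/z-back (OpenGL-style) and a y-down/z-forward (OpenCV-style) camera frame without moving the camera.

    import numpy as np

    flip = np.diag([1.0, -1.0, -1.0])

    # a camera with y up that looks along -z (OpenGL-style)
    R_gl = np.eye(3)

    # the same physical camera expressed with y down and z forward
    R_cv = R_gl @ flip

    # both describe a camera looking in the same world direction
    look_gl = -R_gl[:, 2]  # OpenGL cameras look along their -z axis
    look_cv = R_cv[:, 2]   # OpenCV cameras look along their +z axis
    assert np.allclose(look_gl, look_cv)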