lelechen63 / ATVGnet

CVPR 2019

How is head pose taken into account for VGnet #17

Open pcgreat opened 5 years ago

pcgreat commented 5 years ago

In the LRW dataset, speakers move significantly while they are talking, so unless you frontalized each frame (which I assume you didn't?), your frames (the ground truth of VGnet) should include various head poses. However, among the inputs of VGnet I don't see any that contain head-pose information. If I understand correctly, VGnet has 3 inputs: example_frame, example_landmark and fake_landmarks. Both example_landmark and fake_landmarks are normalized, so they carry neither the speaker's identity information nor his/her head pose; example_frame is a single still frame, which cannot explain the head poses of a whole sequence either. Since none of the VGnet inputs contains head-pose info, I don't understand why VGnet fits moving heads in the LRW dataset so well. Can you explain why that works? Thanks

lelechen63 commented 5 years ago

Hi there, we do not consider head movement in this paper. During training, we decouple ATnet and VGnet to sidestep the head-movement problem. VGnet does have the ability to output images under different head poses; however, our ATnet can only output front-view landmarks. When we prepare the dataset for ATnet, we use an affine transformation to correct the head pose. You can check this process in demo.py.
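
(For readers following along: below is a minimal sketch of the kind of eye/nose-based similarity alignment described above, assuming base_68.npy, mentioned later in this thread, stores the 68 template points. The helper name align_to_template, the anchor indices, and the 163x163 output size, taken from the 0~163 landmark range discussed below, are illustrative, not the repo's actual demo.py code.)

import numpy as np
import cv2

TEMPLATE = np.load("base_68.npy").astype(np.float32)  # assumed (68, 2) template landmarks
ANCHOR_IDX = [36, 39, 42, 45, 30]                     # eye corners and nose tip (dlib 68-point indexing)

def align_to_template(frame, landmarks, out_size=(163, 163)):
    """Warp a frame so its eye/nose landmarks land on the template positions."""
    src = landmarks[ANCHOR_IDX].astype(np.float32)
    dst = TEMPLATE[ANCHOR_IDX]
    # similarity transform: rotation + uniform scale + translation
    M, _ = cv2.estimateAffinePartial2D(src, dst)
    warped = cv2.warpAffine(frame, M, out_size)
    # carry the full landmark set through the same transform
    warped_lmk = cv2.transform(landmarks[None].astype(np.float32), M)[0]
    return warped, warped_lmk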

pcgreat commented 5 years ago

Yeah, I saw the Procrustes-transform code for ATnet, which makes perfect sense. But I might not have expressed myself clearly, so let me try to rephrase the question.

I understand how ATnet handles head movement, but what I don't understand is how VGnet deals with it.

The target of VGnet is the ground-truth frames in LRW, which contain various poses, but the input of VGnet contains only front-view features. In other words (please correct me if I am wrong), the training process of VGnet is basically

RGB_frames_with_different_poses = VGNet(RGB_example_frame, example_landmark, landmarks_with_front_view)

And if that's true, I am curious: if the input of VGnet doesn't contain pose information, how does the output figure out which pose to generate?

lelechen63 commented 5 years ago

When we train VGnet, we use the ground-truth landmarks, which contain all of the head-movement information. This is the key to decoupling the training of ATnet and VGnet. If we trained end-to-end, we would have to use the outputs of ATnet as the input of VGnet, and then we would have this pose-uncertainty problem. Since VGnet is trained with ground-truth landmarks covering various poses, it can handle different pose angles, but our ATnet can only yield front-view landmarks.
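
(To make the decoupling concrete, here is a toy sketch; the module sizes and names are made up and are not the paper's architecture. During training, VGnet is conditioned on ground-truth landmarks that still carry pose, while ATnet is fit separately to frontalized landmarks; only at test time are the two chained.)

import torch
import torch.nn as nn

# Toy stand-ins just to show which tensors each stage sees during training.
atnet = nn.GRU(input_size=28, hidden_size=136, batch_first=True)  # audio features -> landmark sequence
vgnet = nn.Linear(136 + 136 + 3 * 64 * 64, 3 * 64 * 64)           # (gt_lmk, example_lmk, example_img) -> frame

def vgnet_step(example_img, example_lmk, gt_lmk, gt_img):
    # gt_lmk comes straight from the dataset, so head-pose information is still in it.
    inp = torch.cat([gt_lmk, example_lmk, example_img.flatten(1)], dim=1)
    return nn.functional.l1_loss(vgnet(inp), gt_img.flatten(1))

def atnet_step(audio_feats, frontal_lmk_seq):
    # ATnet only ever regresses frontalized landmarks; it never has to explain pose.
    pred_seq, _ = atnet(audio_feats)
    return nn.functional.mse_loss(pred_seq, frontal_lmk_seq)

# At test time the stages are chained (ATnet -> VGnet), which is why the generated
# video stays roughly front-facing even though VGnet itself can render other poses.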

pcgreat commented 5 years ago

I see. That makes perfect sense! Have you thought about adding another network that transforms seq_of_front_view_landmark -> seq_of_landmark_with_poses and placing it between ATnet and VGnet? Then it might be able to generate video with different poses. Sounds like an interesting extension to your work, haha.
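
(For what it's worth, a toy sketch of the kind of in-between module being suggested here; it is entirely hypothetical and the layer sizes are invented: a sequence model that re-injects a per-frame pose code into ATnet's frontalized landmarks before they reach VGnet.)

import torch
import torch.nn as nn

class PoseInjector(nn.Module):
    """Hypothetical: frontal landmark sequence + pose-code sequence -> posed landmark sequence."""
    def __init__(self, lmk_dim=136, pose_dim=6, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(lmk_dim + pose_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, lmk_dim)

    def forward(self, frontal_lmk_seq, pose_seq):
        # pose_seq could be per-frame rotation/translation parameters from any source
        h, _ = self.rnn(torch.cat([frontal_lmk_seq, pose_seq], dim=-1))
        return frontal_lmk_seq + self.out(h)  # predict a residual over the frontal landmarks

# posed_lmks = PoseInjector()(atnet_landmarks, pose_trajectory)  # then feed posed_lmks to VGnet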

lelechen63 commented 5 years ago

I was working on this over the summer and it is already in submission. Currently, we can disentangle the head movement from the audio-correlated information. Lol, thanks for the suggestion!

pcgreat commented 5 years ago

Sorry to bother you again, but I am still a little confused about the preprocessing for ATnet and VGnet. I didn't find explicit code for preprocessing the training data for ATnet and VGnet in lrw_data.py, so can you confirm how the landmarks are preprocessed for both networks?

In the training phase of ATnet (my guess),

roi, landmark = crop_image(img_path) # first step of preprocessing
dst=similarity_transform(roi) # do we warp image frame by frame in training, or only warp the first frame of a video?
shape = dlib_detector(dst) # here the shape is the landmark of warped image
shape = normLmarks(shape) # use procrustes transform to frontalize landmark
# now shape is a 136 dim vector, and its scale is -0.2~0.2

Then I am not sure about the preprocessing procedure for VGnet, especially how the ground-truth landmarks are used. Here is my guess for the training-phase preprocessing in VGnet:

roi, landmark = crop_image(img_path) 
dst=similarity_transform(roi) # do we warp in VGnet as well?
shape = dlib_detector(dst) 
# here shape is the ground truth landmark, and its scale is 0~163. 

Does shape need to go through normLmarks()? If yes, then the head-pose info is probably lost through the normalization; if no, then the ground-truth landmarks have a different scale compared to the output of ATnet. Or is there another normalization function that brings the scale of the ground-truth landmarks to -0.2~0.2 while still keeping the head pose?

lelechen63 commented 5 years ago

I see. That is a good question. I reread my code: normLmarks() will not remove the movements, so when we train ATnet we do have to cope with the head-pose movements. Since we use PCA, it moderates the movements a little bit. But in this paper we do not remove the head movement, and we do not provide a condition input for modeling it. We discussed this problem in the paper: we think ATnet suffers less from the movement because landmarks are much sparser and easier to model, which is an advantage of using landmarks to moderate the movement. If we directly mapped audio/text to images with head movement, it would be hard to model; in landmark space we can still output reasonable results. Sorry about the confusion.
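
(As a rough illustration of the "PCA moderates the movements" point, not the repo's actual PCA code: projecting the flattened 136-dim landmarks onto a small number of principal components and reconstructing discards some of the variance, including part of what residual head motion contributes.)

import numpy as np
from sklearn.decomposition import PCA

def pca_moderate(lmk_seq, n_components=20):
    """lmk_seq: (T, 136) flattened landmarks; returns the low-rank reconstruction."""
    pca = PCA(n_components=n_components)
    coeffs = pca.fit_transform(lmk_seq)      # (T, n_components)
    return pca.inverse_transform(coeffs)     # (T, 136), with minor modes of variation removed

# e.g. pca_moderate(np.random.randn(100, 136)).shape == (100, 136)

In practice the PCA basis would be fit on the whole training set rather than a single sequence.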

pcgreat commented 5 years ago

That makes sense. Let me make sure I understand this. So during the training of VGnet, each frame is

(1) cropped via roi, landmark = crop_image(img_path);
(2) then roi is warped to the template (base_68.npy) with a similarity transform, so that the eyes and nose are fixed to certain points, i.e. dst = similarity_transform(roi);
(3) the landmarks of dst are detected via shape = dlib_detector(dst);
(4) the landmarks are normalized via shape = normLmarks(shape), which normalizes them but still keeps the pose info.

If that process is done frame by frame, the processed ground-truth frames will likely suffer from a "zoom-in-and-out" effect due to occasional misalignment of the facial landmarks. But when I test the pretrained model, I don't see that problem. So I am wondering whether you applied the above process to each video frame by frame, or used some additional technique to stabilize the video?

lelechen63 commented 5 years ago

Hi, sorry for the late reply. If I am right, the zoom-in-and-out is caused by blinking, since we align the images based on the eye landmarks. If you are using the GRID dataset, you can use the same crop coordinates for a whole video to avoid this problem, since the speakers do not move their heads. If you are working on the LRW dataset, my current solution is to warp the images to two different templates (one with the eyes open and one with the eyes closed). However, this solution only moderates the problem.
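
(A rough sketch of the two-template workaround; the threshold, template file names, and helper names are hypothetical. The idea is to pick the eyes-open or eyes-closed template per frame using the standard eye-aspect-ratio measure, then warp to whichever was chosen.)

import numpy as np

def eye_aspect_ratio(eye):
    # eye: six points of one eye (dlib indices 36-41 or 42-47); ratio of vertical to horizontal opening
    v = np.linalg.norm(eye[1] - eye[5]) + np.linalg.norm(eye[2] - eye[4])
    h = np.linalg.norm(eye[0] - eye[3])
    return v / (2.0 * h)

def pick_template(landmarks, open_tpl, closed_tpl, thresh=0.2):
    """Choose the alignment template for one frame based on how open the eyes are."""
    ear = 0.5 * (eye_aspect_ratio(landmarks[36:42]) + eye_aspect_ratio(landmarks[42:48]))
    return open_tpl if ear > thresh else closed_tpl

# open_tpl = np.load("base_68_open.npy")      # hypothetical eyes-open template
# closed_tpl = np.load("base_68_closed.npy")  # hypothetical eyes-closed template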

pcgreat commented 5 years ago

Thanks so much, those answers are helpful!

jixinya commented 5 years ago

Hi, I'm wondering what "The NormLmarks() will not remove the movements." means. Since we first use an affine transformation to fix the positions of the eyes and nose, and ATnet can only output front-view landmarks, I would guess that normLmarks() frontalizes the landmarks and removes identity information, and should therefore remove the movements as well. Besides, when I use landmarks that are not frontalized as input to ATnet (with the pretrained ATnet weights from the Google Drive link), I always get a frontalized landmark sequence.

lelechen63 commented 5 years ago

ATnet cannot output any head movements; it can only yield frontalized faces. I checked the code: when I test VGnet, I also use the smoothed landmarks processed by the normLmark() function (dataset.py line 365). And the testing results show that I can output different angles (see the ablation study in the original paper).
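
(For readers wondering what "smoothed landmarks" might look like in practice, here is a generic temporal-smoothing sketch, a centered moving average over the landmark sequence; it is illustrative only and not the dataset.py code.)

import numpy as np

def smooth_landmarks(lmk_seq, window=5):
    """lmk_seq: (T, 68, 2) landmark sequence; returns a (T, 68, 2) moving-average-smoothed copy."""
    pad = window // 2
    padded = np.pad(lmk_seq, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    # smooth each coordinate's time series independently
    return np.apply_along_axis(lambda t: np.convolve(t, kernel, mode="valid"), 0, padded)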

jixinya commented 5 years ago

Thanks! There's one more question. Did you use the method from 'Talking-Face-Landmarks-from-Speech' to frontalize the landmarks for ATnet? Because when I try normLmark() in demo.py to process the data, I always get the same result, so I'm not sure how to frontalize landmarks when we only have 2D information.

lelechen63 commented 5 years ago

Yes, that is another paper from my labmate.

ssinha89 commented 4 years ago

@lelechen63 VGnet is trained using ground-truth 2D landmarks which are Procrustes-aligned and normalized to the range -0.2 to 0.2. I have seen that lrw_data.py includes lrw_gt_prepare(), which in turn reads the file norm.npy for each video frame. From the code in lrw_data.py it is unclear how norm.npy is generated. The normLmarks() function in the code returns the same landmarks for any input image, so as per my understanding it could not have been used to generate the normalized landmarks for training vgnet.py. Then how are the norm.npy files generated? Could you kindly share the data-preparation code for VGnet training, for either the GRID or the LRW dataset?

tlatlbtle commented 4 years ago

Hi, I wonder: when there is only one face in a single frame (len(lmarks.shape) == 2), will normLmarks always output the same result? I mark the related lines in your code with "#".


# normLmarks as it appears in the repo: deepcopy comes from the copy module and
# procrustes is a standard Procrustes alignment (e.g. scipy.spatial.procrustes);
# MSK, SK, S and ms_img are module-level constants (mean shape, shape basis,
# projection matrix, reference landmarks) defined elsewhere in the same file.
def normLmarks(lmarks):
    norm_list = []
    idx = -1
    max_openness = 0.2
    mouthParams = np.zeros((1, 100))
    mouthParams[:, 1] = -0.06
    tmp = deepcopy(MSK)
    tmp[:, 48*2:] += np.dot(mouthParams, SK)[0, :, 48*2:]
    open_mouth_params = np.reshape(np.dot(S, tmp[0, :] - MSK[0, :]), (1, 100))

    if len(lmarks.shape) == 2:
        lmarks = lmarks.reshape(1, 68, 2)
    for i in range(lmarks.shape[0]):
        # Procrustes-align each frame's landmarks to the reference shape ms_img
        mtx1, mtx2, disparity = procrustes(ms_img, lmarks[i, :, :])
        mtx1 = np.reshape(mtx1, [1, 136])
        mtx2 = np.reshape(mtx2, [1, 136])
        norm_list.append(mtx2[0, :])
    pred_seq = []
    init_params = np.reshape(np.dot(S, norm_list[idx] - mtx1[0, :]), (1, 100))
    for i in range(lmarks.shape[0]):
        params = np.reshape(np.dot(S, norm_list[i] - mtx1[0, :]), (1, 100)) - init_params - open_mouth_params
######## When a single frame is passed (the len(lmarks.shape) == 2 branch), norm_list has ########
######## one entry, so norm_list[idx] with idx = -1 equals norm_list[i]; subtracting      ########
######## init_params cancels the frame's own term, and "params" will always be equal to   ########
######## (-open_mouth_params).                                                            ########
        predicted = np.dot(params, SK)[0, :, :] + MSK
        pred_seq.append(predicted[0, :])
    return np.array(pred_seq), np.array(norm_list), 1