vitoralbiero / img2pose

The official PyTorch implementation of img2pose: Face Alignment and Detection via 6DoF, Face Pose Estimation - CVPR 2021

Questions related to the prediction values and rendered results #28

Closed vujadeyoon closed 3 years ago

vujadeyoon commented 3 years ago

Dear Vítor Albiero,

Thanks for your helpful comments in the previous GitHub issues. They were a great help in understanding the img2pose paper.

I have additional questions.

  1. What exactly is the img2pose prediction value? According to Equation (2) in your paper, the 6D vector h_i consists of Euler angles and a 3D face translation vector. However, you told me that pose_pred contains rotation vectors, not Euler angles (the two representations can easily be converted into each other). I understand your comment, and indeed I cannot find any code that converts the rotation vector to Euler angles [2]. To make sure I fully understand the paper, let me ask again: is it correct that the whole proposed method (i.e. network plus post-processing) produces the global pose h_i^{img}, consisting of a 3D face translation vector and a 3D rotation vector (not Euler angles), defined with respect to the entire input image rather than the image crop B defined in Appendix A?

  2. Questions related to the rendered results. Please refer to the images in [3]; note that the image is 27_Spa_Spa_27_32.jpg from the WIDER FACE training dataset, so the model may already have seen it during training. I read the boxes, labels, and dofs from the lmdb dataset built by following your guide. I believe these values are read correctly, because both random_crop and random_clip are turned off. However, the result in [3]-b looks a little odd, which confused me. If I am reading the GT values correctly, it is impressive that the model generalizes so well from the many other GT samples and produces nice results even though this particular GT appears inaccurate. For reference, for other images obtained in the same way, both the renderings from the predictions and from the GT values looked fine, unlike [3], although I did not attach pictures.

    • Is it correct that the GT dof values in the lmdb dataset correspond to h_i^{img*} for the entire (whole) image, as in Fig. 4?
    • Which factor do you think determines the size of the face? It is difficult to understand why the rotation vector would determine the face size. Could you share your explanation?

[1] https://github.com/vitoralbiero/img2pose/issues/27#issuecomment-804506944
[2] In inference mode (i.e. test_own_images.ipynb), the proposed network applies a transform module, transform = GeneralizedRCNNTransform(min_size, max_size, image_mean, image_std), at the end. However, the transform does not include any rotation-to-Euler conversion.
[3] Rendered results for the image 27_Spa_Spa_27_32.jpg:
a) The rendered result of img2pose.
b) The rendered result using the GT values obtained in train.py with the additional code below:

vitoralbiero commented 3 years ago

Hello Sungjun Ethan Yoon,

Thank you again for your interest in our work!

  1. Yes, our model predicts rotation vectors (not Euler angles) and translation vectors in h_i^{img}. We are going to release a new version of the paper to reflect this correction. We do not convert to Euler angles inside the img2pose model; the rotation vector is only converted to Euler angles for validation on AFLW2000-3D and BIWI. The conversion from rotation vector to Euler angles is in both notebooks and in the comment you mentioned [1] (see the sketch after this list). Note that Euler angles suffer from a drawback, where the yaw is limited to (-90, 90); apart from the validation on these two datasets, we therefore prefer to use rotation vectors in our pipeline.

  2. You are right. Our model produces better predictions than much of the GT data, as you pointed out in example [3]. We attribute this to the generalization capabilities of deep networks: even with noisy labels, the model still improves over the GT data.

    • Yes, the lmdb files contain the global pose (h_i^{img}), but they also contain the local pose (h_i^{prop}). Because of augmentations, during training (data_loader_augmenter.py) we use the GT landmarks and bboxes to recalculate the GT global poses, and during validation (data_loader_lmdb.py) we use the GT local pose to obtain the GT global pose. So both data loaders output poses relative to the entire image (h_i^{img}), and you are correctly obtaining the GT pose in [3b].
    • What determines the size of the face is the tz component of the translation vector. You can easily test how this affects the face size by changing this value (pose_pred[5]), as illustrated in the sketch after this list. If you decrease the tz component, you will see that the face gets larger, as it is now closer to the camera.
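
To make both points concrete, here is a minimal sketch (not the repo's own code): the [rx, ry, rz, tx, ty, tz] pose layout, the example values, and the SciPy Euler convention are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.transform import Rotation

# Hypothetical predicted 6DoF pose: rotation vector (axis-angle) + translation.
pose_pred = np.array([0.10, -0.45, 0.05,   # rx, ry, rz (rotation vector)
                      0.02, -0.10, 1.20])  # tx, ty, tz (translation)

# Rotation vector -> Euler angles (done only for AFLW2000-3D / BIWI validation);
# the 'xyz' order used here is an assumption for illustration.
euler_deg = Rotation.from_rotvec(pose_pred[:3]).as_euler('xyz', degrees=True)
print('Euler angles (deg):', euler_deg)

# The face size is governed by tz: decreasing tz moves the face closer to the
# camera, so the rendered face gets larger (and vice versa).
pose_closer = pose_pred.copy()
pose_closer[5] *= 0.5  # halving tz roughly doubles the apparent face size
```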

I hope this helps clear your questions.

lucaskyle commented 3 years ago

Nice work, nice paper!

Here is my question. I checked the local-pose-to-global-pose processing, and it seems you did not consider any camera distortion for the WIDER FACE dataset; indeed, the dataset does not provide any such information.

I tested this a lot: running local_pose_to_global_pose with and without distortion information gives quite different results.

If the face position in the image, the camera intrinsics, and the distortion information matter a lot (meaning that without this information you can hardly produce correct GT annotations), how can we compare against the wiki_test_dataset?

vitoralbiero commented 3 years ago

Hello @lucaskyle, thanks for your interest in our work!

We don't take any type of distortion into consideration when converting the poses. In our tests, we were able to obtain reliable GT without adding camera distortion, except for some outliers. Could you provide examples where local_pose_to_global_pose worked differently depending on the distortion? Also, what dataset are you referring to as wiki_test_dataset?

Thanks!

lucaskyle commented 3 years ago

I understand. When using the solvePnP method, there is a camera distortion matrix as an optional input; if the input images are perfectly undistorted, you don't have to worry about it. But when you use WIDER FACE data to train head pose, I don't see any step that undistorts the images.
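
For reference, a minimal sketch of that optional distortion input with OpenCV's solvePnP; the 3D model points, 2D landmarks, and intrinsics below are placeholders, not values from the repo.

```python
import cv2
import numpy as np

# Hypothetical rigid 3D face model points (nose tip, chin, eye corners, mouth
# corners) and matching detected 2D landmarks; all values are placeholders.
object_points = np.array([
    [0.0,    0.0,    0.0],    # nose tip
    [0.0,   -63.6,  -12.5],   # chin
    [-43.3,  32.7,  -26.0],   # left eye outer corner
    [43.3,   32.7,  -26.0],   # right eye outer corner
    [-28.9, -28.9,  -24.1],   # left mouth corner
    [28.9,  -28.9,  -24.1],   # right mouth corner
], dtype=np.float64)
image_points = np.array([
    [359, 391], [399, 561], [337, 297],
    [513, 301], [345, 465], [453, 469],
], dtype=np.float64)

# Simple pinhole intrinsics derived from the image size (no real calibration).
w, h = 1024, 768
camera_matrix = np.array([[w, 0, w / 2],
                          [0, w, h / 2],
                          [0, 0, 1]], dtype=np.float64)

# This is the optional distortion input being discussed: with no calibration
# data it is usually None (or zeros); a calibrated camera would supply its
# k1..k3, p1, p2 coefficients here.
dist_coeffs = None

ok, rvec, tvec = cv2.solvePnP(object_points, image_points,
                              camera_matrix, dist_coeffs)
print('rotation vector:', rvec.ravel(), 'translation:', tvec.ravel())
```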

Training: face landmarks ---> solvePnP (should consider distortion) ---> get HP_local ---> GT HP_global.
Testing: model results ---> HP_global ---> get HP_local (should consider GT distortion) ---> compare vs. GT HP_local (coming from the landmarks).

WIDER FACE does not offer any camera distortion information, so we cannot get very accurate HP_local. Also, BIWI just crops every face image out of a larger image (I guess), and we do not know the camera information there either.

I think neither the training data nor the testing data is quite reliable without considering distortion, because they do not come from the same camera.

vitoralbiero commented 3 years ago

I understand the pipeline you are suggesting, but unfortunately we do not have the camera distortion information to do that. However, I think that even without this information our annotations are reliable enough: when we tested on AFLW2000-3D and BIWI, we obtained SoTA predictions.

When I asked if you have an example where distortion is affecting the GT, I meant a visual example, where you were able to add the camera distortion parameters and get a better GT.

I think when you say wiki dataset you mean the BIWI dataset. For BIWI and AFLW2000-3D, we use the provided GT Euler angles for the rotation comparison, as other papers also do. For AFLW2000-3D, we use the provided landmarks to get the GT translation; most other papers do not predict translation and therefore do not have this comparison.

lucaskyle commented 3 years ago

Thank you for your explanation. I understand now.