hysts / pytorch_mpiigaze

An unofficial PyTorch implementation of MPIIGaze and MPIIFaceGaze

executing Demo Class with demo_mpiifacegaze_resnet_simple_14.yaml #32

Closed yuyaya-foifoi closed 3 years ago

yuyaya-foifoi commented 3 years ago

Again, thank you for sharing your great work. I have a few questions; it would be a great pleasure if you could reply.

  1. When executing the Demo class, where in the code are face.center (the center of the 3D facial landmark locations, I guess) and face.distance assigned?
  2. What do the constants that this vector is multiplied by mean?
  3. How is gaze, the ground truth of the dataset, created? Do you know anything about it? (I am trying to create a dataset, and anything you can tell me about it would be greatly appreciated.)

Thank you,

hysts commented 3 years ago

Hi, @yuyaya-foifoi

  1. Here and here.
  2. No special meaning. It's just the length of the gaze vectors for visualization. The gaze vectors are normalized and have unit length, so we need to scale them up to be long enough for good visualization (see the sketch after this list).
  3. You might want to read the MPIIGaze paper. I have no more knowledge than what's written in it.
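For context, here is a minimal sketch of that scaling, assuming a unit gaze vector and a face center in camera coordinates (the constant and names are illustrative, not the repo's actual code):

```python
import numpy as np

# Illustrative constant: the length (in meters) of the gaze line to draw.
AXIS_LENGTH = 0.05

gaze_vector = np.array([0.1, -0.2, -0.97])
gaze_vector /= np.linalg.norm(gaze_vector)   # model output has unit length

face_center = np.array([0.0, 0.0, 0.6])      # 3D face center in camera coords
end_point = face_center + AXIS_LENGTH * gaze_vector  # endpoint of drawn line
```

The line drawn from `face_center` to `end_point` is what you see in the demo; changing the constant only changes how long the arrow looks.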
yuyaya-foifoi commented 3 years ago

Thank you for your quick response. @hysts

About 1 and 3, thank you, I got it. About 2: my understanding was that gaze_vector is a denormalized vector, i.e., identical to 3D coordinates in the camera coordinate system. Is my understanding incorrect?

hysts commented 3 years ago

@yuyaya-foifoi

Ah, sorry. My use of the word "(de)normalize" seems to have confused you. It refers to normalizing the head pose, not to making the gaze vector a unit vector. For details on the normalization, take a look at section 4.2 of the MPIIGaze paper.
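For readers following along, here is a rough numpy sketch of the idea in section 4.2, assuming a reference point `center` in camera coordinates; the axis convention is simplified (the paper ties the x-axis to the head pose) and the value of `d_s` is illustrative:

```python
import numpy as np

def compute_normalizing_rotation(center: np.ndarray) -> np.ndarray:
    # Rotate the camera so its z-axis points at the reference point.
    z_axis = center / np.linalg.norm(center)
    # Simplified axis choice; the paper derives x from the head coordinate system.
    x_axis = np.cross(np.array([0.0, 1.0, 0.0]), z_axis)
    x_axis /= np.linalg.norm(x_axis)
    y_axis = np.cross(z_axis, x_axis)
    return np.vstack([x_axis, y_axis, z_axis])  # rows are the new camera axes

center = np.array([0.05, -0.02, 0.60])  # reference point in camera coords
d_s = 0.6                               # fixed normalized distance (illustrative)
R = compute_normalizing_rotation(center)
S = np.diag([1.0, 1.0, d_s / np.linalg.norm(center)])
M = S @ R  # conversion matrix; the image is warped accordingly
```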

yuyaya-foifoi commented 3 years ago

@hysts

Thank you. My understanding is that normalization is necessary to handle images from different cameras/people.

Let me check that my understanding so far is correct.

  1. Is the output here in the coordinates of the camera coordinate system, as described in the last sentence of section 3.2 of this paper?

  2. Scaling gaze_vector by a constant is just for visualization purposes.

I'm sorry, but I have additional questions.

  1. The distance from the origin of the camera coordinate system to the reference point (the center of the face) is set here for calculating the scaling matrix, but am I correct that this is an assumed distance between the camera and the reference point? (One that does not necessarily apply to other cases.)

  2. Can this be generalized? For example, it seems it wouldn't apply to 2×2 rotation matrices and vectors of shape (2, 1).

hysts commented 3 years ago

@yuyaya-foifoi

  1. Basically, yes. A slight difference is that in the last sentence of section 3.2 of the paper, M⁻¹ is multiplied, but here S is ignored and only R⁻¹ is multiplied, because scaling a vector doesn't change its orientation.

  2. Yes, as far as this line is concerned.

  3. You're right. It's the d_s in section 3.2 of this paper. Since it's used to normalize face images, it should be the same value as the one used to create the training data.

  4. Yes, it can be generalized to any dimension. What do you mean it won't apply to 2D cases? I'm confused. If A is an orthogonal matrix, Aᵀ = A⁻¹. So (vᵀA)ᵀ = Aᵀv = A⁻¹v, where v is a column vector. As a 2D rotation matrix is orthogonal, it holds in 2D cases too.
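A quick numerical check of that identity with a 2×2 rotation matrix and a vector of shape (2, 1):

```python
import numpy as np

theta = 0.3
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # 2D rotation, orthogonal
v = np.array([[1.0], [2.0]])                     # column vector, shape (2, 1)

lhs = (v.T @ A).T             # (vᵀA)ᵀ
rhs = np.linalg.inv(A) @ v    # A⁻¹v
assert np.allclose(lhs, rhs)  # holds because Aᵀ = A⁻¹ for orthogonal A
```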

yuyaya-foifoi commented 3 years ago

Thank you so much for your quick reply. @hysts

  1. If S is ignored, does that mean the absolute distance between the origin of the normalized camera coordinate system and the reference point is no longer fixed?

  2. Thank you, I understood.

  3. Hmm, I see... That's not a small constraint. Do you have any idea how to remove this constraint? If I multiply by M during training and ignore S during inference, I will end up with data drift, so...

  4. Thank you, I understood the transformation you shared with me. I simply thought you were asserting that the equation [image] holds. I might have missed a transpose somewhere?

hysts commented 3 years ago

@yuyaya-foifoi

1 & 3. There seems to be some misunderstanding. The same preprocess should be applied to images during training and inference, and indeed the code does so. The difference I mentioned is not part of the preprocessing, but it's a post-processing after the model has predicted the gaze vector in the normalized head coordinate system. The length of the gaze vector means nothing, so we just ignore S in the post-processing.

yuyaya-foifoi commented 3 years ago

@hysts

Thank you for explaining it so well. My questions have been cleared up.