princeton-vl / pytorch_stacked_hourglass

Pytorch implementation of the ECCV 2016 paper "Stacked Hourglass Networks for Human Pose Estimation"
BSD 3-Clause "New" or "Revised" License

How are the parameters of the dataset calculated? #19

Closed williamdwl closed 3 years ago

williamdwl commented 3 years ago

How are the dataset parameters such as center, torsoangle, normalize, and scale obtained? Were they calculated from the body-part (joint) annotations?

crockwell commented 3 years ago

If I recall correctly, these were calculated as a function of the ground-truth keypoints, before being written into the h5 files. I'm not sure about the exact details, but center is a function of the average joint location (perhaps based on the extreme left/right joints, etc.), and scale is, I believe, based on the person's height.

This repo actually performs these calculations, and I think the original calculation of the hourglass paper is pretty similar.

Torsoangle is not used, and you can probably just ignore this.
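For reference, here is a minimal sketch of how center and scale could be derived from the ground-truth keypoints, assuming the MPII-style convention in which scale is the person's height divided by 200 (the same factor appears as h = 200 * scale in get_transform below). The function name and the bounding-box heuristic are illustrative assumptions, not the actual preprocessing used to build the h5 files.

import numpy as np

def estimate_center_scale(keypoints, visible):
    # Illustrative only: rough center/scale from annotated joints.
    # keypoints: (K, 2) array of (x, y) joint coordinates in the original image
    # visible:   (K,) boolean mask of annotated joints
    pts = keypoints[visible]
    x_min, y_min = pts.min(axis=0)
    x_max, y_max = pts.max(axis=0)
    # Center of the tight bounding box around the annotated joints
    center = np.array([(x_min + x_max) / 2.0, (y_min + y_max) / 2.0])
    # MPII-style scale: person height relative to 200 px
    scale = (y_max - y_min) / 200.0
    return center, scale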

williamdwl commented 3 years ago

@crockwell, thank you for your response! I tried to calculate the center following your idea, by averaging the extreme joint locations ((left, right), (up, down), or (left, right, up, down)), but the result does not look correct. Why is the center's x an integer while its y has a decimal part? Could you upload the calculation file? I'm quite puzzled, lol! I will look at the repo later.

I also have another question: it is hard for me to understand the kpt_affine() and transform() functions in img.py. I understand that the part coordinates need to be rescaled, but how does the mat_mask parameter accomplish this? How are the coordinates rescaled, and what is the principle behind the transformation? Looking forward to your reply, thanks!

williamdwl commented 3 years ago
==================== img.py ====================
import numpy as np


def get_transform(center, scale, res, rot=0):
    # Generate transformation matrix mapping original-image coordinates to a
    # res[0] x res[1] crop of side h = 200 * scale centered on `center`
    h = 200 * scale
    t = np.zeros((3, 3))
    t[0, 0] = float(res[1]) / h
    t[1, 1] = float(res[0]) / h
    t[0, 2] = res[1] * (-float(center[0]) / h + .5)
    t[1, 2] = res[0] * (-float(center[1]) / h + .5)
    t[2, 2] = 1
    if not rot == 0:
        rot = -rot  # To match direction of rotation from cropping
        rot_mat = np.zeros((3, 3))
        rot_rad = rot * np.pi / 180
        sn, cs = np.sin(rot_rad), np.cos(rot_rad)
        rot_mat[0, :2] = [cs, -sn]
        rot_mat[1, :2] = [sn, cs]
        rot_mat[2, 2] = 1
        # Need to rotate around center
        t_mat = np.eye(3)
        t_mat[0, 2] = -res[1] / 2
        t_mat[1, 2] = -res[0] / 2
        t_inv = t_mat.copy()
        t_inv[:2, 2] *= -1
        t = np.dot(t_inv, np.dot(rot_mat, np.dot(t_mat, t)))
    return t


def transform(pt, center, scale, res, invert=0, rot=0):
    # Transform pixel location to different reference
    t = get_transform(center, scale, res, rot=rot)
    if invert:
        t = np.linalg.inv(t)
    new_pt = np.array([pt[0], pt[1], 1.]).T  # homogeneous coordinates
    new_pt = np.dot(t, new_pt)
    return new_pt[:2].astype(int)


def kpt_affine(kpt, mat):
    # Apply a 2x3 affine matrix to an array of (x, y) keypoints
    kpt = np.array(kpt)
    shape = kpt.shape
    kpt = kpt.reshape(-1, 2)
    # Append a column of ones (homogeneous coordinates), then multiply by mat^T
    return np.dot(np.concatenate((kpt, kpt[:, 0:1] * 0 + 1), axis=1), mat.T).reshape(shape)
=================================================

======================== dp.py =================
# 2x3 affine matrices (top two rows of the 3x3 transform from get_transform)
mat_mask = utils.img.get_transform(center, scale, (self.output_res, self.output_res), aug_rot)[:2]
mat = utils.img.get_transform(center, scale, (self.input_res, self.input_res), aug_rot)[:2]
# Warp the cropped image to input_res x input_res and scale pixel values to [0, 1]
inp = cv2.warpAffine(cropped, mat, (self.input_res, self.input_res)).astype(np.float32) / 255
# Map the keypoint coordinates into the output_res x output_res heatmap grid
keypoints[:, :, 0:2] = utils.img.kpt_affine(keypoints[:, :, 0:2], mat_mask)
================================================

crockwell commented 3 years ago

Unfortunately I don't have access to the calculation file. If one coordinate is a float, I'm guessing it's normalized by image size rather than given in pixel values. My point was that others have used the same dataset and similar center/scale parameters, so your best bet is probably to take some time and examine those.

Yeah, these functions are not necessarily easy to understand or explain, unfortunately :/ I'd suggest using pdb and stepping through them to figure out exactly how they work. Again, other repos in vision / pose estimation use similar functions for cropping and image manipulation, so you could look there for inspiration as well.
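As an illustration of what those functions do (not code from the repo), here is a small sketch that composes get_transform and kpt_affine from the img.py snippet above with made-up center/scale values. It shows that mat_mask simply maps original-image coordinates into the 64x64 heatmap grid, which lands at exactly 1/4 of the corresponding 256x256 crop coordinates:

import numpy as np
from utils.img import get_transform, kpt_affine

# Toy values: a person centered at (300, 250) whose bounding box is ~400 px tall,
# so scale = 400 / 200 = 2.0 (get_transform uses h = 200 * scale as the crop size).
center, scale = np.array([300.0, 250.0]), 2.0

mat = get_transform(center, scale, (256, 256))[:2]       # original -> 256x256 input crop
mat_mask = get_transform(center, scale, (64, 64))[:2]    # original -> 64x64 heatmap grid

pt = np.array([[340.0, 200.0]])                          # one joint in original-image pixels
print(kpt_affine(pt, mat))       # [[153.6  96. ]] in the 256x256 crop
print(kpt_affine(pt, mat_mask))  # [[ 38.4  24. ]] in the 64x64 grid (exactly 1/4 of the above)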

williamdwl commented 3 years ago

I can roughly see that the part coordinates in the 256x256 image are converted to 64x64 coordinates by a linear (affine) transformation, and that the inverse transformation can convert the 64x64 part coordinates back to 256x256 coordinates. But in test.py, why do we first multiply the 64x64 part coordinates by 4 before passing them to kpt_affine? I found that the deviation between the output and the original 256x256 coordinates is quite large, and the 64x64 part coordinates obtained from the network are off by 4 or 5 pixels each time.

[screenshots of the relevant test.py code and the observed coordinate outputs were attached here]

crockwell commented 3 years ago

Yeah, there is a slight adjustment during post-processing. You may find this (or some other closed issues) useful for more information: https://github.com/princeton-vl/pytorch_stacked_hourglass/issues/15. To summarize, if I recall correctly, the adjustment compensates for the 64x64 preprocessing, which the code tries to balance out during post-processing.
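For context, a common form of this adjustment in hourglass-style post-processing (shown here as an illustrative sketch, not copied from test.py; the function name is an assumption) is to nudge each argmax location a quarter pixel toward the neighboring heatmap bin with the higher response before scaling back up to the input resolution:

import numpy as np

def refine_peak(heatmap, x, y):
    # Quarter-pixel refinement of an (x, y) argmax location on a 64x64 heatmap.
    # Illustrative only; the repo's test.py may differ in its exact details.
    h, w = heatmap.shape
    dx = dy = 0.0
    if 0 < x < w - 1:
        dx = 0.25 * np.sign(heatmap[y, x + 1] - heatmap[y, x - 1])
    if 0 < y < h - 1:
        dy = 0.25 * np.sign(heatmap[y + 1, x] - heatmap[y - 1, x])
    # Shift toward the larger neighbor; multiplying by 4 afterwards maps 64x64 -> 256x256
    return x + dx, y + dy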