google-research / mint

Multi-modal Content Creation Model Training Infrastructure including the FACT model (AI Choreographer) implementation.
Apache License 2.0

Two strange points in kinetic and manual feature extraction #62

Open ZeyuLing opened 1 year ago

ZeyuLing commented 1 year ago
1. When calculating acceleration, your implementation is:

```python
def calc_average_acceleration(positions, i, joint_idx, sliding_window, frame_time):
    current_window = 0
    average_acceleration = np.zeros(len(positions[0][joint_idx]))
    for j in range(-sliding_window, sliding_window + 1):
        if i + j - 1 < 0 or i + j + 1 >= len(positions):
            continue
        v2 = (
            positions[i + j + 1][joint_idx] - positions[i + j][joint_idx]
        ) / frame_time
        v1 = (
            positions[i + j][joint_idx]
            - positions[i + j - 1][joint_idx] / frame_time  # <-- the line in question
        )
        average_acceleration += (v2 - v1) / frame_time
        current_window += 1
    return np.linalg.norm(average_acceleration / current_window)
```

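Is this expression for `v1` right?

```python
v1 = (
    positions[i + j][joint_idx]
    - positions[i + j - 1][joint_idx] / frame_time
)
```

As written, only the second term is divided by `frame_time`. Why not divide the whole difference, as is done for `v2`?

```python
v1 = (
    positions[i + j][joint_idx]
    - positions[i + j - 1][joint_idx]
) / frame_time
```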
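To make the difference concrete, here is a minimal sketch of my own (not code from the repository). The helper names `v1_as_written` and `v1_finite_difference`, the 60 fps frame time, and the sample positions are all assumptions chosen for illustration. For a joint moving at constant velocity, the acceleration should be close to zero, but the version with the misplaced division gives a large value that depends on the absolute position.

```python
import numpy as np


def v1_as_written(p_prev, p_curr, frame_time):
    # Mirrors the operator precedence of the quoted snippet:
    # only p_prev is divided by frame_time.
    return p_curr - p_prev / frame_time


def v1_finite_difference(p_prev, p_curr, frame_time):
    # Backward difference: the whole displacement is divided by frame_time.
    return (p_curr - p_prev) / frame_time


# Hypothetical joint moving at 1 unit/s along x, sampled at 60 fps.
frame_time = 1.0 / 60.0
step = np.array([frame_time, 0.0, 0.0])
p_prev = np.array([1.0, 0.0, 0.0])
p_curr = p_prev + step
p_next = p_curr + step

v2 = (p_next - p_curr) / frame_time

acc_as_written = (v2 - v1_as_written(p_prev, p_curr, frame_time)) / frame_time
acc_expected = (v2 - v1_finite_difference(p_prev, p_curr, frame_time)) / frame_time

print("finite difference:", np.linalg.norm(acc_expected))
print("as written:       ", np.linalg.norm(acc_as_written))
```

With the division applied to the whole difference, the result stays near zero for constant velocity. With the expression as written, moving the same trajectory to a different starting position changes the reported acceleration.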
2. When calculating the manual (geometric) features, your implementation includes:

```python
pose_features.append(
    f.f_angle("neck", "root", "zero", "y_unit", [70, 110])
)
pose_features.append(
    f.f_nplane("zero", "minus_y_unit", "y_min", "rwrist", -1.2 * f.hl)
)
pose_features.append(
    f.f_nplane("zero", "minus_y_unit", "y_min", "lwrist", -1.2 * f.hl)
)
```

These three features mean: 1) The angle between the neck-hip segment and the y axis is in the range (70, 110) degrees. This is usually true when the dance includes bending over, and false on other occasions. 2) The height of each wrist is below 1.2 times the upper arm length. This is only true when the dance includes floor movements, such as squatting.

The problem is that in your test cases all three of these features are always False, so their std is 0. And you use the mean and std of the GT to normalize both the predicted and the GT manual features! So if the generated motion contains squatting or bending over, FID_g and diversity_g will be extremely large.
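A small sketch of the effect I mean, under stated assumptions: the `normalize` helper and its `eps` guard are hypothetical (I have not checked how the repository avoids dividing by zero), and the feature arrays are made-up binary values, not real data.

```python
import numpy as np


def normalize(features, mean, std, eps=1e-10):
    # eps is a hypothetical guard against division by zero.
    return (features - mean) / (std + eps)


# Rows are clips, columns are the three features discussed above.
gt = np.zeros((4, 3))    # features never fire in the ground truth
pred = np.zeros((4, 3))
pred[0, 1] = 1.0         # one generated clip contains a floor movement

# Statistics come from the ground truth only.
mean = gt.mean(axis=0)
std = gt.std(axis=0)

gt_norm = normalize(gt, mean, std)
pred_norm = normalize(pred, mean, std)

print("GT std per feature:", std)
print("largest normalized predicted value:", np.abs(pred_norm).max())
```

Because the GT std is 0 for these columns, a single active feature in the prediction is scaled by roughly `1 / eps`, and any distance computed on the normalized features is dominated by that one entry.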