tijiang13 / InstantAvatar

InstantAvatar: Learning Avatars from Monocular Video in 60 Seconds (CVPR 2023)

anim_nerf.npz #79


serizawa-04013958 commented 1 month ago

Hello, I'd like to confirm the result of your preprocessing code. When I preprocess the PeopleSnapshot dataset, the result does not seem to match anim_nerf.npz. Here is a result; since I get a good result when I use anim_nerf, I would like to ask about the difference. I used InstantAvatar/tree/master/scripts/visualize-SMPL.py for the visualization.

anim_nerf ↓
https://github.com/user-attachments/assets/4b337b02-171d-4c24-84df-f2b7579ca363

pose_optimized ↓
https://github.com/user-attachments/assets/058a1e3f-9562-4d91-87a8-81d07b767751

Thank you

tijiang13 commented 1 month ago

Hi Serizawa,

Based on the visualization you provided, it seems that the wrong intrinsic parameters may have been loaded. The Anim-NeRF version uses the provided GT intrinsics, which have a larger focal length, whereas the preprocessing pipeline we provide uses ROMP, which assumes a much smaller focal length. It's possible that the camera parameters were accidentally overwritten when you ran the preprocessing. As a result, you likely used the ROMP camera to project the Anim-NeRF poses, leading to an overly small reprojection.
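To make the effect concrete, here is a minimal sketch (all numbers are made up for illustration, not taken from ROMP or the GT calibration): with the same pose and camera distance, a smaller assumed focal length makes the person span far fewer pixels, which is what the shrunken reprojection in your second video looks like.

import numpy as np

def project(points_cam, focal, cx, cy):
    """Pinhole projection of Nx3 camera-space points to pixel coordinates."""
    K = np.array([[focal, 0.0, cx],
                  [0.0, focal, cy],
                  [0.0,  0.0, 1.0]])
    uv = (K @ points_cam.T).T
    return uv[:, :2] / uv[:, 2:3]

# two points roughly 1.7 m apart (head and feet), 4 m in front of the camera
body = np.array([[0.0, -0.85, 4.0],
                 [0.0,  0.85, 4.0]])

# small, ROMP-like focal length vs. a larger, GT-like focal length (illustrative values)
h_small = abs(np.diff(project(body, focal=500.0,  cx=540, cy=540)[:, 1])[0])
h_large = abs(np.diff(project(body, focal=1500.0, cx=540, cy=540)[:, 1])[0])
print(h_small, h_large)  # ~212 px vs ~637 px: same pose, very different projected size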

Best, Tianjian

tijiang13 commented 1 month ago

I guess the documentation can be a bit confusing. For the benchmark on PeopleSnapshot, we used the GT camera and poses to isolate the impact of inaccuracies stemming from cameras and poses. This was done to illustrate performance under controlled settings (or, let's say, what would happen with a better pose estimator, as better pose estimation papers come out every year :P). In contrast, the examples on the NeuMan data were provided to demonstrate performance in less controlled settings, where we have no prior knowledge of the cameras or poses.

serizawa-04013958 commented 1 month ago

Thank you for replying so quickly!! Let me confirm: anim_nerf.npz contains the GT camera/pose parameters, right?

So I understand that the ROMP preprocessing is not optimal for the PeopleSnapshot dataset, but I'd like to treat PeopleSnapshot like an in-the-wild dataset (NeuMan).

I'm working on a synthetic dataset based on PeopleSnapshot and would like to use the same camera/pose estimator, but ROMP currently cannot reach the same quality as the GT. Could you help me if possible?

Thank you very much.

tijiang13 commented 3 weeks ago

Hi Serizawa,

Maybe you can give 4DHuman a shot? We did some internal experiments with it before, and empirically it gives better alignment.

Best, Tianjian

serizawa-04013958 commented 3 weeks ago

Hello. So you mean anim_nerf.npz comes from the ground truth, but in your internal experiments 4DHuman gave alignment similar to the GT. Is that correct?

What I really want to ask is how to obtain an npz file as accurate as the one below. I'd be glad if you could confirm.

[screenshot]

serizawa-04013958 commented 2 weeks ago

@tijiang13 If possible, could you share the code to convert 4D-Humans output to the ROMP format? I can run 4D-Humans' sample code, but its output format is different from ROMP's...

AlecDusheck commented 1 week ago

This would be helpful!

tijiang13 commented 1 week ago

Hello Serizawa and Alec,

Sorry for the delayed reply -- I have been quite busy during the past 2 weeks and forgot to check Github regularly.

Here is the code I was using:

import cv2
import numpy as np

# Assumed to be defined by the caller: `data` is the list of per-frame
# 4D-Humans (HMR2) outputs, NUM_PERSONS / NUM_FRAMES are the sequence
# dimensions, and focal_length is the focal length you want to project with.

# process the SMPL poses
body_pose = np.zeros((NUM_PERSONS, NUM_FRAMES, 23, 3))
global_orient = np.zeros((NUM_PERSONS, NUM_FRAMES, 3))
transl = np.zeros((NUM_PERSONS, NUM_FRAMES, 3))
betas = np.zeros((NUM_PERSONS, NUM_FRAMES, 10))
for i, datum in enumerate(data):
    for j, person_id in enumerate(datum["personid"]):
        person_id = int(person_id)
        cx, cy = datum["box_center"][person_id]
        bbox_size = datum["box_size"][person_id]
        img_size = datum["img_size"][person_id]
        W, H = img_size

        # for cam_t we use pred_cam rather than pred_cam_t: convert the
        # crop-level weak-perspective camera to a full-image translation
        # for the chosen focal length
        cam_t = datum["pred_cam"][person_id]
        tz, tx, ty = cam_t
        scale = 2 / max(bbox_size * tz, 1e-9)
        tz = focal_length * scale
        tx = tx + scale * (cx - W * 0.5)
        ty = ty + scale * (cy - H * 0.5)
        cam_t = np.array([tx, ty, tz])

        # convert the pose back to SMPL (axis-angle) format
        body_pose_R = datum["body_pose"][person_id]           # (23, 3, 3) rotation matrices
        body_pose_ji = np.stack([cv2.Rodrigues(r)[0].squeeze(-1) for r in body_pose_R])
        global_orient_R = datum["global_orient"][person_id]   # (1, 3, 3)
        global_orient_ji = cv2.Rodrigues(global_orient_R[0])[0].squeeze(-1)

        body_pose[j, i] = body_pose_ji
        global_orient[j, i] = global_orient_ji
        transl[j, i] = cam_t
        betas[j, i] = datum["betas"][person_id]

# process the camera: pinhole intrinsics with the principal point at the
# image centre, identity extrinsics for every frame
img_size = data[0]["img_size"][0]
W, H = img_size
intrinsic = np.array([[focal_length, 0, W * 0.5],
                      [0, focal_length, H * 0.5],
                      [0,            0,       1]])
extrinsic = np.broadcast_to(np.eye(4), (NUM_FRAMES, 4, 4))

After the conversion you will be able to visualize the SMPL meshes using the visualize_SMPL.py script in this repo.
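In case it helps, here is a minimal sketch of how the converted arrays could be dumped to disk for a single-person sequence; the file names and key names below are only illustrative and should be matched against whatever visualize_SMPL.py and the dataloader actually read.

# Illustrative only: file/key names are assumptions, not the repo's official format.
np.savez("poses.npz",
         betas=betas[0],                  # (NUM_FRAMES, 10)
         body_pose=body_pose[0],          # (NUM_FRAMES, 23, 3), axis-angle
         global_orient=global_orient[0],  # (NUM_FRAMES, 3)
         transl=transl[0])                # (NUM_FRAMES, 3)
np.savez("cameras.npz",
         intrinsic=intrinsic,             # (3, 3)
         extrinsic=np.asarray(extrinsic), # (NUM_FRAMES, 4, 4), identity here
         height=H, width=W)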

Best, Tianjian

tijiang13 commented 1 week ago

Note: One of the good things about 4DHuman is that it lets you set the focal length to match your GT camera when the intrinsics are available. This can be particularly useful when the true focal length differs significantly from the default settings, as with ROMP. That is surprisingly common when it comes to humans (people tend to use very large focal lengths).
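For example, a minimal sketch of getting that focal length (K here is a hypothetical 3x3 GT intrinsics matrix, and the FOV/resolution values are made up):

import numpy as np

# Option 1: take the focal length straight from a known GT intrinsics matrix K
# (hypothetical variable here; square pixels assumed, so K[0, 0] ~= K[1, 1]).
focal_length = float(K[0, 0])

# Option 2: if only the vertical field of view is known, convert it instead.
fov_deg, H = 45.0, 1080  # illustrative values
focal_length = 0.5 * H / np.tan(np.radians(fov_deg) / 2)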

serizawa-04013958 commented 1 week ago

That's great!! I appreciate it!

I'll get the code working in my environment right away. Thank you so much! Let me keep this issue open for debugging.

tijiang13 commented 1 week ago

You are welcome :D

Best, Tianjian

serizawa-04013958 commented 1 week ago

Hello, let me ask a question: how do you calculate focal_length? Do you use the same equation as in hmr2.py, i.e. output['focal_length']?

Here is 4D-Humans' code:

focal_length = self.cfg.EXTRA.FOCAL_LENGTH * torch.ones(batch_size, 2, device=device, dtype=dtype)

tijiang13 commented 1 week ago

Hi Serizawa,

You can just run 4DHuman with the default hyper-parameters. The code above only illustrates how to change the focal length & SMPL parameters accordingly if you want to set the focal length to a different value.
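(If you do stay with the defaults, the focal length that the full-image projection effectively corresponds to can be recovered the way the 4D-Humans demo does; a sketch to double-check against the 4D-Humans version you are running:)

# Roughly how the 4D-Humans demo scales its crop-level default focal length
# (cfg.EXTRA.FOCAL_LENGTH, defined for a crop of cfg.MODEL.IMAGE_SIZE pixels)
# to the full image; verify against your version of the code.
focal_length = model_cfg.EXTRA.FOCAL_LENGTH / model_cfg.MODEL.IMAGE_SIZE * max(W, H)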

Best, Tianjian

serizawa-04013958 commented 1 week ago

Hello. I tried the code above, but the SMPL pose is a little bit weird... I used the model's output from HMR2 in hmr2.py; here is my code. Could you advise me? If possible, I'd like to know the details of data (datum) and the code you use to save it.

Result visualized with visualize_SMPL.py: [screenshot]

4D-Humans' own output worked well: [screenshot]

pred_smpl_params = out['pred_smpl_params']
body_pose_R = pred_smpl_params["body_pose"].detach().cpu().numpy()[0]
body_pose_ji = np.stack([cv2.Rodrigues(r)[0].squeeze(-1) for r in body_pose_R])
global_orient_R = pred_smpl_params["global_orient"].detach().cpu().numpy()[0]
global_orient_ji = cv2.Rodrigues(global_orient_R[0])[0].squeeze(-1)
body_pose[0, id] = body_pose_ji
global_orient[0, id] = global_orient_ji
transl[0, id] = cam_t
betas[0, id] = pred_smpl_params["betas"].detach().cpu().numpy()[0]

tijiang13 commented 3 days ago

I think there's a difference because 4DHuman only renders the person within the bounding box, while we project it onto the entire image. As for the misalignment, I still run the refinement as before, and that is usually easy to fix, I think.

Best, Tianjian