Hi @minsu1206, what you tried looks correct. If you want to compare to RAC without re-training, we provided model checkpoints that produced the numbers in the paper here.
To answer your questions, the root pose can be found here:
logdir/human-48-category-comp/export_0000/fg/motion.json
It stores the root-to-camera transformation as data["field2cam"].
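If it helps, here is a minimal sketch of reading it (this assumes field2cam holds per-frame root-to-camera matrices; the exact nesting may differ, so inspect the file first):

```python
# Hedged sketch: assumes motion.json stores per-frame root-to-camera transforms
# under data["field2cam"]; the exact nesting/shape may differ, so inspect it first.
import json
import numpy as np

with open("logdir/human-48-category-comp/export_0000/fg/motion.json") as f:
    data = json.load(f)

field2cam = data["field2cam"]           # root-to-camera transformations
print(type(field2cam), len(field2cam))  # check how frames are indexed
# e.g., if it is a list of flattened 4x4 matrices, frame 0 would be:
# extrinsics_0 = np.asarray(field2cam[0]).reshape(4, 4)
```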
The wrinkle issue is more complicated. It has to do with the frequency of the NeRF positional encoding and how the eikonal loss is enforced. In short, either a high frequency or a low eikonal weight can cause this issue. The recent lab4d-PPR branch has it fixed, but we haven't verified the fix with lab4d-RAC.
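For reference (a generic sketch, not lab4d code), the eikonal term penalizes the SDF gradient norm deviating from 1; with a low weight, a high-frequency positional encoding is free to produce wrinkled surfaces:

```python
import torch

def eikonal_loss(grad_sdf: torch.Tensor) -> torch.Tensor:
    """Generic eikonal regularizer: encourage the SDF gradient norm to be 1.

    grad_sdf: (N, 3) gradients of the signed distance field w.r.t. sampled 3D points.
    """
    return ((grad_sdf.norm(dim=-1) - 1.0) ** 2).mean()
```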
Moving forward, in terms of evaluation, we have an evaluation script in the PPR branch, which is compatible with lab4d and will be merged.
Let me know if you have more questions.
Thank you for your answer.
Now I am trying to use the evaluation script in the PPR branch, but I still get strange results.
I can't extract meshes using the pretrained checkpoints from the RAC GitHub due to some options; I guess they are BANMo flags that are not compatible with Lab4D.
I also tried the evaluation script in the PPR branch, but the result is quite strange.
To reproduce this issue, I describe what I've done below.
(1) data['field2cam'] only contains the extrinsic matrix, and lab4d/export doesn't produce a "camera.json" (note that I just followed tutorial 4). Although the evaluation script requires the intrinsic matrix, this matrix is not involved in the evaluation process, so I ignored it.
(2) I set pred_mesh_paths = List[path of extracted $seqname-mesh-*.obj] and pred_camera_paths = List[path of extracted $seqname-cam-*.txt]. I saved $seqname-cam-{frame number:05d}.txt from data['field2cam'] in advance, and also saved $seqname-mesh-{frame number:05d}.obj from the extracted meshes in advance.
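For reference, roughly how the per-frame camera files can be dumped (a hypothetical sketch; it assumes field2cam can be indexed per frame and reshaped to a 4x4 matrix, which may not match the actual layout):

```python
# Hypothetical helper: dump data["field2cam"] to $seqname-cam-{frame:05d}.txt files.
# Assumes field2cam[i] can be reshaped to a 4x4 extrinsic matrix per frame.
import json
import numpy as np

seqname = "T_swing1"  # assumed sequence prefix
with open("export_0020/fg/motion.json") as f:  # assumed export path
    field2cam = json.load(f)["field2cam"]

for i, mat in enumerate(field2cam):
    np.savetxt("%s-cam-%05d.txt" % (seqname, i), np.asarray(mat).reshape(4, 4))
```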
The changes below to compute_metrics.py reflect these paths.
# lab4d/projects/ppr/eval/compute_metrics.py ln61 ~
if args.pred_prefix == "": ...
else:
    pred_mesh_paths = sorted(glob.glob("%s/%s-mesh-*.obj" % (args.testdir, args.pred_prefix)))  # What I changed (sorted so mesh/camera files pair up by frame)
    pred_camera_paths = sorted(glob.glob("%s/%s-cam-*.txt" % (args.testdir, args.pred_prefix)))  # What I changed
...
And this is the command which I used.
python projects/ppr/eval/compute_metrics.py --testdir ../rac_result/human-48-skel-soft/export_0020 --gt_seq T_swing-1 --pred_prefix T_swing1 --fps 30
And this is the result
Jitting Chamfer 3D
Loaded JIT 3D CUDA chamfer distance
found 150 groune-truth meshes
/workspace/lab4d/projects/ppr/eval/eval_utils.py:83: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:245.)
all_verts_pred = torch.tensor(all_verts_pred, device=device, dtype=torch.float32)
ICP iteration 0: mean/max rmse=6.50e-02/6.50e-02; mean relative rmse=1.00e+00
ICP iteration 1: mean/max rmse=2.32e-02/2.32e-02; mean relative rmse=6.43e-01
ICP iteration 2: mean/max rmse=1.72e-02/1.72e-02; mean relative rmse=2.60e-01
ICP iteration 3: mean/max rmse=1.52e-02/1.52e-02; mean relative rmse=1.17e-01
...
Frame 0: CD=51.90cm, f@10cm=12.9%, f@5cm=3.2%, f@2cm=0.4%
Frame 1: CD=51.52cm, f@10cm=12.9%, f@5cm=3.4%, f@2cm=0.5%
Frame 2: CD=51.33cm, f@10cm=12.9%, f@5cm=3.5%, f@2cm=0.6%
Frame 3: CD=51.40cm, f@10cm=12.7%, f@5cm=3.5%, f@2cm=0.6%
Frame 4: CD=51.70cm, f@10cm=12.6%, f@5cm=3.4%, f@2cm=0.5%
Frame 5: CD=51.92cm, f@10cm=12.4%, f@5cm=3.2%, f@2cm=0.4%
Frame 6: CD=52.10cm, f@10cm=12.2%, f@5cm=3.3%, f@2cm=0.4%
Frame 7: CD=52.30cm, f@10cm=12.2%, f@5cm=3.4%, f@2cm=0.3%
Frame 8: CD=52.42cm, f@10cm=12.4%, f@5cm=3.6%, f@2cm=0.5%
Frame 9: CD=52.33cm, f@10cm=13.5%, f@5cm=3.9%, f@2cm=0.6%
Frame 10: CD=52.25cm, f@10cm=15.3%, f@5cm=4.5%, f@2cm=0.8%
Frame 11: CD=52.21cm, f@10cm=16.6%, f@5cm=4.8%, f@2cm=1.5%
Frame 12: CD=52.14cm, f@10cm=16.6%, f@5cm=5.1%, f@2cm=1.8%
Frame 13: CD=52.00cm, f@10cm=16.3%, f@5cm=5.2%, f@2cm=1.9%
Frame 14: CD=51.61cm, f@10cm=16.2%, f@5cm=5.3%, f@2cm=2.0%
...
Frame 132: CD=58.63cm, f@10cm=8.6%, f@5cm=2.3%, f@2cm=0.8%
Frame 133: CD=58.71cm, f@10cm=8.1%, f@5cm=2.3%, f@2cm=0.8%
Frame 134: CD=59.02cm, f@10cm=7.6%, f@5cm=2.3%, f@2cm=0.8%
Frame 135: CD=59.16cm, f@10cm=7.5%, f@5cm=2.5%, f@2cm=0.9%
Frame 136: CD=59.16cm, f@10cm=7.6%, f@5cm=2.8%, f@2cm=1.0%
Frame 137: CD=58.60cm, f@10cm=8.0%, f@5cm=3.0%, f@2cm=1.2%
Frame 138: CD=57.93cm, f@10cm=8.5%, f@5cm=3.3%, f@2cm=1.4%
Frame 139: CD=57.82cm, f@10cm=9.3%, f@5cm=3.8%, f@2cm=1.4%
Frame 140: CD=57.75cm, f@10cm=10.0%, f@5cm=4.1%, f@2cm=0.6%
Frame 141: CD=57.31cm, f@10cm=10.1%, f@5cm=3.6%, f@2cm=0.6%
Frame 142: CD=56.68cm, f@10cm=10.0%, f@5cm=2.7%, f@2cm=0.7%
Frame 143: CD=55.92cm, f@10cm=9.9%, f@5cm=2.6%, f@2cm=0.7%
Frame 144: CD=55.21cm, f@10cm=9.7%, f@5cm=2.5%, f@2cm=0.7%
Frame 145: CD=54.74cm, f@10cm=9.4%, f@5cm=2.5%, f@2cm=0.7%
Frame 146: CD=54.74cm, f@10cm=9.3%, f@5cm=2.5%, f@2cm=0.8%
Frame 147: CD=54.83cm, f@10cm=9.8%, f@5cm=2.7%, f@2cm=0.8%
Frame 148: CD=54.82cm, f@10cm=10.6%, f@5cm=2.8%, f@2cm=0.9%
Frame 149: CD=54.85cm, f@10cm=11.3%, f@5cm=3.0%, f@2cm=0.9%
Finished evaluation
Avg chamfer dist: 53.71cm
Avg f-score at d=10cm: 10.2%
Avg f-score at d=5cm: 3.7%
Avg f-score at d=2cm: 1.2%
I'm not sure what I did wrong, and I want to know the right way to do it. Could you share your thoughts on this with me?
Thanks.
I think I see the issue. The simplest fix is to set the camera extrinsics to identity in the evaluation code. You can also check whether the reconstruction and GT are aligned by visualizing them in the rendered .mp4 file generated by the eval script. Let me know how it works.
Thank you for your guidance.
I tried 2 methods.
(1) Following your instruction: it would have been easier to find the cause if I had visualized the mesh earlier. In any case, setting the predicted camera matrix to the identity matrix seems to be the solution.
(T_swing 149th frame) (yellow : raw prediction / pink : prediction aligned by ICP (and others))
(T_swing 149th frame) (green : GT mesh / pink : same as above)
And the result is
... # swing scene
Finished evaluation
Avg chamfer dist: 11.37cm
Avg f-score at d=10cm: 83.3%
Avg f-score at d=5cm: 56.3%
Avg f-score at d=2cm: 25.2%
This result seems reasonable, in that PPR's evaluation results on the PPR main page show a chamfer distance around 10 and an F-score around 60.
I found that the evaluation protocol in the PPR branch is different from BANMo's (some changes look trivial but, I think, lead to a large gap), and I was still concerned about the difference from the numbers presented in the paper. The size of the predicted mesh in (1) also doesn't match that of the GT well.
(2) My idea: how about moving the GT & predicted meshes to the origin and scaling the predicted mesh by the coarse bbox size (not the median value as in BANMo, and not the overall coarse bbox as in the PPR branch)?
I changed some lines as below. (I used BANMo's scripts/render_vis.py here; sorry for the confusion.)
# original
...
bbox_max = float((verts_gt.max(1)[0]-verts_gt.min(1)[0]).max().cpu())
verts_gt = obj_to_cam(verts_gt, Rmat_gt, Tmat_gt)
import chamfer3D.dist_chamfer_3D
import fscore
chamLoss = chamfer3D.dist_chamfer_3D.chamfer_3DDist()
## use ICP for ours improve resutls
fitted_scale = verts_gt[...,-1].median() / verts[...,-1].median()
verts = verts*fitted_scale
frts = pytorch3d.ops.iterative_closest_point(verts,verts_gt, \
estimate_scale=False,max_iterations=100)
...
# changed for RAC
...
bbox_max = float((verts_gt.max(1)[0]-verts_gt.min(1)[0]).max().cpu())
verts_gt = obj_to_cam(verts_gt, Rmat_gt, Tmat_gt)
verts_gt -= torch.mean(verts_gt, axis=1)
import chamfer3D.dist_chamfer_3D
import fscore
chamLoss = chamfer3D.dist_chamfer_3D.chamfer_3DDist()
## use ICP for ours improve resutls
verts -= torch.mean(verts, axis=1)
bbox_max_gt = float((verts_gt.max(1)[0] - verts_gt.min(1)[0]).max().cpu())
bbox_max_pred = float((verts.max(1)[0] - verts.min(1)[0]).max().cpu())  # bbox of the predicted verts
verts /= bbox_max_pred
verts *= bbox_max_gt
# fitted_scale = verts_gt[...,-1].median() / verts[...,-1].median()
# verts = verts*fitted_scale
frts = pytorch3d.ops.iterative_closest_point(verts,verts_gt, \
estimate_scale=False,max_iterations=100)
...
And the result is
(also T_swing 149th frame ; green : aligned GT mesh ; orange: aligned predicted mesh)
(swing scene) average CD : 6.773 average F-score 2% : 65.139
(samba scene) average CD : 6.875 average F-score 2% : 65.419
This evaluation result is quite close to the numbers in the paper, and there were no severe problems when visualizing the meshes.
So this is the last question (sorry for making this longer ...).
Because both BANMo and RAC estimate the root pose (= camera extrinsic matrix) and the camera intrinsic matrix, the estimated scale can also be a subject of evaluation, although this is quite an ambiguous and challenging task (as shown in NeRF--, etc.).
In this context, I wonder whether method (2) may be a kind of cheating or not.
In conclusion, what do you think about method (2) above? Is it okay to use (w.r.t. a fair comparison)? I would appreciate it if you could share your insight.
Thanks.
P.S. I found that the meshes at human-48-soft-skel/export0016 have different vertex counts (48007 or 48008). This causes an error here: "all_verts_pred = torch.tensor(all_verts_pred, device=device, dtype=torch.float32)"
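A possible workaround is just a sketch (it assumes the surrounding variables in eval_utils.py, and downstream code that expects one stacked tensor would also need small changes): since frames with 48007 vs. 48008 vertices cannot be stacked into one tensor, keep a list of per-frame tensors instead.

```python
# Hedged sketch: replace the single stacked tensor with a list of per-frame
# tensors when the vertex count varies across frames.
all_verts_pred = [
    torch.tensor(np.asarray(v), device=device, dtype=torch.float32)
    for v in all_verts_pred
]
```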
I have the same problem when evaluating BANMo. Could you please share how you "set the predicted camera matrix to the identity matrix"? Thanks a lot!
@zz7379
Sure.
At compute_metrics.py
pred_mesh_paths = sorted(glob.glob("%s/%s-mesh-*.obj" % (args.testdir, args.pred_prefix))) # What I changed
pred_camera_paths = sorted(glob.glob("%s/%s-cam-*.txt" % (args.testdir, args.pred_prefix))) # What I changed
cameras = np.stack([np.loadtxt(i) for i in pred_camera_paths], 0)
intrinsics = cameras[:, 3]
extrinsics = np.repeat(np.eye(4)[None], len(pred_mesh_paths), axis=0)
# extrinsics[:, :3] = cameras[:, :3] # comment this line !!
I just commented out this line.
Since transforming with the identity matrix means the predicted mesh is left in camera space, this makes sense.
@minsu1206
I am currently using the PPR code for evaluation on the Eagle dataset, and I have a problem. I see that the code needs the ground-truth Camera.Pmat.cal data. May I ask whether this is the root pose? If so, why do its values seem so large?
At compute_metrics.py:
gt_name, gt_cam_id = args.gt_seq.split("-")
gt_cam_path = "%s/%s/calibration/Camera%s.Pmat.cal" % (ama_path, gt_name, gt_cam_id)
intrinsics_gt, Gmat_gt = load_ama_intrinsics(gt_cam_path)
Thanks a lot!
@minsu1206 What you did looks good. The metrics are designed to measure surface reconstruction quality (in the sense that it is not entangled with camera extrinsics or intrinsics), and we don't expect the method to estimate scale correctly due to the fundamental scale ambiguity. Therefore, we fit the scale and SE(3) before eval.
It should be adopted if (2) generally aligns the prediction with the GT better than (1). In the RAC eval, I recall what we did was to align using the object height instead of the median depth, as follows. We also use the first frame to find the scale and SE(3) and apply them to the rest of the frames, to make sure they are consistent over time.
focal_correction_ratio = np.sqrt(focal[0]*focal[1]/K[0,0]/K[1,1])
verts[...,2] -= verts[...,2].mean() * (1-1./focal_correction_ratio)
if i==0:
    fitted_scale = ((verts_gt[...,-1].max()+verts_gt[...,-1].min())/2) /\
                   ((verts[...,-1].max()+verts[...,-1].min())/2)  # this is more accurate
verts = verts*fitted_scale
if i==0:
    frts = pytorch3d.ops.iterative_closest_point(verts, verts_gt,
                estimate_scale=False, max_iterations=100)
verts = ((frts.RTs.s*verts).matmul(frts.RTs.R)+frts.RTs.T[:,None])
@JAMESYJL Pmat is defined as the GT object-to-camera transformation. If you are using the eagle data in BANMo, you probably want to set it to identity for camera view 0, as here.
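A rough sketch of that tweak (assuming Gmat_gt is a 4x4 numpy object-to-camera matrix; adapt it to however the GT extrinsics are actually loaded):

```python
# Hedged sketch: for the eagle/hands case, replace the loaded GT object-to-camera
# matrix with identity so the GT mesh stays in camera view 0.
Gmat_gt = np.eye(4)  # instead of the matrix parsed from Camera.Pmat.cal
```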
@gengshan-y Sorry for the late reply, and thank you for your detailed explanation. I am currently encountering a new problem and would like to consult with you. I am using the lab4d-PPR code to evaluate the Eagle and Hands datasets, but I got strange results. Regarding the root pose, should both the ground-truth and predicted extrinsics be set to identity? In BANMo's preprocessing, there is a cam.txt file in the folder that seems to store the ground-truth extrinsics. Can it be directly loaded as the ground-truth extrinsics? Sorry for asking so much, thank you very much!
I believe only GT needs to be set to identity for eagle and hands sequences.
Thanks for the reply! When I tested the Eagle dataset, I set only the GT to identity, and the CD value I got was very abnormal, around 57. When I set both the GT and the prediction to identity, the result was 11. Comparing the GT mesh and the predicted mesh I output, 11 seems to be the more accurate result, but I am not sure if this is correct. Thank you for your reply!
Sorry for the late reply.
Thank you for sharing your insight. The key point is whether to use a scale that is consistent over time, i.e., whether to account for the temporal effect or not.
I hope my method has no logical error, and I believe the intention of the evaluation should decide whether the scale carries temporal context or not.
All the conversations here helped me a lot. I'll close the issue.
Hi, thank you for sharing this nice work. Some of the advanced utilities (e.g. visualization) are great!
I want to reproduce the evaluation results in the RAC paper, specifically for the Samba and Swing scenes (as in BANMo).
But I found that there is not enough explanation about the evaluation of RAC on the RAC & Lab4D GitHub. I tried to evaluate the results of RAC following the BANMo code (render_vis.py) and ran into some issues.
I will describe how I trained the RAC model before going into the details of the issues.
I followed the code at https://lab4d-org.github.io/lab4d/tutorials/category_model.html. There were no changes except the inst_id at export, from
python lab4d/export.py --flagfile=logdir/$logname/opts.log --load_suffix latest --inst_id 0
to
python lab4d/export.py --flagfile=logdir/$logname/opts.log --load_suffix latest --inst_id 16
or
python lab4d/export.py --flagfile=logdir/$logname/opts.log --load_suffix latest --inst_id 20
because the 16th video in human-48 is the same as the 1st video of T_samba, and the 20th video in human-48 is the same as the 1st video of T_swing. Initially, I used the preprocessed data from lab4d, and later tried the dataset without skipping frames by following https://lab4d-org.github.io/lab4d/tutorials/preprocessing.html (indeed, the latter was not helpful for solving the issues).
And then, I met two issues
1. No root pose of RAC
Could you provide guidance on how to extract the root pose from RAC? In my experience with BANMo, the root pose was necessary to align the GT mesh and the predicted mesh for evaluation. But the result of
python lab4d/export.py
doesn't contain root pose information (e.g. cam-00000.txt).
2. Strange mesh result
This is my result: the mesh at the last frame of the T_samba1 video (visualized in MeshLab).
I expected a result like the one below.
As you can see, wrinkles appear at the legs, and the coarse shape does not match the figure in the paper.
I would appreciate your help in resolving these issues.
Best regards.