tjiiv-cprg / EPro-PnP

[CVPR 2022 Oral, Best Student Paper] EPro-PnP: Generalized End-to-End Probabilistic Perspective-n-Points for Monocular Object Pose Estimation
https://www.youtube.com/watch?v=TonBodQ6EUU
Apache License 2.0

About sampling points during training #6

Closed shanice-l closed 2 years ago

shanice-l commented 2 years ago

Hi, authors!

Recently I've been following your work EPro-PnP-6DoF.

I have a question about sampling points during training. I noticed that you randomly sample 512 points from the output x2d/x3d images (code here). However, some of the sampled points may fall on the background, which could disturb the pose estimation.

Have you ever considered this or conducted any experiments?

shanice-l commented 2 years ago

I think this problem could be solved with a well-trained mask (w2d), but the network does not apply any direct supervision to the mask. So I still consider this a problem.

Lakonik commented 2 years ago

Hi! Thank you for your interest in our work.

Since the sampled points are evenly distributed in the image (not affected by the mask at all), we think it is a decent approximation to using all 64x64 points, although no experiments have been conducted.

Anyway, we do not actually recommend using this sampling strategy to train a new model. We do this only because we seek minimal modification to the original network (where 64x64 points seem to be too many) for a strict comparison, while maintaining a reasonable training time.

If you'd like to find out if the point sampling actually disturbs the training, you may try training without sampling. Alternatively you can reduce the number of Monte Carlo samples (e.g. from 512 to 128 as in EPro-PnP-Det v1b) for faster training.
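For readers following along, here is a minimal sketch of the kind of mask-agnostic sampling discussed in this comment. It is not the repository's code: the function names, the sampling without replacement, and the square foreground mask are all assumptions made for illustration. It only shows that uniformly sampled points land on the background in proportion to the background area.

```python
import random

def sample_point_indices(height=64, width=64, num_samples=512, seed=None):
    """Pick num_samples distinct pixel indices uniformly from a height x width grid.

    Hypothetical helper: sampling without replacement is an assumption.
    """
    rng = random.Random(seed)
    flat = rng.sample(range(height * width), num_samples)
    return [(i // width, i % width) for i in flat]

def foreground_fraction(indices, mask):
    """Fraction of sampled pixels that fall inside a boolean foreground mask."""
    hits = sum(1 for r, c in indices if mask[r][c])
    return hits / len(indices)

# Hypothetical mask: the object covers the central 32x32 block (25% of the map).
mask = [[16 <= r < 48 and 16 <= c < 48 for c in range(64)] for r in range(64)]
points = sample_point_indices(seed=0)
frac = foreground_fraction(points, mask)
print(f"{len(points)} points sampled, {frac:.2%} on the foreground")
```

Because the sampling ignores the mask, the foreground share of the samples tracks the foreground share of the full 64x64 grid, which is the sense in which the subset approximates using all points.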

shanice-l commented 2 years ago

Hi Hansheng!

Thanks for your timely reply!

But I think my question isn't about using too many points. According to Equation (1) in your paper, x_i^3D belongs to the object model and x_i^2D lies inside the mask of the image crop. So I think the Monte Carlo forward process should sample points inside the mask. If w2d is well trained, points outside the mask should be assigned zero weight. But the network does not directly supervise w2d, so I suspect this could affect training.
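The zero-weight argument in this comment can be illustrated with a small sketch. It assumes a weighted reprojection cost of the form 0.5 * sum_i ||w_i * (project(R x3d_i + t) - x2d_i)||^2, fixes the rotation to identity, and uses a unit-focal pinhole camera. All names and numbers are hypothetical and are not taken from the repository.

```python
def project(point_cam, focal=1.0):
    """Pinhole projection of a 3D point given in camera coordinates."""
    x, y, z = point_cam
    return (focal * x / z, focal * y / z)

def weighted_reprojection_cost(x3d, x2d, w2d, translation):
    """0.5 * sum of squared weighted reprojection errors (rotation = identity)."""
    cost = 0.0
    for p3, p2, w in zip(x3d, x2d, w2d):
        cam = tuple(a + b for a, b in zip(p3, translation))
        u, v = project(cam)
        cost += 0.5 * ((w[0] * (u - p2[0])) ** 2 + (w[1] * (v - p2[1])) ** 2)
    return cost

t = (0.0, 0.0, 2.0)
x2d = [(0.05, 0.0), (0.4, 0.4)]
w2d = [(1.0, 1.0), (0.0, 0.0)]           # second point has zero weight
x3d_a = [(0.1, 0.0, 0.0), (5.0, -3.0, 0.2)]
x3d_b = [(0.1, 0.0, 0.0), (-1.0, 7.0, 0.5)]  # only the zero-weight point changes

cost_a = weighted_reprojection_cost(x3d_a, x2d, w2d, t)
cost_b = weighted_reprojection_cost(x3d_b, x2d, w2d, t)

# With a nonzero weight, the same change does alter the cost.
w2d_on = [(1.0, 1.0), (1.0, 1.0)]
cost_on_a = weighted_reprojection_cost(x3d_a, x2d, w2d_on, t)
cost_on_b = weighted_reprojection_cost(x3d_b, x2d, w2d_on, t)
```

Under these assumptions, a point whose weight is zero has no influence on the cost whatever its x3d value is, so the open question in the thread is only whether the learned w2d actually ends up small on the background.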

Lakonik commented 2 years ago

Actually there is no binary mask at all. The 2D points (x2d) shall cover every single pixel in the dense output, not just the foreground.

shanice-l commented 2 years ago

Ummm. The output will cover each pixel, but the x3d outside the mask is meaningless:

loss_rot = criterions[cfg.loss.rot_loss_type](loss_msk_var[:, :3] * noc, loss_msk_var[:, :3] * target_var[:, :3])

Only the foreground x3d is supervised.
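To make the point concrete, here is a toy version of a masked regression loss. It assumes an L1 criterion (the actual one is selected by cfg.loss.rot_loss_type and is not shown in this thread) and works on flat lists with a hypothetical function name. Multiplying both prediction and target by the mask removes the background entries from the loss.

```python
def masked_l1_loss(pred, target, mask):
    """Mean absolute error after multiplying prediction and target by the mask.

    Assumption: L1 is used here only for illustration.
    """
    total = 0.0
    for p, t, m in zip(pred, target, mask):
        total += abs(m * p - m * t)
    return total / len(pred)

target = [0.25, 0.5, 0.0]
mask = [1.0, 1.0, 0.0]        # last entry is a background pixel
pred_a = [0.2, 0.9, -5.0]
pred_b = [0.2, 0.9, 123.0]    # only the background prediction differs

loss_a = masked_l1_loss(pred_a, target, mask)
loss_b = masked_l1_loss(pred_b, target, mask)
print(loss_a, loss_b)
```

The two losses are identical, so this term alone sends no gradient to background predictions; any signal for those pixels has to come from another loss.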

Lakonik commented 2 years ago

In EPro-PnP, the coordinate regression loss (loss_rot) is regarded as an auxiliary loss for introducing extra geometrical supervision. You can even remove loss_rot and the training still works (79.46 in Fig. 7). In a nutshell, the background points are all handled by backpropagating the Monte Carlo pose loss. They don't necessarily have to be physical points lying on the surface of the object (see EPro-PnP-Det).

shanice-l commented 2 years ago

It makes sense.