tjiiv-cprg / EPro-PnP

[CVPR 2022 Oral, Best Student Paper] EPro-PnP: Generalized End-to-End Probabilistic Perspective-n-Points for Monocular Object Pose Estimation
https://www.youtube.com/watch?v=TonBodQ6EUU
Apache License 2.0

unstable (or not stable enough) translation of far objects #30

Closed qinyq closed 2 years ago

qinyq commented 2 years ago

Hi, this is really fantastic work. I tried the 4DoF model and it works seamlessly, especially when the cars are nearby and not truncated. However, I noticed that for objects that are somewhat far from the camera (e.g. > 80 meters), the translation becomes quite unstable. For example, in frame n-1 the object might be 85 meters away, but in frame n it may be 80 or 90 meters away. I understand that the relative precision might still be acceptable, but the absolute error cannot be ignored. So I'm wondering whether there is any approach to reduce the error for far objects (at either training or inference time)?

Lakonik commented 2 years ago

Generally speaking, multi-frame models give more stable predictions. However, the nuScenes benchmark only evaluates objects within 50 meters, and distant objects are mostly ignored during training, so the predictions will be highly inaccurate beyond 80 meters. The underlying reason is the difficulty of annotation: the 32-beam LiDAR is not dense enough to capture those distant objects.

qinyq commented 2 years ago

Yes, that's the problem. Would it be better if we had denser LiDAR points on distant objects? Do you have any plans to integrate EPro-PnP with multi-camera and temporal sequences?

Lakonik commented 2 years ago

Not yet. I don't have enough time to work on temporal models at the moment.

qinyq commented 2 years ago

Glad to hear that. I guess you would prefer to dig deeper into the mathematical methods.

I noticed that the inference outputs include pose_samples and pose_sample_weights. So if y* is not optimal (neither precise nor stable over a temporal sequence), is it possible that the optimal pose is contained in these pose samples? I'm wondering whether these samples could help a human adjust the pose manually (i.e. select one of the locally optimal poses and check whether that pose is better).
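To make the idea of picking candidates by hand concrete, here is a minimal sketch that ranks samples by weight. The array layout is an assumption, not something confirmed in this thread: `pose_samples` is taken to be an (N, 4) array with columns x, y, z, yaw, and `pose_sample_weights` an (N,) array of non-negative linear weights. The helper name `top_k_pose_candidates` is hypothetical and not part of EPro-PnP.

```python
import numpy as np

def top_k_pose_candidates(pose_samples, pose_sample_weights, k=5):
    """Return the k highest-weight samples with their normalized weights.

    Assumed layout (hypothetical): pose_samples is (N, 4) holding
    x, y, z, yaw; pose_sample_weights is (N,) and non-negative.
    """
    samples = np.asarray(pose_samples, dtype=float)
    weights = np.asarray(pose_sample_weights, dtype=float)
    weights = weights / weights.sum()          # normalize to sum to 1
    order = np.argsort(weights)[::-1][:k]      # indices, largest weight first
    return samples[order], weights[order]
```

If the weights turn out to be stored as log-weights, they would need to be exponentiated before normalizing.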

Lakonik commented 2 years ago

The samples approximate the predicted distribution. If you check out the BEV visualizations, the distributions of the translation components look very much like a multivariate Gaussian, and y* is usually at the very center. So it is actually the distribution itself that is inconsistent (unstable) over a temporal sequence. It is possible, though, to use some temporal filtering (either as post-processing or integrated into an end-to-end temporal model) to fuse the distributions of adjacent frames, which should provide smoother and more accurate predictions.
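As a rough illustration of the post-processing option, the sketch below fits a Gaussian to each frame's weighted translation samples and fuses consecutive frames with a Kalman-style update. Everything here is an assumption rather than the authors' method: the samples are taken to be (N, 3) translations expressed in a common, ego-motion-compensated frame, the object is modeled as a random walk with a user-chosen `process_noise` covariance, and all function names are hypothetical.

```python
import numpy as np

def fit_gaussian(samples, weights):
    """Weighted mean and covariance of translation samples of shape (N, 3)."""
    samples = np.asarray(samples, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    mean = w @ samples
    centered = samples - mean
    cov = (w[:, None] * centered).T @ centered
    # Small diagonal term keeps the covariance invertible.
    return mean, cov + 1e-6 * np.eye(samples.shape[1])

def fuse_gaussians(mean_a, cov_a, mean_b, cov_b):
    """Fuse two Gaussian estimates in information (inverse-covariance) form."""
    info_a = np.linalg.inv(cov_a)
    info_b = np.linalg.inv(cov_b)
    cov = np.linalg.inv(info_a + info_b)
    mean = cov @ (info_a @ mean_a + info_b @ mean_b)
    return mean, cov

def filter_sequence(means, covs, process_noise):
    """Kalman filter over per-frame Gaussians with a random-walk motion model."""
    state_mean, state_cov = means[0], covs[0]
    fused = [state_mean]
    for mean, cov in zip(means[1:], covs[1:]):
        predicted_cov = state_cov + process_noise   # predict: uncertainty grows
        state_mean, state_cov = fuse_gaussians(     # update with the new frame
            state_mean, predicted_cov, mean, cov)
        fused.append(state_mean)
    return np.stack(fused)
```

A larger `process_noise` trusts each new frame more and smooths less. For fast-moving objects, a constant-velocity state would likely be a better fit than the random-walk assumption used here.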

qinyq commented 2 years ago

Thank you!