lpiccinelli-eth / UniDepth

Universal Monocular Metric Depth Estimation

About the training data of v2 #42

Closed: EasternJournalist closed this issue 4 months ago

EasternJournalist commented 4 months ago

Hello! Thanks for the great work.

I evaluated the v2 version on NYU Depth v2 and KITTI, and was surprised to find that the results are extremely good on both datasets. The FoV relative error is 3.7%, and the affine-invariant AbsRel depth error is below 3.5%, which significantly surpasses all SOTA competitors. Since v2 performs so much better than v1, I am wondering whether v2 is still trained on the same data as v1?
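For reference, this is roughly how I computed the two numbers above. It is my own sketch (NumPy, hypothetical helper names), not the official evaluation code; it assumes a valid-pixel mask and that the FoV is recovered from the focal length.

```python
import numpy as np

def absrel_affine_invariant(pred, gt, mask):
    """AbsRel after a least-squares scale/shift alignment of pred to gt."""
    p, g = pred[mask], gt[mask]
    A = np.stack([p, np.ones_like(p)], axis=1)            # [N, 2]
    scale, shift = np.linalg.lstsq(A, g, rcond=None)[0]   # fit g ~ scale * p + shift
    return np.mean(np.abs((scale * p + shift) - g) / g)

def fov_rel_error(f_pred, f_gt, width):
    """Relative error of the horizontal FoV recovered from the focal length."""
    fov = lambda f: 2.0 * np.arctan(width / (2.0 * f))
    return abs(fov(f_pred) - fov(f_gt)) / fov(f_gt)
```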

lpiccinelli-eth commented 4 months ago

V2 is trained on more data, as the new decoder design allows training on non-metric data as well. We added HRWSI, MegaDepth, ARKitScenes, DIML (indoor), and HyperSim.

The following problems arise with those datasets:

  1. ARKitScenes uses iPad LiDAR, which presents many artifacts
  2. MegaDepth, HRWSI, and DIML have quite blurry and noisy depth

One consideration from my side: NYU and KITTI are historical benchmarks for depth, but the "real" zero-shot ability measured on those datasets is only partial. That is why in the paper we tried to test on many different datasets, although even that was not totally comprehensive. Unfortunately for the field, a real-world and large-scale evaluation for (metric) depth is still missing.
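For context, the rough idea behind mixing in non-metric supervision is a per-image scale-and-shift alignment before computing the loss (in the spirit of MiDaS). The snippet below is only a generic sketch of that idea, not our exact implementation.

```python
import torch

def ssi_l1_loss(pred, gt, mask):
    """L1 loss after a closed-form per-image scale/shift alignment of pred to gt."""
    p, g = pred[mask], gt[mask]
    A = torch.stack([p, torch.ones_like(p)], dim=1)    # [N, 2]
    # 2x2 normal equations for the least-squares scale and shift
    scale, shift = torch.linalg.solve(A.T @ A, A.T @ g)
    return torch.mean(torch.abs(scale * p + shift - g))
```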

EasternJournalist commented 4 months ago

Thanks for your reply! I totally agree that it is difficult to set up a benchmark for zero-shot ability in monocular depth estimation; I am also working on such projects. I am shocked to see how fast the SOTA performance on those popular evaluation sets is being pushed higher, yet the results are still unsatisfying on real-world casual images...

Would you let me know whether there will be a standalone paper for v2? It seems there are lots of new contributions compared to v1. (I am just curious; it is okay if you prefer to keep the answer to yourselves.)

jlazarow commented 4 months ago

(Apologies for bumping a closed thread.) @lpiccinelli-eth have you tried using the HR depth maps from ARKitScenes? They are rendered from a Faro sensor acquisition (similar to DIODE), and if you have feedback, I would be happy to talk with the team to figure it out. AFAIK, the only artifacts should be holes where there was simply no data or where objects are reflective/translucent.

lpiccinelli-eth commented 4 months ago

I have not tried using the HR depth for two main reasons: first, it covers less than half of the scenes (which is actually not a big deal); second, the ARKitScenes IMU poses for the low-res images (the only ones available) are quite noisy, so the projection is not correct.

I could directly use only the "upsampling" subset, but one of ARKitScenes' strengths, imho, is its video information, which is needed for other projects. Therefore, I would really appreciate it if the camera poses corresponding to the high-res images (aligned with Faro) were released!
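To make the projection issue concrete, this is essentially the operation that noisy poses corrupt (a generic sketch, not our data pipeline): the depth is unprojected with the intrinsics and warped into the target frame with the relative pose, so any pose error directly misaligns depth and RGB.

```python
import numpy as np

def reproject_depth(depth, K, T_src_to_tgt):
    """depth: [H, W] metric depth; K: [3, 3] intrinsics; T: [4, 4] relative pose.
    Assumes the same intrinsics K for the source and target views."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # [3, HW]
    pts = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)               # backproject to 3D
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])                # homogeneous [4, HW]
    pts_tgt = (T_src_to_tgt @ pts_h)[:3]                                # move to target frame
    proj = K @ pts_tgt
    uv = proj[:2] / proj[2:3]                                           # reprojected pixels
    return uv.reshape(2, H, W), pts_tgt[2].reshape(H, W)                # pixels + new depth
```

With a noisy pose T, the reprojected depth simply lands on the wrong pixels, which is the misalignment I was referring to.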

jlazarow commented 4 months ago

Thanks for the feedback. Just to make sure I'm on the same page:

  1. You'd like high frequency (e.g. > 2 Hz) Faro GT frames for as many videos as possible (we've recovered more).
  2. Additionally, you'd like the "GT" (Faro registered to iPhone space) pose alongside those frames (I believe you're using the estimated ARKit pose instead, which is quite noisy). But this is probably beyond what is needed for pure depth estimation (but could power other projects/interests).

lpiccinelli-eth commented 4 months ago

Yes, exactly, this would be really nice to have.

Concerning the second point, having the poses would allow the dataset to be used for 3D reconstruction projects, too, in addition to multi-view (video) depth estimation.