Evaluation differences compared to prior work

jefftan969 commented 2 weeks ago

Thanks for your great work, the results are amazing!

Just curious why the evaluation tables in MoGe often have different baseline numbers than the numbers reported in the original papers?

Here are some examples:

DUSt3R comparisons on NYUv2
- In MoGe paper, Table 2 (scale-invariant pointmap): Rel 5.56, Delta1 97.1
- In MoGe paper, Table 2 (affine-invariant pointmap): Rel 4.49, Delta1 97.4
- In MoGe paper, Table 3 (scale-invariant depth): Rel 4.43, Delta1 97.1
- In DUSt3R paper, Table 2 (depth): Rel 6.50, Delta 94.09

DUSt3R comparisons on KITTI
- In MoGe paper, Table 2 (scale-invariant pointmap): Rel 21.9, Delta1 63.6
- In MoGe paper, Table 2 (affine-invariant pointmap): Rel 18.0, Delta1 66.7
- In MoGe paper, Table 3 (scale-invariant depth): Rel 7.71, Delta1 90.9
- In DUSt3R paper, Table 2 (depth 512): Rel 10.74, Delta 86.60

Marigold comparisons on NYUv2
- In MoGe paper, Table 3 (affine-invariant depth): Rel 4.63, Delta1 97.3
- In Marigold paper, Table 1 (depth w/ ensemble): Rel 5.5, Delta1 96.4

Marigold comparisons on KITTI
- In MoGe paper, Table 3 (affine-invariant depth): Rel 7.29, Delta1 93.8
- In Marigold paper, Table 1 (depth w/ ensemble): Rel 9.9, Delta1 91.6

Marigold comparisons on ETH3D
- In MoGe paper, Table 3 (affine-invariant depth): Rel 6.08, Delta1 96.3
- In Marigold paper, Table 1 (depth w/ ensemble): Rel 6.5, Delta1 96.0

Marigold comparisons on DIODE
- In MoGe paper, Table 3 (affine-invariant depth): Rel 6.34, Delta1 94.3
- In Marigold paper, Table 1 (depth w/ ensemble): Rel 30.8, Delta1 77.3

DepthAnythingV2 comparisons on Sintel
- In MoGe paper, Table 3 (affine-invariant disparity): Rel 21.4, Delta1 72.8
- In DepthAnythingV2 paper, Table 5 (take their best result): Rel 48.7, Delta1 75.2

EasternJournalist commented 2 weeks ago

Hi. Thanks for your interest and this valuable question. The evaluation results differ because we process the datasets differently from these works. We carefully processed the raw evaluation datasets to ensure the reliability of the ground-truth data (e.g., removing inaccurate regions and cropping).

To maintain a fair comparison, we re-evaluated all baselines in this paper with the same processed data, under their default/recommended settings, rather than simply adopting the performance metrics reported in their original papers.

Each dataset underwent specific processing to ensure a reliable evaluation. Full details are provided in Section B.2 of our supplementary material https://arxiv.org/pdf/2410.19115.

For instance, we omit sky regions in the Sintel dataset because sky depth is not quantifiable: the "ground truth depth" could be any large value, such as 34, 50, or even beyond 2000. Evaluating models with sky depth included is not meaningful, which might be why you observe 48.7% AbsRel from DepthAnythingV2 (an incredibly large relative error!) while their 75.2% Delta1 is normal. In some test cases, their predicted depths align with the sky, disregarding the actual foreground objects.
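To make the sky effect concrete, here is a toy sketch (made-up numbers, not the MoGe evaluation code; the helper names `abs_rel` and `delta1` are just for illustration). It assumes the usual definitions: AbsRel is the mean of |pred - gt| / gt, and Delta1 is the fraction of pixels with max(pred/gt, gt/pred) < 1.25. A single sky pixel with an arbitrary "ground truth" can dominate AbsRel while Delta1 moves much less.

```python
def abs_rel(pred, gt, mask=None):
    """Mean of |pred - gt| / gt over the pixels selected by mask."""
    if mask is None:
        mask = [True] * len(gt)
    errs = [abs(p - g) / g for p, g, m in zip(pred, gt, mask) if m]
    return sum(errs) / len(errs)


def delta1(pred, gt, mask=None):
    """Fraction of selected pixels with max(pred/gt, gt/pred) < 1.25."""
    if mask is None:
        mask = [True] * len(gt)
    hits = [max(p / g, g / p) < 1.25 for p, g, m in zip(pred, gt, mask) if m]
    return sum(hits) / len(hits)


# Hypothetical data: four foreground pixels plus one sky pixel (last entry).
# The sky "ground truth" of 34 is arbitrary, as is the prediction of 500.
gt = [2.0, 4.0, 5.0, 8.0, 34.0]
pred = [2.1, 3.9, 5.2, 7.8, 500.0]
not_sky = [True, True, True, True, False]

print("with sky:    AbsRel", abs_rel(pred, gt), "Delta1", delta1(pred, gt))
print("without sky: AbsRel", abs_rel(pred, gt, not_sky),
      "Delta1", delta1(pred, gt, not_sky))
```

In this toy case the one sky pixel pushes AbsRel far above 100%, while Delta1 only drops by that pixel's share of the image.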

Another example is the DIODE dataset, where we remove boundary artifacts. This practice is also adopted by some previous works (e.g., UniDepth) to prevent significant bias from ground truth artifacts. In contrast, Marigold may not have applied such preprocessing, which could explain why their reported 30.8% AbsRel is potentially misleading.

We are committed to ensuring that our evaluation process is both reproducible and transparent. To that end, we plan to release our code for evaluation. Please stay tuned for further announcements regarding its availability.

EasternJournalist commented 1 week ago

Apart from the differences in data processing, it is also crucial to consider the evaluation configurations, such as 'metric depth', 'scale-invariant depth', 'affine-invariant depth', and 'affine-invariant disparity (inverse depth)'. These terms refer to various ways of aligning the predicted depth values with the ground truth, considering factors like scale and shift adjustments. The key distinction lies in whether and how the scale and offset are calibrated against the ground truth.

For example, 'scale-invariant' measures adjust for any uniform scaling in depth predictions, while 'affine-invariant' methods compensate for both scaling and shifting. Each configuration can yield significantly different results due to these adjustments. Thus, it's important to understand that the performance numbers reported are only directly comparable when they are calculated under the same evaluation framework. Ensuring consistency in this aspect is essential for fair and meaningful comparisons across different methods.
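As a minimal sketch of the difference (assuming a plain least-squares alignment, which is not necessarily the exact alignment used in the paper; `align_scale` and `align_affine` are hypothetical helper names), the same prediction can score very differently depending on which alignment is applied before computing AbsRel:

```python
def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def align_scale(pred, gt):
    """Scale-invariant setting: find s minimizing sum((s * p - g)^2)."""
    s = sum(p * g for p, g in zip(pred, gt)) / sum(p * p for p in pred)
    return [s * p for p in pred]


def align_affine(pred, gt):
    """Affine-invariant setting: find s, t minimizing sum((s * p + t - g)^2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gt) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gt))
    var = sum((p - mean_p) ** 2 for p in pred)
    s = cov / var
    t = mean_g - s * mean_p
    return [s * p + t for p in pred]


# Hypothetical prediction that is correct up to an unknown scale AND shift:
# pred = 0.5 * gt + 1
gt = [1.0, 2.0, 4.0, 8.0]
pred = [1.5, 2.0, 3.0, 5.0]

print("scale-invariant  AbsRel:", abs_rel(align_scale(pred, gt), gt))
print("affine-invariant AbsRel:", abs_rel(align_affine(pred, gt), gt))
```

Here the affine alignment recovers the ground truth exactly, while the scale-only alignment leaves a large residual. For affine-invariant disparity, the same affine fit would be applied to inverse depth (1/pred against 1/gt) instead of depth.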

microsoft / MoGe

Evaluation differences compared to prior work #11