mbanani / probe3d

[CVPR 2024] Probing the 3D Awareness of Visual Foundation Models

DINO Depth Evaluation Metrics Slightly Different From Paper #2

Closed · gheinrich closed this 1 week ago

gheinrich commented 7 months ago

Hello, thank you very much for this excellent work!

I am trying to reproduce the paper results on Depth Estimation using NAVI and DINO B16:

$ python train_depth.py backbone=dino_b16 +backbone.return_multilayer=True dataset=navi
...
2024-04-16 10:47:42.672 | INFO     | __main__:train_model:273 - Evaluating on test split of NAVI_wild_all
2024-04-16 10:48:10.552 | INFO     | __main__:train_model:276 - Scale-Aware Final test loss       | 2.0804
2024-04-16 10:48:10.553 | INFO     | __main__:train_model:278 - Final test SA d1         | 0.5971
2024-04-16 10:48:10.553 | INFO     | __main__:train_model:278 - Final test SA d2         | 0.8979
2024-04-16 10:48:10.553 | INFO     | __main__:train_model:278 - Final test SA d3         | 0.9807
2024-04-16 10:48:10.553 | INFO     | __main__:train_model:278 - Final test SA rmse       | 0.0514
2024-04-16 10:48:36.791 | INFO     | __main__:train_model:285 - Scale-Invariant Final test loss       | 2.0804
2024-04-16 10:48:36.791 | INFO     | __main__:train_model:287 - Final test SI d1         | 0.9904
2024-04-16 10:48:36.791 | INFO     | __main__:train_model:287 - Final test SI d2         | 0.9998
2024-04-16 10:48:36.791 | INFO     | __main__:train_model:287 - Final test SI d3         | 1.0000
2024-04-16 10:48:36.791 | INFO     | __main__:train_model:287 - Final test SI rmse       | 0.0138

This is different from the corresponding row in Table 2 of the paper (table screenshot omitted).

For example, the NAVI SI RMSE for DINO B16 is 0.1043 in the paper vs. 0.0138 in my run.

Am I misinterpreting the results?

Thank you!

gheinrich commented 7 months ago

Ah, do I need to use navi_reldepth? With dinov2_b14, that gives me the following, which is close to (though slightly below) the results from the paper:

2024-04-17 06:09:04.545 | INFO     | __main__:train_model:285 - Scale-Invariant Final test loss       | 4.6468
2024-04-17 06:09:04.545 | INFO     | __main__:train_model:287 - Final test SI d1         | 0.5987
2024-04-17 06:09:04.545 | INFO     | __main__:train_model:287 - Final test SI d2         | 0.8322
2024-04-17 06:09:04.545 | INFO     | __main__:train_model:287 - Final test SI d3         | 0.9159
2024-04-17 06:09:04.546 | INFO     | __main__:train_model:287 - Final test SI rmse       | 0.0948
Linusnie commented 7 months ago

Hi, I am facing similar issues. Here is what I got when training the probes on dino_b16, dinov2_b14, and dinov2_l14, all with dataset=navi_reldepth.

dino_b16:

2024-04-19 19:38:14.764 | INFO     | __main__:train_model:290 - Scale-Invariant Final test loss       | 5.9778
2024-04-19 19:38:14.787 | INFO     | __main__:train_model:292 - Final test SI d1         | 0.4684
2024-04-19 19:38:14.827 | INFO     | __main__:train_model:292 - Final test SI d2         | 0.7336
2024-04-19 19:38:14.858 | INFO     | __main__:train_model:292 - Final test SI d3         | 0.8570
2024-04-19 19:38:14.889 | INFO     | __main__:train_model:292 - Final test SI rmse       | 0.1304

dinov2_b14:

2024-04-21 16:33:16.032 | INFO     | __main__:train_model:290 - Scale-Invariant Final test loss       | 4.4591
2024-04-21 16:33:16.049 | INFO     | __main__:train_model:292 - Final test SI d1         | 0.6159
2024-04-21 16:33:16.065 | INFO     | __main__:train_model:292 - Final test SI d2         | 0.8415
2024-04-21 16:33:16.086 | INFO     | __main__:train_model:292 - Final test SI d3         | 0.9220
2024-04-21 16:33:16.116 | INFO     | __main__:train_model:292 - Final test SI rmse       | 0.0915

dinov2_l14:

2024-04-21 18:31:51.140 | INFO     | __main__:train_model:290 - Scale-Invariant Final test loss       | 4.1440
2024-04-21 18:31:51.164 | INFO     | __main__:train_model:292 - Final test SI d1         | 0.6495
2024-04-21 18:31:51.185 | INFO     | __main__:train_model:292 - Final test SI d2         | 0.8603
2024-04-21 18:31:51.208 | INFO     | __main__:train_model:292 - Final test SI d3         | 0.9307
2024-04-21 18:31:51.231 | INFO     | __main__:train_model:292 - Final test SI rmse       | 0.0840

All of them are a bit worse than Table 2 in the paper. Any idea what could be missing?

For reference, this is the config it prints for dinov2_l14:

2024-04-21 17:33:14.088 | INFO     | __main__:train_model:226 - Config:
optimizer:
  probe_lr: 0.0005
  model_lr: 0.0
  n_epochs: 10
  warmup_epochs: 1.5
backbone:
  _target_: evals.models.dino.DINO
  dino_name: dinov2
  model_name: vitl14
  output: dense-cls
  layer: -1
  return_multilayer: true
dataset:
  _target_: evals.datasets.navi.NAVI
  path: /storage/user/hael/data/navi_v1
  image_mean: imagenet
  augment_train: true
  bbox_crop: true
  relative_depth: true
probe:
  _target_: evals.models.probes.DepthHead
  min_depth: 0.001
  max_depth: 10
  head_type: dpt
  prediction_type: bindepth
  hidden_dim: 512
  kernel_size: 3
system:
  random_seed: 8
  num_gpus: 1
  port: 12355
note: ''
batch_size: 8
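
(In case it helps anyone reading along: a Hydra config like the one above gets turned into objects roughly as in the sketch below. This is only an illustration of the config structure under my assumptions, not the repo's actual training code, and it assumes the evals package is importable.)

```python
# Rough sketch of how the printed Hydra config maps to objects (illustrative only).
from hydra.utils import instantiate
from omegaconf import OmegaConf

cfg = OmegaConf.create("""
backbone:
  _target_: evals.models.dino.DINO
  dino_name: dinov2
  model_name: vitl14
  output: dense-cls
  layer: -1
  return_multilayer: true
probe:
  _target_: evals.models.probes.DepthHead
  min_depth: 0.001
  max_depth: 10
  head_type: dpt
  prediction_type: bindepth
  hidden_dim: 512
  kernel_size: 3
""")

backbone = instantiate(cfg.backbone)  # builds evals.models.dino.DINO(dino_name="dinov2", ...)
probe = instantiate(cfg.probe)        # builds evals.models.probes.DepthHead(min_depth=0.001, ...)
```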
mbanani commented 7 months ago

Hi everyone! Thanks for raising the issue. I suspect the difference is due to the batch size. I had trained those models on 4 gpus with 4x the batch size. While cleaning the code, I reverted to 1 gpu to make the code more usable; my apologies that this wasn't clear in the current code base.

I'll try to verify this once I can, but unfortunately, I won't have much time for the next 2 weeks. If someone wants to try it out, the training code should easily support DDP training by setting system.num_gpus=4.
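
Something along these lines should work, mirroring the command above (the exact backbone/dataset/batch_size overrides here are just an example):

$ python train_depth.py backbone=dinov2_l14 +backbone.return_multilayer=True dataset=navi_reldepth system.num_gpus=4 batch_size=8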

Linusnie commented 7 months ago

@mbanani thanks for the info! In fact, I just trained the dinov2_l14 model with 4 GPUs and a batch size of 2 (so the same total batch size as my other runs) and got results much closer to the paper:

2024-04-21 23:52:17.015 | INFO     | __mp_main__:train_model:290 - Scale-Invariant Final test loss       | 3.0536
2024-04-21 23:52:17.035 | INFO     | __mp_main__:train_model:292 - Final test SI d1         | 0.7429
2024-04-21 23:52:17.061 | INFO     | __mp_main__:train_model:292 - Final test SI d2         | 0.9041
2024-04-21 23:52:17.082 | INFO     | __mp_main__:train_model:292 - Final test SI d3         | 0.9526
2024-04-21 23:52:17.099 | INFO     | __mp_main__:train_model:292 - Final test SI rmse       | 0.0669

I suspect the issue has to do with how DistributedDataParallel merges the gradients, which leads to different behavior depending on the number of GPUs. As far as I know, DistributedDataParallel computes the mean of the gradients from each GPU, so multi-GPU training won't be equivalent to single-GPU training unless the loss uses a mean-reduce over the batch dimension.

Since the depth loss applies a square root after the mean calculation (here: https://github.com/mbanani/probe3d/blob/main/evals/utils/losses.py#L68), the merged gradients will differ depending on the number of GPUs, because sqrt((a+b)/2) != (sqrt(a)+sqrt(b))/2.
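
Here is a tiny toy example of the mismatch (made-up numbers, not the repo's actual loss, but the same sqrt-after-mean structure):

```python
import torch

# Per-sample squared errors for a batch of 4, split across 2 GPUs.
sq_err = torch.tensor([0.10, 0.40, 0.90, 1.60])

# Single GPU: one loss over the full batch, sqrt applied after the mean.
full_batch_loss = sq_err.mean().sqrt()            # sqrt(0.75) ~ 0.8660

# 2-GPU DDP: each rank takes the sqrt of its own sub-batch mean; averaging the
# per-rank losses yields the same gradient that DDP's gradient averaging produces.
rank_losses = torch.stack([sq_err[:2].mean().sqrt(), sq_err[2:].mean().sqrt()])
ddp_style_loss = rank_losses.mean()               # (0.5 + 1.1180) / 2 ~ 0.8090

print(full_batch_loss.item(), ddp_style_loss.item())  # the two are not equal
```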

mbanani commented 7 months ago

Thank you for running this comparison. That's an interesting catch regarding the loss.

Just to clarify my previous message, I was using 4 GPUs with a batch size of 8 (total = 32) for the original experiments. Once I get some time, I can add the mean-reduce and run some comparisons to get a good sense of how it impacts performance. I suspect it will change the absolute values but won't change the trends.
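
For reference, the kind of change I have in mind is making the outermost reduction a plain mean over per-sample terms, roughly like this (a hypothetical sketch of a generic scale-invariant log-depth loss, not the code currently in evals/utils/losses.py):

```python
import torch

def si_log_loss_per_sample(pred_log, target_log, valid_mask, lambd=0.5):
    """Scale-invariant log-depth loss with the sqrt taken per sample.

    Hypothetical sketch: because the final reduction is a plain mean over the
    batch, averaging per-GPU gradients in DDP matches the single-GPU,
    full-batch gradient (assuming equal per-GPU batch sizes).
    """
    losses = []
    for pred, target, mask in zip(pred_log, target_log, valid_mask):
        diff = (pred - target)[mask]
        losses.append(torch.sqrt((diff ** 2).mean() - lambd * diff.mean() ** 2))
    return torch.stack(losses).mean()
```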