Ah, do I need to use navi_reldepth?
Using dinov2_b14, this gives me the following, which is close to (though slightly below) the results from the paper:
2024-04-17 06:09:04.545 | INFO | __main__:train_model:285 - Scale-Invariant Final test loss | 4.6468
2024-04-17 06:09:04.545 | INFO | __main__:train_model:287 - Final test SI d1 | 0.5987
2024-04-17 06:09:04.545 | INFO | __main__:train_model:287 - Final test SI d2 | 0.8322
2024-04-17 06:09:04.545 | INFO | __main__:train_model:287 - Final test SI d3 | 0.9159
2024-04-17 06:09:04.546 | INFO | __main__:train_model:287 - Final test SI rmse | 0.0948
Hi, I am facing similar issues. Here is what I got training the probes on dino_b16, dinov2_b14, and dinov2_l14, all with dataset=navi_reldepth.
dino_b16:
2024-04-19 19:38:14.764 | INFO | __main__:train_model:290 - Scale-Invariant Final test loss | 5.9778
2024-04-19 19:38:14.787 | INFO | __main__:train_model:292 - Final test SI d1 | 0.4684
2024-04-19 19:38:14.827 | INFO | __main__:train_model:292 - Final test SI d2 | 0.7336
2024-04-19 19:38:14.858 | INFO | __main__:train_model:292 - Final test SI d3 | 0.8570
2024-04-19 19:38:14.889 | INFO | __main__:train_model:292 - Final test SI rmse | 0.1304
dinov2_b14:
2024-04-21 16:33:16.032 | INFO | __main__:train_model:290 - Scale-Invariant Final test loss | 4.4591
2024-04-21 16:33:16.049 | INFO | __main__:train_model:292 - Final test SI d1 | 0.6159
2024-04-21 16:33:16.065 | INFO | __main__:train_model:292 - Final test SI d2 | 0.8415
2024-04-21 16:33:16.086 | INFO | __main__:train_model:292 - Final test SI d3 | 0.9220
2024-04-21 16:33:16.116 | INFO | __main__:train_model:292 - Final test SI rmse | 0.0915
dinov2_l14:
2024-04-21 18:31:51.140 | INFO | __main__:train_model:290 - Scale-Invariant Final test loss | 4.1440
2024-04-21 18:31:51.164 | INFO | __main__:train_model:292 - Final test SI d1 | 0.6495
2024-04-21 18:31:51.185 | INFO | __main__:train_model:292 - Final test SI d2 | 0.8603
2024-04-21 18:31:51.208 | INFO | __main__:train_model:292 - Final test SI d3 | 0.9307
2024-04-21 18:31:51.231 | INFO | __main__:train_model:292 - Final test SI rmse | 0.0840
All of them are a bit worse than Table 2 in the paper. Any idea what could be missing?
For reference, this is the config it prints for dinov2_l14:
2024-04-21 17:33:14.088 | INFO | __main__:train_model:226 - Config:
optimizer:
  probe_lr: 0.0005
  model_lr: 0.0
  n_epochs: 10
  warmup_epochs: 1.5
backbone:
  _target_: evals.models.dino.DINO
  dino_name: dinov2
  model_name: vitl14
  output: dense-cls
  layer: -1
  return_multilayer: true
dataset:
  _target_: evals.datasets.navi.NAVI
  path: /storage/user/hael/data/navi_v1
  image_mean: imagenet
  augment_train: true
  bbox_crop: true
  relative_depth: true
probe:
  _target_: evals.models.probes.DepthHead
  min_depth: 0.001
  max_depth: 10
  head_type: dpt
  prediction_type: bindepth
  hidden_dim: 512
  kernel_size: 3
system:
  random_seed: 8
  num_gpus: 1
  port: 12355
note: ''
batch_size: 8
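(Side note for anyone mapping this config to code: the _target_ entries are standard Hydra instantiation targets, so the printed config can be rebuilt into objects roughly as in the sketch below. The file name config_snapshot.yaml is hypothetical, and the real training script likely passes extra runtime arguments, e.g. feature dimensions for the probe; this is just the standard Hydra pattern, not the repo's entry point.)

```python
from omegaconf import OmegaConf
from hydra.utils import instantiate

# Hypothetical dump of the config printed above.
cfg = OmegaConf.load("config_snapshot.yaml")

# Hydra resolves each _target_ and passes the remaining keys as kwargs,
# assuming the constructors accept the printed keys.
backbone = instantiate(cfg.backbone)  # evals.models.dino.DINO (dinov2 / vitl14)
probe = instantiate(cfg.probe)        # evals.models.probes.DepthHead
dataset = instantiate(cfg.dataset)    # evals.datasets.navi.NAVI with relative_depth=true
```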
Hi everyone! Thanks for raising the issue. I suspect the difference is due to the batch size: I had trained those models on 4 GPUs with 4x the batch size. While cleaning the code, I reverted to 1 GPU to make the code more usable; my apologies that this wasn't clear in the current code base.
I'll try to verify this once I can, but unfortunately, I won't have much time for the next 2 weeks. If someone wants to try it out, the training code should easily support DDP training by setting system.num_gpus=4.
@mbanani thanks for the info! In fact, I just trained the dinov2_l14 model with 4 GPUs and a batch size of 2 (so the same total batch size as my other runs) and got results more similar to the paper:
2024-04-21 23:52:17.015 | INFO | __mp_main__:train_model:290 - Scale-Invariant Final test loss | 3.0536
2024-04-21 23:52:17.035 | INFO | __mp_main__:train_model:292 - Final test SI d1 | 0.7429
2024-04-21 23:52:17.061 | INFO | __mp_main__:train_model:292 - Final test SI d2 | 0.9041
2024-04-21 23:52:17.082 | INFO | __mp_main__:train_model:292 - Final test SI d3 | 0.9526
2024-04-21 23:52:17.099 | INFO | __mp_main__:train_model:292 - Final test SI rmse | 0.0669
I suspect the issue has to do with how DistributedDataParallel merges the gradients, resulting in different behavior depending on the number of GPUs. As far as I know, DistributedDataParallel computes the mean of the gradients from each GPU. This means multi-GPU training won't be equivalent to single-GPU training unless your loss uses a mean-reduce over the batch dimension.
Since the depth loss applies a square root after the mean calculation (here: https://github.com/mbanani/probe3d/blob/main/evals/utils/losses.py#L68), the merged gradients will differ depending on the number of GPUs, because sqrt((a+b)/2) != (sqrt(a)+sqrt(b))/2.
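A toy illustration of that non-linearity (made-up numbers, not taken from the repo): averaging per-rank losses after the square root is not the same as taking the square root of the full-batch mean.

```python
import numpy as np

# Made-up per-sample squared errors for a batch of 8, split across 2 GPUs.
errs_gpu0 = np.array([0.04, 0.09, 0.16, 0.25])
errs_gpu1 = np.array([0.01, 0.36, 0.49, 0.64])

# Single-GPU behavior: sqrt of the mean over the full batch.
single_gpu = np.sqrt(np.concatenate([errs_gpu0, errs_gpu1]).mean())

# DDP-style behavior: each rank takes the sqrt of its local mean, and the
# effective loss/gradient is the average of the rank-local values.
ddp_style = 0.5 * (np.sqrt(errs_gpu0.mean()) + np.sqrt(errs_gpu1.mean()))

print(single_gpu, ddp_style)  # ~0.505 vs ~0.490 -- unequal because sqrt is non-linear
```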
Thank you for running this comparison. That's an interesting catch regarding the loss.
Just to clarify my previous message, I was using 4 GPUs with a batch size of 8 (total = 32) for the original experiments. Once I get some time, I can add the mean-reduce and run some comparisons to get a good sense of how they impact performance. I suspect they will change the absolute values but won't change the trends.
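For reference, a rough sketch of what such a mean-reduce could look like: compute the scale-invariant log RMSE (including the square root) per image, and only then average over the batch, so that DDP's per-rank gradient averaging matches single-GPU training. This is an illustrative rewrite under those assumptions, not the exact function in evals/utils/losses.py.

```python
import torch

def si_log_rmse_mean_reduce(pred, target, valid_mask, lam=0.85, eps=1e-6):
    """Illustrative scale-invariant log RMSE with a per-image sqrt and a
    mean-reduce over the batch (not the repo's exact implementation)."""
    log_diff = torch.log(pred.clamp(min=eps)) - torch.log(target.clamp(min=eps))
    log_diff = log_diff.flatten(1)           # (B, N) pixels per image
    mask = valid_mask.flatten(1).float()     # ignore invalid depth pixels
    n = mask.sum(dim=1).clamp(min=1)

    mean_sq = (mask * log_diff ** 2).sum(dim=1) / n
    mean = (mask * log_diff).sum(dim=1) / n

    per_image = torch.sqrt((mean_sq - lam * mean ** 2).clamp(min=eps))  # sqrt per image
    return per_image.mean()                  # mean over the batch -> DDP-friendly
```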
Hello, thank you very much for this excellent work!
I am trying to reproduce the paper results on Depth Estimation using NAVI and DINO B16:
This is different from the corresponding row in Table 2 of the paper: for example, the NAVI SI RMSE for DINO B16 is 0.1043 in the paper vs. 0.0138 in my run.
Am I misinterpreting the results?
Thank you!