huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

[Depth Anything V2] Incorrect model loading for metric depth estimation models #32890

Open bt2513 opened 3 weeks ago

bt2513 commented 3 weeks ago

System Info

Who can help?

@amyeroberts @NielsRogge

Information

Tasks

Reproduction

I just came back from vacation and noticed some issues raised with the Depth Anything V2 models (for metric depth estimation) on the Hub. They appear to happen because, when loading any of those models, the head's output activation function is a ReLU when it should be a Sigmoid:

# Code to reproduce the issue
from transformers import DepthAnythingForDepthEstimation

model = DepthAnythingForDepthEstimation.from_pretrained(
    "depth-anything/depth-anything-V2-metric-indoor-small-hf"
).to("cpu")

# Inspect the architecture, in particular the head's final activation
print(model)

The printed model is shown below. The output logits are incorrect because of the wrong activation function, and from what I observed when playing with a toy example, the output is also not scaled by max_depth:

DepthAnythingForDepthEstimation(
  (backbone): Dinov2Backbone(
    (embeddings): Dinov2Embeddings(
      (patch_embeddings): Dinov2PatchEmbeddings(
        (projection): Conv2d(3, 384, kernel_size=(14, 14), stride=(14, 14))
      )
      (dropout): Dropout(p=0.0, inplace=False)
    )
    (encoder): Dinov2Encoder(
      (layer): ModuleList(
        (0-11): 12 x Dinov2Layer(
          (norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
          (attention): Dinov2Attention(
            (attention): Dinov2SelfAttention(
              (query): Linear(in_features=384, out_features=384, bias=True)
              (key): Linear(in_features=384, out_features=384, bias=True)
              (value): Linear(in_features=384, out_features=384, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
            (output): Dinov2SelfOutput(
              (dense): Linear(in_features=384, out_features=384, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
          )
          (layer_scale1): Dinov2LayerScale()
          (drop_path): Identity()
          (norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
          (mlp): Dinov2MLP(
            (fc1): Linear(in_features=384, out_features=1536, bias=True)
            (activation): GELUActivation()
            (fc2): Linear(in_features=1536, out_features=384, bias=True)
          )
          (layer_scale2): Dinov2LayerScale()
        )
      )
    )
    (layernorm): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
  )
  (neck): DepthAnythingNeck(
    (reassemble_stage): DepthAnythingReassembleStage(
      (layers): ModuleList(
        (0): DepthAnythingReassembleLayer(
          (projection): Conv2d(384, 48, kernel_size=(1, 1), stride=(1, 1))
          (resize): ConvTranspose2d(48, 48, kernel_size=(4, 4), stride=(4, 4))
        )
        (1): DepthAnythingReassembleLayer(
          (projection): Conv2d(384, 96, kernel_size=(1, 1), stride=(1, 1))
          (resize): ConvTranspose2d(96, 96, kernel_size=(2, 2), stride=(2, 2))
        )
        (2): DepthAnythingReassembleLayer(
          (projection): Conv2d(384, 192, kernel_size=(1, 1), stride=(1, 1))
          (resize): Identity()
        )
        (3): DepthAnythingReassembleLayer(
          (projection): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1))
          (resize): Conv2d(384, 384, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
        )
      )
    )
    (convs): ModuleList(
      (0): Conv2d(48, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (1): Conv2d(96, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (2): Conv2d(192, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (3): Conv2d(384, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    )
    (fusion_stage): DepthAnythingFeatureFusionStage(
      (layers): ModuleList(
        (0-3): 4 x DepthAnythingFeatureFusionLayer(
          (projection): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
          (residual_layer1): DepthAnythingPreActResidualLayer(
            (activation1): ReLU()
            (convolution1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (activation2): ReLU()
            (convolution2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          )
          (residual_layer2): DepthAnythingPreActResidualLayer(
            (activation1): ReLU()
            (convolution1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (activation2): ReLU()
            (convolution2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          )
        )
      )
    )
  )
  (head): DepthAnythingDepthEstimationHead(
    (conv1): Conv2d(64, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (conv2): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (activation1): ReLU()
    (conv3): Conv2d(32, 1, kernel_size=(1, 1), stride=(1, 1))
    (activation2): ReLU()
  )
)
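
A quick way to confirm the wrong activation without reading the whole printout (the module path model.head.activation2 is taken from the dump above):

# should print "Sigmoid" for a metric checkpoint, but prints "ReLU"
print(type(model.head.activation2).__name__)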

Expected behavior

The final activation (activation2) should be a Sigmoid, and the logits should be scaled by max_depth:

DepthAnythingForDepthEstimation(
  (backbone): Dinov2Backbone(
    (embeddings): Dinov2Embeddings(
      (patch_embeddings): Dinov2PatchEmbeddings(
        (projection): Conv2d(3, 384, kernel_size=(14, 14), stride=(14, 14))
      )
      (dropout): Dropout(p=0.0, inplace=False)
    )
    (encoder): Dinov2Encoder(
      (layer): ModuleList(
        (0-11): 12 x Dinov2Layer(
          (norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
          (attention): Dinov2Attention(
            (attention): Dinov2SelfAttention(
              (query): Linear(in_features=384, out_features=384, bias=True)
              (key): Linear(in_features=384, out_features=384, bias=True)
              (value): Linear(in_features=384, out_features=384, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
            (output): Dinov2SelfOutput(
              (dense): Linear(in_features=384, out_features=384, bias=True)
              (dropout): Dropout(p=0.0, inplace=False)
            )
          )
          (layer_scale1): Dinov2LayerScale()
          (drop_path): Identity()
          (norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
          (mlp): Dinov2MLP(
            (fc1): Linear(in_features=384, out_features=1536, bias=True)
            (activation): GELUActivation()
            (fc2): Linear(in_features=1536, out_features=384, bias=True)
          )
          (layer_scale2): Dinov2LayerScale()
        )
      )
    )
    (layernorm): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
  )
  (neck): DepthAnythingNeck(
    (reassemble_stage): DepthAnythingReassembleStage(
      (layers): ModuleList(
        (0): DepthAnythingReassembleLayer(
          (projection): Conv2d(384, 48, kernel_size=(1, 1), stride=(1, 1))
          (resize): ConvTranspose2d(48, 48, kernel_size=(4, 4), stride=(4, 4))
        )
        (1): DepthAnythingReassembleLayer(
          (projection): Conv2d(384, 96, kernel_size=(1, 1), stride=(1, 1))
          (resize): ConvTranspose2d(96, 96, kernel_size=(2, 2), stride=(2, 2))
        )
        (2): DepthAnythingReassembleLayer(
          (projection): Conv2d(384, 192, kernel_size=(1, 1), stride=(1, 1))
          (resize): Identity()
        )
        (3): DepthAnythingReassembleLayer(
          (projection): Conv2d(384, 384, kernel_size=(1, 1), stride=(1, 1))
          (resize): Conv2d(384, 384, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
        )
      )
    )
    (convs): ModuleList(
      (0): Conv2d(48, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (1): Conv2d(96, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (2): Conv2d(192, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (3): Conv2d(384, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    )
    (fusion_stage): DepthAnythingFeatureFusionStage(
      (layers): ModuleList(
        (0-3): 4 x DepthAnythingFeatureFusionLayer(
          (projection): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
          (residual_layer1): DepthAnythingPreActResidualLayer(
            (activation1): ReLU()
            (convolution1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (activation2): ReLU()
            (convolution2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          )
          (residual_layer2): DepthAnythingPreActResidualLayer(
            (activation1): ReLU()
            (convolution1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (activation2): ReLU()
            (convolution2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
          )
        )
      )
    )
  )
  (head): DepthAnythingDepthEstimationHead(
    (conv1): Conv2d(64, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (conv2): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (activation1): ReLU()
    (conv3): Conv2d(32, 1, kernel_size=(1, 1), stride=(1, 1))
    (activation2): Sigmoid()
  )
)
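
For reference, here is a sketch of what the metric head should compute at the end of its forward pass, based on the expected behavior described above (the max_depth value here is illustrative; in practice it comes from the checkpoint config):

import torch

max_depth = 20.0  # illustrative indoor value; read from the config in practice
logits = torch.randn(1, 1, 518, 518)  # stand-in for the conv3 output
predicted_depth = torch.sigmoid(logits) * max_depth  # bounded to [0, max_depth]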

It looks like the model is incorrectly initialized even though the model config is as expected and the attributes max_depth and depth_estimation_type are correctly set. The updated DepthAnythingDepthEstimationHead code in modeling_depth_anything does not seem to be applied during model initialization - or maybe the issue lies elsewhere. Any thoughts on why this is happening @amyeroberts @NielsRogge? Please let me know, and I'm happy to open a PR with a fix!
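
For what it's worth, the config does read back with the expected metric-depth attributes (a quick check on the same checkpoint; the attribute names are the ones mentioned above):

from transformers import DepthAnythingConfig

config = DepthAnythingConfig.from_pretrained(
    "depth-anything/depth-anything-V2-metric-indoor-small-hf"
)
print(config.depth_estimation_type)  # expected: "metric"
print(config.max_depth)              # the metric scaling factor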

qubvel commented 3 weeks ago

Hi @bt2513, thanks for the issue! I believe the "metric depth" feature for DepthAnything is not included in the 4.44.0 release, but you can use it if you update transformers to the latest main:

pip install git+https://github.com/huggingface/transformers
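
To confirm which version is active after installing from source, you can check (a dev version newer than 4.44.0 is expected):

import transformers
print(transformers.__version__)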

bt2513 commented 3 weeks ago

Thanks @qubvel, it does indeed work when I install transformers this way! When will the "metric depth" feature for DepthAnything be released? Is there anything I can do to help with this? I'm asking because if people try to use the model by following the current instructions on the model card, they will run into the same issue as I did.

qubvel commented 3 weeks ago

Thanks for your help! We just have to wait a bit: in a week or two there will be a new 4.45.0 release with this feature included. One more thing we can do in the meantime is add the minimum transformers version to the model card on the Hub - would you like to open a PR there?

bt2513 commented 3 weeks ago

Sure! Would you have an example of a model card showing this that I can use as a reference?

Or should I simply edit the model card title to something like Depth Anything V2 (Fine-tuned for Metric Depth Estimation) - Transformers Version (4.45.0+)?

Or alternatively add a short note below the title, such as: "Prerequisite: transformers version 4.45.0 or later. Alternatively, use the latest transformers main with pip install git+https://github.com/huggingface/transformers".

qubvel commented 3 weeks ago

@bt2513 yes, something like this should be good:

...

## Requirements: 

`transformers>=4.45.0` 

Alternatively, use `transformers` latest version installed from the source:

```
pip install git+https://github.com/huggingface/transformers
```

...