cc @rafaelpadilla would love it if you could take a look!
Hi @alen-smajic,
Thank you for reporting this issue. :)
The code you showed is not working because of `out_indices=(-2, -1)`. Try replacing it with:

```python
backbone_config = FocalNetConfig(out_indices=(1, 2, 3, 4))
```

For the backbone, the dinov2 model is not supported. These are the supported ones: `BitConfig`, `ConvNextConfig`, `ConvNextV2Config`, `DinatConfig`, `FocalNetConfig`, `MaskFormerSwinConfig`, `NatConfig`, `ResNetConfig`, `SwinConfig`, `TimmBackboneConfig`.
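For completeness, here is a minimal sketch (not from the original thread) of wiring such a corrected backbone config into Mask2Former; everything else is left at library defaults:

```python
from transformers import (
    FocalNetConfig,
    Mask2FormerConfig,
    Mask2FormerForUniversalSegmentation,
)

# Positive stage indices: (1, 2, 3, 4) selects the four FocalNet stages,
# unlike the negative indices that caused the failure above.
backbone_config = FocalNetConfig(out_indices=(1, 2, 3, 4))
config = Mask2FormerConfig(backbone_config=backbone_config)
model = Mask2FormerForUniversalSegmentation(config)  # randomly initialized weights
```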
Hi @rafaelpadilla,
thanks for the quick help. You are totally right, the out_indices attribute was not set correctly.
I have in fact managed to attach a DINOv2 backbone to the Mask2Former model, and it seems to work :)
```python
import requests
from PIL import Image
import torch
from transformers import (
    AutoImageProcessor,
    Dinov2Config,
    Dinov2Model,
    Mask2FormerConfig,
    Mask2FormerForUniversalSegmentation,
)

# Store Dinov2 weights locally
dinov2_backbone_model = Dinov2Model.from_pretrained("facebook/dinov2-base", out_indices=[6, 8, 10, 12])
torch.save(dinov2_backbone_model.state_dict(), "dinov2-base.pth")

# Create Mask2Former config with Dinov2 backbone
image_processor = AutoImageProcessor.from_pretrained("facebook/mask2former-swin-tiny-cityscapes-semantic")
model_config = Mask2FormerConfig.from_pretrained("facebook/mask2former-swin-tiny-cityscapes-semantic")
model_config.backbone_config = Dinov2Config.from_pretrained("facebook/dinov2-base", out_indices=(6, 8, 10, 12))

# Instantiate Mask2Former model with Dinov2 backbone (random weights)
model = Mask2FormerForUniversalSegmentation(model_config)

# Load Dinov2 weights into Mask2Former backbone
dinov2_backbone = model.model.pixel_level_module.encoder
dinov2_backbone.load_state_dict(torch.load("dinov2-base.pth"))

url = "https://huggingface.co/datasets/hf-internal-testing/fixtures_ade20k/resolve/main/ADE_val_00000001.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = image_processor(image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

results = image_processor.post_process_semantic_segmentation(outputs, target_sizes=[image.size[::-1]])
```
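A hedged follow-up note (not part of the original post): `post_process_semantic_segmentation` returns one `(height, width)` tensor of predicted class ids per image, so the result can be sanity-checked like this:

```python
semantic_map = results[0]  # torch.Tensor of shape (height, width) holding class ids
print(semantic_map.shape)
print(semantic_map.unique())  # the class ids actually predicted for this image
```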
Hi @alen-smajic,
Glad to see the problem was solved :)
I will close this issue for now. Feel free to re-open it in case you encounter any related concerns in the future.
Hi @alen-smajic, thanks for the snippet, I managed to use Dinov2 as a backbone for Mask2Former. Did you try to fine-tune it on your own data? I am experiencing very low performance. Could the reason be that the authors of Dinov2 used ViT-Adapter? Any additional suggestions would be much appreciated :)
From their notebook here I see they use ViT-Adapter (see the model summary, which includes ViTAdapter).
I'd be interested to know if anyone has implemented Mask2Former with Dinov2 and ViT-Adapter with Hugging Face modules rather than mmseg.
System Info

transformers version: 4.33.1

Who can help?

@amyeroberts

Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
I would like to combine the DINOv2 backbone model with the Mask2Former model for semantic segmentation. Even though the official documentation states that Mask2Former only works with a Swin Transformer backbone, I stumbled upon issue #24244.
In PR #24532, multi-backbone support was implemented by @amyeroberts, and some example code was provided. So far the model instantiation works; however, when I try to run any data through the model I get an error:
Script to reproduce:
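(The script itself was not captured in this excerpt. Below is a hedged reconstruction based on @rafaelpadilla's reply above, which points at `out_indices=(-2, -1)` in a `FocalNetConfig` backbone; the image handling mirrors the working snippet earlier in the thread.)

```python
import requests
import torch
from PIL import Image
from transformers import (
    AutoImageProcessor,
    FocalNetConfig,
    Mask2FormerConfig,
    Mask2FormerForUniversalSegmentation,
)

# Reconstruction: negative stage indices, per the reply above
backbone_config = FocalNetConfig(out_indices=(-2, -1))
config = Mask2FormerConfig(backbone_config=backbone_config)
model = Mask2FormerForUniversalSegmentation(config)  # instantiation works

image_processor = AutoImageProcessor.from_pretrained("facebook/mask2former-swin-tiny-cityscapes-semantic")
url = "https://huggingface.co/datasets/hf-internal-testing/fixtures_ade20k/resolve/main/ADE_val_00000001.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = image_processor(image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)  # the forward pass is where the error appears
```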
The error I get:
Expected behavior

If working properly, the code above should output the model predictions (of class Mask2FormerModelOutput), produced by running the input image through the new backbone and then forwarding the feature maps to the Mask2Former model.
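(A hedged sketch, not from the original issue: if the forward pass succeeds, the universal-segmentation output exposes per-query logits that the image processor can post-process.)

```python
outputs = model(**inputs)
print(outputs.class_queries_logits.shape)  # (batch_size, num_queries, num_labels + 1)
print(outputs.masks_queries_logits.shape)  # (batch_size, num_queries, height / 4, width / 4)
```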