huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Question about fine-tuning BeitForSemanticSegmentation model #15197

Closed KombangkoeDias closed 2 years ago

KombangkoeDias commented 2 years ago

From the documentation, it says that the logits shape will be (batch_size, num_labels, height/4, width/4). I assume that the logits are the output masks of the model (since I'm doing segmentation). How do I convert outputs of shape (height/4, width/4) back to the original image's shape, i.e. the shape before the image was resized to (height, width)?

-> Regarding `# logits are of shape (batch_size, num_labels, height/4, width/4)`: I realized that the input image is resized to (height, width) by the BeitFeatureExtractor object, where height and width are the constant values defined in the BeitFeatureExtractor's config. This means the output shapes are not the original image's shape, but rather resized-shape/4.
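
A minimal sketch to check this behaviour (assuming torch, PIL, a hypothetical local `example.jpg`, and the `microsoft/beit-base-finetuned-ade-640-640` checkpoint purely as an illustrative choice):

```python
import torch
from PIL import Image
from transformers import BeitFeatureExtractor, BeitForSemanticSegmentation

# Illustrative checkpoint; any BeitForSemanticSegmentation checkpoint behaves the same way.
checkpoint = "microsoft/beit-base-finetuned-ade-640-640"
feature_extractor = BeitFeatureExtractor.from_pretrained(checkpoint)
model = BeitForSemanticSegmentation.from_pretrained(checkpoint)

image = Image.open("example.jpg")  # hypothetical local image of arbitrary size

# The feature extractor resizes to the constant (height, width) from its config,
# regardless of the original image size.
inputs = feature_extractor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # (1, 3, height, width) from the config

with torch.no_grad():
    outputs = model(**inputs)

print(outputs.logits.shape)  # (1, num_labels, height/4, width/4)
```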

NielsRogge commented 2 years ago

Hi,

You can take a look at this notebook: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/SegFormer/Fine_tune_SegFormer_on_custom_dataset.ipynb

It shows how to fine-tune SegFormer on a custom dataset (the process for BeiT is equivalent). SegFormer also outputs logits of shape (batch_size, num_labels, height/4, width/4), which are then interpolated to the original size of the image.
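
A minimal sketch of that interpolation step (assuming torch, 150 labels as in ADE20K, and a hypothetical 480x640 original image):

```python
import torch

# Hypothetical values: logits from BeitForSemanticSegmentation for a 640x640 model input,
# original image of size 480x640 (height x width).
logits = torch.randn(1, 150, 160, 160)   # (batch_size, num_labels, height/4, width/4)
original_size = (480, 640)               # (original_height, original_width)

# Upsample the logits to the original image resolution.
upsampled = torch.nn.functional.interpolate(
    logits, size=original_size, mode="bilinear", align_corners=False
)
print(upsampled.shape)  # torch.Size([1, 150, 480, 640])

# Per-pixel predicted class at the original resolution.
segmentation = upsampled.argmax(dim=1)[0]  # (480, 640)
```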

Note that we are in the process of determining generic outputs for semantic segmentation models, so it might be that in the future, the logits will automatically have the same size as the original pixel_values.

KombangkoeDias commented 2 years ago

Thank you.