microsoft / X-Decoder

[CVPR 2023] Official Implementation of X-Decoder for generalized decoding for pixel, image and language
Apache License 2.0

The loss of referring segmentation #8

Closed jshilong closed 1 year ago

jshilong commented 1 year ago

Thanks for the great work,

In section 4.1, you mentioned that the model was pre-trained on panoptic segmentation, image-text pairs (itp), and referring segmentation. I can't find the details of how you use `Referring Segmentation` data in section 3.4. Would you mind providing more details about the `Referring Segmentation` data loss in the pre-training phase? Or did I miss it?

Thanks

jshilong commented 1 year ago

It seems it is essentially a binary classification problem

MaureenZOU commented 1 year ago

Thanks for your interest in our work, and for bringing up the problem that we do not give details for referring segmentation.

  1. Data preparation: We use all the seg-text pairs from the refcoco(g/+) datasets and exclude the validation set. In addition, for images that do not have referring segmentation ground truth, we use instance segmentation as labels (e.g. person -> all person instances).
  2. Loss function: For each image with ground truth, we do Hungarian matching between predictions and ground truth. Only the text-to-image loss is applied on referring segmentation. For each text, we train the highest-scoring mask proposal toward the ground truth.
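
For readers who want something concrete, here is a rough sketch of one way the last sentence of the loss description could be read: score every mask proposal against each text, pick the top-scoring proposal, and apply a mask loss between that proposal and the text's ground-truth mask. This is not taken from the X-Decoder code. The function names, the cosine-similarity scoring, and the BCE + dice mask loss are all assumptions made for illustration, and the Hungarian matching step is left out.

```python
import numpy as np


def select_proposals(text_emb, query_emb):
    """Return, for each text, the index of its highest-scoring mask proposal.

    text_emb:  (T, D) text embeddings (hypothetical input)
    query_emb: (Q, D) mask-proposal embeddings (hypothetical input)
    Scoring by cosine similarity is an assumption of this sketch.
    """
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    scores = t @ q.T  # (T, Q) text-to-proposal similarity
    return scores.argmax(axis=1)


def mask_loss(mask_logits, gt_mask, eps=1.0):
    """BCE + dice loss for a single mask (an assumed choice of mask loss)."""
    x, z = mask_logits, gt_mask
    # Numerically stable binary cross-entropy on logits.
    bce = (np.maximum(x, 0) - x * z + np.log1p(np.exp(-np.abs(x)))).mean()
    p = 1.0 / (1.0 + np.exp(-x))
    dice = 1.0 - (2.0 * (p * z).sum() + eps) / (p.sum() + z.sum() + eps)
    return float(bce + dice)


def referring_loss(text_emb, query_emb, mask_logits, gt_masks):
    """Average mask loss over texts, each using its top-scoring proposal.

    mask_logits: (Q, H, W) predicted mask logits, one per proposal
    gt_masks:    (T, H, W) binary ground-truth masks, one per text
    """
    idx = select_proposals(text_emb, query_emb)
    losses = [mask_loss(mask_logits[q], gt_masks[t]) for t, q in enumerate(idx)]
    return float(np.mean(losses))
```

Only the selected proposal receives a mask loss for a given text in this sketch; whether and how the remaining proposals are supervised is not specified in the reply above.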

yxchng commented 1 year ago

@MaureenZOU

  1. Just to clarify: does "refcoco(g/+)" mean only refcoco+ and refcocog, with refcoco excluded?
  2. What does this mean: "In addition, those images that do not have referring seg ground truth, we use instance segmentation as labels (e.g. person -> all person instance)"?