nickgkan / butd_detr

Code for the ECCV22 paper "Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds"

Soft token prediction #7

Closed ZheningHuang closed 1 year ago

ZheningHuang commented 1 year ago

Hi, I have a general question regarding the soft token prediction used.

I wonder how the ground truth for soft token prediction is obtained. I know that in the case of MDETR, this is obtained via a complicated data combination process, since most datasets do not provide fine-level alignment between each text token and the boxes. I reckon this fine-level alignment is not easily obtained in 3D either, so how is this level of ground truth obtained? I cannot find this part in the paper.

Best, Zhening

nickgkan commented 1 year ago

Hi, there are two steps.

First, we have the ground-truth phrase/object class in the annotations, so all we have to do is locate it in the input utterance. We simply do string matching to find it. For SR3D this works perfectly well; for Nr3D and ScanRefer it works most of the time (>90%).

Then, based on the found alignments, we train a text classifier (one per dataset) to find the span given the utterance. This generates pseudo-ground-truth for all utterances. We then use these spans to train and test our model.
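
Roughly, the first step looks like the sketch below: match the annotated class name against the utterance and mark the overlapping sub-word tokens as the soft-token target. The tokenizer choice, names and shapes here are just for illustration, not the exact code in the repo.

```python
# Illustrative sketch: locate the annotated target class name in the
# utterance by string matching and mark the overlapping sub-word tokens
# as the soft-token ground truth. Tokenizer choice and names are
# assumptions, not the repo's exact code.
import numpy as np
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

def build_positive_map(utterance, target_name, num_tokens=256):
    """Return a (num_tokens,) distribution over utterance tokens, or None."""
    start = utterance.lower().find(target_name.lower())
    if start < 0:
        return None  # string matching failed; left to the text classifier
    end = start + len(target_name)

    enc = tokenizer(utterance, return_offsets_mapping=True, truncation=True)
    positive = np.zeros(num_tokens, dtype=np.float32)
    for tok_idx, (tok_start, tok_end) in enumerate(enc["offset_mapping"]):
        if tok_idx >= num_tokens:
            break
        # mark every sub-word token that overlaps the matched character span
        if tok_start < end and tok_end > start and tok_end > tok_start:
            positive[tok_idx] = 1.0
    if positive.sum() > 0:
        positive /= positive.sum()  # normalize to a distribution over tokens
    return positive

# e.g. build_positive_map("the chair next to the desk", "chair")
```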

Hiusam commented 1 year ago

Hi, I have some follow-up questions on the same topic.

  1. The positive map stores only the span of the target object, no anchor objects. Is that right?
  2. The procedure for target span generation is:
    • construct pairs of utterance <-> span distribution (256-dim logits) by string matching the ground-truth class label of the target against the utterance (I guess you must only use the training split)
    • use these pairs to train a text classifier (the input is an utterance, the output is the 256-dim logits)
    • use this text classifier to regenerate the span distribution for all the data (including training and testing splits)
    • use these generated span distributions (the "sr3d_pred_spans.json" file) for both training and testing in your method.
  3. If I am correct about "2", the accuracy of the target span distribution is an important factor in your model's performance.
nickgkan commented 1 year ago

You can actually find the code for this in src/text_cls.py.

  1. Yes. Although for SR3D, because it's synthetic and string matching works perfectly, we can apply a very simple heuristic to find the anchor span as well.
  2. Pairs of utterance <-> distribution over actual tokens. The rest is correct.
  3. The accuracy is almost 100%. It is measured on the val set, and only on the sentences where string matching can find the target span, which is actually the majority of utterances.
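
For reference, a simplified sketch of such a per-token span classifier is shown below; the exact architecture in src/text_cls.py differs, and the model choice and names here are only illustrative assumptions.

```python
# Rough sketch of a span classifier that maps an utterance to per-token
# logits for the target span. Model choice, head design and names are
# assumptions; see src/text_cls.py for the actual implementation.
import torch
import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizerFast

class SpanClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("roberta-base")
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.head(hidden).squeeze(-1)  # (batch, num_tokens) logits

# Training pairs come from the utterances where string matching succeeded;
# the trained classifier then predicts spans for every utterance
# (e.g. the "sr3d_pred_spans.json" file), including the failure cases.
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = SpanClassifier()
enc = tokenizer(["the chair next to the desk"], return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(enc["input_ids"], enc["attention_mask"]).sigmoid()
```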
Hiusam commented 1 year ago

Greetings and thank you for responding.

After reviewing the code, I must say that you did an excellent job!

I have a minor inquiry: what was the reason for appending "not mentioned" to the utterance fed to the network? Since the "not mentioned" span is never assigned as GT, it's highly improbable that the network will ever predict it, i.e., whether you append "not mentioned" or not, the inference results should remain the same. I guess you did this to correspond with the input of BUTD-DETR?

ayushjain1144 commented 1 year ago

Hi, "not-mentioned" is used to match all queries which do not match to any ground truths. For more details, please check Section-3.4 of the paper.
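
As a simplified illustration (not the actual code; names and shapes below are assumptions), queries matched to a ground-truth box are supervised with the target's token span, while every remaining query gets a target distribution over the appended "not mentioned" tokens:

```python
# Illustrative sketch: matched queries copy the target's token span;
# all other queries are supervised to point at the appended
# "not mentioned" tokens. Names and shapes are assumptions.
import numpy as np

def build_query_targets(target_span, not_mentioned_span, matched_query_ids,
                        num_queries=256, num_tokens=256):
    """target_span / not_mentioned_span: lists of token indices."""
    targets = np.zeros((num_queries, num_tokens), dtype=np.float32)
    for q in range(num_queries):
        span = target_span if q in matched_query_ids else not_mentioned_span
        targets[q, span] = 1.0 / len(span)  # uniform mass over the span
    return targets

# e.g. if the target phrase occupies token positions 2-3 and "not mentioned"
# occupies positions 10-11, with query 0 matched to the ground-truth box:
# targets = build_query_targets([2, 3], [10, 11], matched_query_ids={0})
```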

Hiusam commented 1 year ago

> Hi, "not-mentioned" is used to match all queries which do not match to any ground truths. For more details, please check Section-3.4 of the paper.

Hi, I see that. But I am asking about src/text_cls.py; am I missing something?

ayushjain1144 commented 1 year ago

Oh I see, right, there it's just done to match BUTD-DETR format.