nickgkan / butd_detr

Code for the ECCV22 paper "Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds"

Concerning the evaluation code #41

Closed Daniellli closed 11 months ago

Daniellli commented 11 months ago
Hi, I am a little confused by the following lines:

https://github.com/nickgkan/butd_detr/blob/10570e0b6826d4a236b18c2c8fac5903866e1c60/src/grounding_evaluator.py#L197-L201

Why is the GT involved in parsing the prediction?

_Originally posted by @Daniellli in https://github.com/nickgkan/butd_detr/issues/40#issuecomment-1695618035_

ayushjain1144 commented 11 months ago

Hi, positive_map comes from a model trained to predict spans -- not from the ground truth (refer to this line).

Daniellli commented 11 months ago

yeah, I see.

But the input of this pretrained model includes the name of the referred target -- does that make sense?

ayushjain1144 commented 11 months ago

No, that pretrained model just takes the text utterance as input (Here)

Daniellli commented 11 months ago

I am interested in understanding the generation process of the predicted positive map. I noticed that the values are constrained within the range of {0, 1}. Could you kindly provide insights into the methodology used to generate this data? I would greatly appreciate a detailed explanation of the process. Thank you.

ayushjain1144 commented 11 months ago

you can think of positive map as a probability distribution over the text tokens. Here is the relevant code to look at: https://github.com/nickgkan/butd_detr/blob/main/src/text_cls.py#L354-L381. For eg. if the sentence is " Basketball on sofa" and the root word is basketball, the positive map could look like [0.5, 0.5, 0.0, 0.0, ......, 0.0] assuming that the tokenizer split the basketball into two tokens "basket" and "ball" (and thus they divide among them the total weight of 1.0).
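The weight-splitting idea from the example above can be sketched in a few lines of plain Python. This is a hypothetical helper (the real construction lives in `src/text_cls.py`); the tokenization of "Basketball on sofa" into `["basket", "ball", "on", "sofa"]` is assumed for illustration.

```python
def build_positive_map(tokens, root_token_indices):
    """Distribute a total weight of 1.0 evenly over the tokens
    that make up the root word; all other tokens get 0.0."""
    pos_map = [0.0] * len(tokens)
    weight = 1.0 / len(root_token_indices)
    for i in root_token_indices:
        pos_map[i] = weight
    return pos_map

# Assumed tokenization: "Basketball on sofa" -> ["basket", "ball", "on", "sofa"],
# where the root word "basketball" spans token indices 0 and 1.
tokens = ["basket", "ball", "on", "sofa"]
print(build_positive_map(tokens, [0, 1]))  # [0.5, 0.5, 0.0, 0.0]
```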

Sr3d/nr3d/scanrefer generally tell us the class of the ground truth object, and we use that and simple string matching to determine the location of root word in sentence. This is the relevant code: https://github.com/nickgkan/butd_detr/blob/main/src/text_cls.py#L304-L323. This becomes the ground truth for the span prediction model.
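The string matching step could look something like the sketch below. `find_root_span` is a hypothetical helper, not the repo's actual function; the real logic is in `src/text_cls.py#L304-L323`.

```python
def find_root_span(utterance, class_name):
    """Return the (start, end) character span of the ground-truth
    class name in the utterance, or None if it does not appear
    verbatim (case-insensitive match)."""
    start = utterance.lower().find(class_name.lower())
    if start == -1:
        return None
    return start, start + len(class_name)

span = find_root_span("the basketball on the sofa", "basketball")
print(span)  # (4, 14)
```

The character span can then be mapped onto token indices to produce the ground-truth positive map for training the span prediction model.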

Daniellli commented 11 months ago

Hi, thank you for your detailed answers.

I found that the value range of the predicted span is {0, 1}, i.e., only 0 and 1 appear, but the span prediction model is a regression model (https://github.com/nickgkan/butd_detr/blob/10570e0b6826d4a236b18c2c8fac5903866e1c60/src/text_cls.py#L394C1-L399C10). Could you share your post-processing?

Moreover, this is wonderful work; I got a lot of inspiration from it, especially the span prediction model. If possible, could you share the pretrained span prediction model weights or instructions for the training process? Both would be even better.

thank you for your attention.

But I still have some more questions.

ayushjain1144 commented 11 months ago

> I found the value range of predicted span is {0,1}, namely, only 0 and 1 appeared, but the span prediction model is a regression model

Most of them are 0/1 because the tokenizer didn't split the root word, but I am quite sure you can find positive maps with values other than 0/1.

Also, it's not regression per se: the model predicts logits and is trained with binary cross-entropy with logits (which applies a sigmoid internally): https://github.com/nickgkan/butd_detr/blob/10570e0b6826d4a236b18c2c8fac5903866e1c60/src/text_cls.py#L94-L96

This is the post-processing: https://github.com/nickgkan/butd_detr/blob/10570e0b6826d4a236b18c2c8fac5903866e1c60/src/text_cls.py#L113-L122; it simply thresholds the logits at 0 (since the model's output ranges over -inf to inf, values above 0 have probability > 0.5 after applying a sigmoid).
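The equivalence between thresholding logits at 0 and thresholding sigmoid probabilities at 0.5 can be checked with a minimal sketch (the variable names below are illustrative, not taken from the repo):

```python
import math

def sigmoid(x):
    """Map a logit in (-inf, inf) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

logits = [-2.3, 0.7, 0.0, 4.1]

# Thresholding raw logits at 0 ...
binary_from_logits = [1 if l > 0 else 0 for l in logits]
# ... gives the same result as applying sigmoid and thresholding at 0.5.
binary_from_probs = [1 if sigmoid(l) > 0.5 else 0 for l in logits]

print(binary_from_logits == binary_from_probs)  # True
```

This works because sigmoid is monotonically increasing and sigmoid(0) = 0.5, so the two thresholds pick out exactly the same logits.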

> If possible, could you further share the pretrained span prediction model weight or the training process instruction?

Edit: We have instructions for training the span prediction model in the README, but not the weights. It should be straightforward to train (less than half an hour of training time).

Daniellli commented 10 months ago

Hi, sorry for bothering you again. May I ask why the predicted span, rather than the GT span, is used as the supervision signal during the training process?

thank you for your time