nickgkan / butd_detr

Code for the ECCV22 paper "Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds"

Word_idx and Token_idx #17

Closed soham-joshi closed 1 year ago

soham-joshi commented 1 year ago

Hi @ayushjain1144, could you clarify what the numpy arrays word_idx and token_idx in train_dist_mod.py are created for? (Reference: line 205.)

Thanks!

soham-joshi commented 1 year ago

I had a follow-up query: what does end_points['last_sem_cls_scores'] (shape (B, 256, 256)) in the same function represent?

nickgkan commented 1 year ago

We contrast each token in the sentence with each query. The predictions for one piece of text are the predictions of the queries, with confidence equal to the similarity of the projected queries and respective tokens.

For object detection prompts, we collect class scores and predictions by looping over the classes (word_idx) and aggregating the scores of the corresponding tokens (token_idx).

Then we use these to fill the semantic scores for the last prediction head, which are fed to the evaluator.
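A minimal sketch of the aggregation described above (not the repo's exact code; the shapes and token positions here are hypothetical, and we assume one token per class name for simplicity):

```python
import numpy as np

# Assumed shapes: B batches, Q queries, T sentence tokens, C detection classes.
B, Q, T, C = 2, 4, 10, 3

# Query-token similarity scores, analogous to end_points['last_sem_cls_scores'].
sem_cls_scores = np.random.randn(B, Q, T)

# For a detection prompt like "chair. table. sofa.", word_idx maps each
# class to its slot in the class list, and token_idx maps it to the token
# position of its name in the tokenized prompt (hypothetical positions here).
word_idx = np.array([0, 1, 2])
token_idx = np.array([1, 4, 7])

# Aggregate: each query's score for a class is the score of the token(s)
# spelling out that class name.
cls_logits = np.zeros((B, Q, C))
for cls, tok in zip(word_idx, token_idx):
    cls_logits[..., cls] = sem_cls_scores[..., tok]

print(cls_logits.shape)  # (2, 4, 3)
```

With multiple tokens per class name, the per-token scores would be averaged (or otherwise pooled) instead of copied directly.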

soham-joshi commented 1 year ago

Okay. Additionally, what do the second and third dimensions of end_points['last_sem_cls_scores'] represent? Are they logits over a set of (256?) classes?

ayushjain1144 commented 1 year ago

The second dimension is the number of queries, and the third is the logits over the 256 tokens in the sentence span. Each query predicts a distribution over the sentence span.
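The tensor layout just described can be sketched as follows (illustrative only; the sizes match the shapes quoted in the question):

```python
import numpy as np

# B batches, 256 queries, 256 token slots in the (padded) sentence span.
B, num_queries, num_tokens = 2, 256, 256

# Stand-in for end_points['last_sem_cls_scores']: per-query logits
# over the token span.
sem_cls_scores = np.random.randn(B, num_queries, num_tokens)

# Softmax along the last axis turns each query's logits into a
# distribution over the sentence tokens.
exp = np.exp(sem_cls_scores - sem_cls_scores.max(-1, keepdims=True))
probs = exp / exp.sum(-1, keepdims=True)

print(probs.shape)            # (2, 256, 256)
print(probs.sum(-1).round(6)) # each query's distribution sums to 1
```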

soham-joshi commented 1 year ago

Thanks for the responses @ayushjain1144 @nickgkan !