Hi, there are two steps.
First, we have the ground-truth phrase/object class in the annotations, so all we have to do is locate it in the input utterance. We simply do string matching to find it. For SR3D this works perfectly well; for Nr3D and ScanRefer it works most of the time (>90%).
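For reference, a minimal sketch of this string-matching step (`find_span` is an illustrative name, not the repo's actual API):

```python
def find_span(utterance: str, phrase: str):
    """Return (start, end) character indices of `phrase` in `utterance`,
    or None if there is no exact match (the <10% of Nr3D/ScanRefer cases)."""
    start = utterance.lower().find(phrase.lower())
    if start == -1:
        return None  # these cases are handled by the learned classifier below
    return start, start + len(phrase)

# Example: locating the annotated object class "chair" in an utterance.
span = find_span("the chair that is near the table", "chair")
# span == (4, 9)
```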
Then, based on the found alignments, we train a text classifier (one per dataset) to predict the span given the utterance. This generates pseudo-ground-truth spans for all utterances, which we then use to train and test our model.
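Sketched out, this second step could look roughly like the following, assuming a BERT-style token classifier trained on the matched spans as per-token pseudo-labels (the actual implementation is in src/text_cls.py and may differ):

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # 0 = outside span, 1 = inside span

def make_labels(utterance, span):
    """Turn a character-level (start, end) span into per-token 0/1 labels."""
    enc = tokenizer(utterance, return_offsets_mapping=True, return_tensors="pt")
    labels = torch.zeros_like(enc["input_ids"])
    for i, (s, e) in enumerate(enc["offset_mapping"][0].tolist()):
        if e > s and s < span[1] and e > span[0]:  # token overlaps the span
            labels[0, i] = 1
    return labels

# One (utterance, matched span) pair gives one training example:
text = "the chair that is near the table"
inputs = tokenizer(text, return_tensors="pt")
labels = make_labels(text, (4, 9))          # span found by string matching
loss = model(**inputs, labels=labels).loss  # per-token cross-entropy
```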
Hi, I have the same question.
You can actually find the code for this in src/text_cls.py.
Greetings and thank you for responding.
After reviewing the code, I must say that you did an excellent job!
I have a minor inquiry: what was the reason for appending "not mentioned" to the utterance fed to the network? Since the "not mentioned" span is never assigned as ground truth, it is highly improbable that the network will ever predict it, i.e., whether you append "not mentioned" or not, the inference results should remain the same. I guess you did this to match the input format of BUTD-DETR?
Hi, "not-mentioned" is used to match all queries which do not match to any ground truths. For more details, please check Section-3.4 of the paper.
Hi, "not-mentioned" is used to match all queries which do not match to any ground truths. For more details, please check Section-3.4 of the paper.
Hi, I see that, but I am asking about src/text_cls.py; am I missing something?
Oh I see, right; there it's just done to match the BUTD-DETR input format.
Hi, I have a general question regarding the soft token prediction used.
I wonder how the ground truth for soft token prediction is obtained. I know that in the case of MDETR this is obtained via a complicated data-combination process, since most datasets do not provide fine-grained alignment between each text token and the boxes. I reckon that in 3D this fine-grained alignment is also not easily obtained, so how is this level of ground truth produced? I cannot find this part in the paper.

Best, Zhening