Regarding the evaluation metrics

soham-joshi commented 1 year ago

Hi @ayushjain1144 ,

While evaluating the model, the following analyses are available:

`Testing evaluation..................... [02/23 10:49:16 sr3d_cls_eval]: Eval: [1000/1478] [02/23 10:49:16 sr3d_cls_eval]: loss 6.7607 loss_bbox 0.8711 loss_ce 9.3744 loss_constrastive_align 31.2625 loss_giou 2.2085 query_points_generationloss 0.0022 last Box given span (soft-token) Acc: 0.682048967618188 last Box given span (contrastive) Acc: 0.6841927112715784 proposal Box given span (soft-token) Acc: 0.50462597314679 proposal Box given span (contrastive) Acc: 0.5195193501071872 0head Box given span (soft-token) Acc: 0.6286246192034299 0head Box given span (contrastive) Acc: 0.6341532212569108 1head Box given span (soft-token) Acc: 0.67149949227124 1head Box given span (contrastive) Acc: 0.6738688931513032 2head Box given span (soft-token) Acc: 0.6774794087780661 2head Box given span (contrastive) Acc: 0.6774794087780661 3head Box given span (soft-token) Acc: 0.6788897664447704 3head Box given span (contrastive) Acc: 0.68075143856482 4head Box given span (soft-token) Acc: 0.6806386099514837 4head_ Box given span (contrastive) Acc: 0.6831208394448832

Analysis easy 0.6998390989541432 hard 0.6474697885196374 vd 0.5186170212765957 vid 0.6915282196300224 unique 0.0 multi 0.6841927112715784`

The log above comes from evaluating BUTD-DETR (CLS mode) on the SR3D dataset. I understand the analyses easy, hard, vd, and vid, but Tables 6 and 7 also report another metric, "Overall (GT)". Could you share how the overall score is calculated? I wasn't able to locate the corresponding code. Thanks!

ayushjain1144 commented 1 year ago

Hi, "overall" is the "last_ Box given span" accuracy, which is printed once for "soft-token" and once for "contrastive". We usually report the contrastive evaluation score.
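For anyone scripting this, here is a minimal sketch of pulling the "overall" number out of an SR3D/NR3D log like the one above. The helper name `overall_from_log` and the regular expression are assumptions based only on the log format pasted in this thread; they are not code from the repository.

```python
import re


# Hypothetical helper (not part of butd_detr): extract the "overall" score,
# i.e. the "last ... Box given span (<mode>) Acc" entry, from an eval log.
def overall_from_log(log_text, mode="contrastive"):
    # "last_?" tolerates the prefix being printed as either "last" or "last_".
    pattern = (
        r"last_?\s+Box given span \("
        + re.escape(mode)
        + r"\) Acc:\s*([0-9]+(?:\.[0-9]+)?)"
    )
    match = re.search(pattern, log_text)
    return float(match.group(1)) if match else None


# A shortened excerpt of the SR3D log from the first comment.
sample = (
    "last Box given span (soft-token) Acc: 0.682048967618188 "
    "last Box given span (contrastive) Acc: 0.6841927112715784 "
    "proposal Box given span (soft-token) Acc: 0.50462597314679"
)
print(overall_from_log(sample))                # contrastive score
print(overall_from_log(sample, "soft-token"))  # soft-token score
```

The function returns `None` when the pattern is not found, so a log in a different format fails visibly instead of producing a wrong number.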

soham-joshi commented 1 year ago

Okay. Thanks for the clarification @ayushjain1144 !

soham-joshi commented 1 year ago

Hi @ayushjain1144, another question regarding ScanRefer benchmarking (see Table 8 in the paper). The following is extracted from the ScanRefer evaluation log:

`Test evaluation....... last Box given span (soft-token) Acc0.25: Top-1: 0.518, Top-5: 0.764, Top-10: 0.816 last Box given span (soft-token) Acc0.50: Top-1: 0.384, Top-5: 0.604, Top-10: 0.660 last Box given span (contrastive) Acc0.25: Top-1: 0.519, Top-5: 0.765, Top-10: 0.818 last Box given span (contrastive) Acc0.50: Top-1: 0.384, Top-5: 0.606, Top-10: 0.661 proposal Box given span (soft-token) Acc0.25: Top-1: 0.485, Top-5: 0.750, Top-10: 0.822 proposal Box given span (soft-token) Acc0.50: Top-1: 0.337, Top-5: 0.579, Top-10: 0.652 proposal Box given span (contrastive) Acc0.25: Top-1: 0.498, Top-5: 0.750, Top-10: 0.827 proposal Box given span (contrastive) Acc0.50: Top-1: 0.345, Top-5: 0.581, Top-10: 0.658 0head Box given span (soft-token) Acc0.25: Top-1: 0.508, Top-5: 0.763, Top-10: 0.827 0head Box given span (soft-token) Acc0.50: Top-1: 0.365, Top-5: 0.598, Top-10: 0.665 0head Box given span (contrastive) Acc0.25: Top-1: 0.511, Top-5: 0.761, Top-10: 0.824 0head Box given span (contrastive) Acc0.50: Top-1: 0.368, Top-5: 0.603, Top-10: 0.666 1head Box given span (soft-token) Acc0.25: Top-1: 0.516, Top-5: 0.765, Top-10: 0.823 1head Box given span (soft-token) Acc0.50: Top-1: 0.375, Top-5: 0.607, Top-10: 0.665 1head Box given span (contrastive) Acc0.25: Top-1: 0.517, Top-5: 0.766, Top-10: 0.820 1head Box given span (contrastive) Acc0.50: Top-1: 0.378, Top-5: 0.608, Top-10: 0.666 2head Box given span (soft-token) Acc0.25: Top-1: 0.517, Top-5: 0.765, Top-10: 0.815 2head Box given span (soft-token) Acc0.50: Top-1: 0.375, Top-5: 0.606, Top-10: 0.660 2head Box given span (contrastive) Acc0.25: Top-1: 0.520, Top-5: 0.765, Top-10: 0.816 2head Box given span (contrastive) Acc0.50: Top-1: 0.377, Top-5: 0.607, Top-10: 0.663 3head Box given span (soft-token) Acc0.25: Top-1: 0.518, Top-5: 0.765, Top-10: 0.816 3head Box given span (soft-token) Acc0.50: Top-1: 0.380, Top-5: 0.610, Top-10: 0.667 3head Box given span (contrastive) Acc0.25: Top-1: 0.522, Top-5: 0.765, Top-10: 0.817 
3head Box given span (contrastive) Acc0.50: Top-1: 0.383, Top-5: 0.611, Top-10: 0.668 4head Box given span (soft-token) Acc0.25: Top-1: 0.518, Top-5: 0.765, Top-10: 0.817 4head Box given span (soft-token) Acc0.50: Top-1: 0.385, Top-5: 0.609, Top-10: 0.668 4head Box given span (contrastive) Acc0.25: Top-1: 0.519, Top-5: 0.766, Top-10: 0.820 4head Box given span (contrastive) Acc0.50: Top-1: 0.385, Top-5: 0.611, Top-10: 0.669

Analysis easy 0.7294164668265388 hard 0.4444761632886098 vd 0.4897959183673469 vid 0.5608159153865525 unique 0.8273431994362227 multi 0.4654469032018791`

Table 8 in the paper reports Unique@0.25, Unique@0.5, Multi@0.25, Multi@0.5, Overall@0.25, and Overall@0.5. Could you please help me find those results in the log above?

Thanks in advance!

ayushjain1144 commented 1 year ago

Hi, `unique 0.8273431994362227 multi 0.4654469032018791` are the results for unique and multi at the 0.25 threshold.

You can get the results at the 0.5 threshold by changing this line from `if k == 1 and t == self.thresholds[0]:` to `if k == 1 and t == self.thresholds[1]:`.
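To make the role of the threshold concrete, here is a toy sketch of thresholded accuracy split into unique and multi subsets. It is not the repository's `GroundingEvaluator`: the function names, the axis-aligned box format, and the sample boxes are all made up for illustration. The only thing taken from this thread is that `thresholds[0]` corresponds to 0.25 and `thresholds[1]` to 0.5.

```python
# Toy illustration only (not butd_detr code): Acc@IoU for unique/multi subsets.
THRESHOLDS = [0.25, 0.5]  # index 0 -> 0.25, index 1 -> 0.5, as in the thread


def iou_3d(a, b):
    """IoU of two axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for i in range(3):
        overlap = min(a[i + 3], b[i + 3]) - max(a[i], b[i])
        if overlap <= 0:
            return 0.0  # boxes do not intersect along this axis
        inter *= overlap
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


def subset_accuracy(samples, threshold):
    """samples: list of (pred_box, gt_box, is_unique) tuples."""
    hits = {"unique": 0, "multi": 0}
    counts = {"unique": 0, "multi": 0}
    for pred, gt, is_unique in samples:
        key = "unique" if is_unique else "multi"
        counts[key] += 1
        hits[key] += int(iou_3d(pred, gt) >= threshold)
    # Guard the division so an empty subset yields 0.0.
    return {k: hits[k] / max(counts[k], 1) for k in hits}


box = (0, 0, 0, 2, 2, 2)
shifted = (0, 0, 1, 2, 2, 3)  # IoU with `box` is 4 / 12 = 1/3
far = (5, 5, 5, 7, 7, 7)      # no overlap with `box`
samples = [
    (box, box, True),
    (box, shifted, True),
    (box, shifted, False),
    (box, far, False),
]

for t in THRESHOLDS:
    print(t, subset_accuracy(samples, t))
```

The two shifted predictions count as hits at 0.25 but as misses at 0.5, which is the kind of difference the one-line change is meant to expose.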

soham-joshi commented 1 year ago

Got it, thank you!

soham-joshi commented 1 year ago

Hey @ayushjain1144, I evaluated on SR3D (CLS) at both the 0.25 and 0.50 thresholds using the approach suggested above. However, the two evaluation logs are essentially the same.

With `if k == 1 and t == self.thresholds[0]:`

`Testing evaluation..................... [03/02 11:05:46 root]: Eval: [1000/1478] [03/02 11:05:46 root]: loss 6.1318 loss_bbox 0.6459 loss_ce 7.4211 loss_constrastive_align 30.3128 loss_giou 1.8407 query_points_generationloss 0.0021 last Box given span (soft-token) Acc: 0.6717815638045809 last Box given span (contrastive) Acc: 0.6739817217646396 proposal Box given span (soft-token) Acc: 0.42564594381135057 proposal Box given span (contrastive) Acc: 0.4426830644251382 0head Box given span (soft-token) Acc: 0.44618075143856484 0head Box given span (contrastive) Acc: 0.4616382714656437 1head Box given span (soft-token) Acc: 0.6249012749633307 1head Box given span (contrastive) Acc: 0.6285682048967618 2head Box given span (soft-token) Acc: 0.6581857158975516 2head Box given span (contrastive) Acc: 0.6608936026176239 3head Box given span (soft-token) Acc: 0.6676633194178043 3head Box given span (contrastive) Acc: 0.6703147918312083 4head Box given span (soft-token) Acc: 0.6729662642446125 4head_ Box given span (contrastive) Acc: 0.6752228365113393

Analysis easy 0.7012067578439259 hard 0.610083081570997 vd 0.48138297872340424 vid 0.6825144338399906 unique 0.0 multi 0.6739817217646396`

With `if k == 1 and t == self.thresholds[1]:`

`Testing evaluation..................... [03/02 11:58:31 root]: Eval: [1000/1478] [03/02 11:58:31 root]: loss 6.1318 loss_bbox 0.6459 loss_ce 7.4211 loss_constrastive_align 30.3128 loss_giou 1.8407 query_points_generationloss 0.0021 last Box given span (soft-token) Acc: 0.6717815638045809 last Box given span (contrastive) Acc: 0.6739817217646396 proposal Box given span (soft-token) Acc: 0.42564594381135057 proposal Box given span (contrastive) Acc: 0.4426830644251382 0head Box given span (soft-token) Acc: 0.44618075143856484 0head Box given span (contrastive) Acc: 0.4616382714656437 1head Box given span (soft-token) Acc: 0.6249012749633307 1head Box given span (contrastive) Acc: 0.6285682048967618 2head Box given span (soft-token) Acc: 0.6581857158975516 2head Box given span (contrastive) Acc: 0.6608936026176239 3head Box given span (soft-token) Acc: 0.6676633194178043 3head Box given span (contrastive) Acc: 0.6703147918312083 4head Box given span (soft-token) Acc: 0.6729662642446125 4head_ Box given span (contrastive) Acc: 0.6752228365113393

Analysis easy 0.7012067578439259 hard 0.610083081570997 vd 0.48138297872340424 vid 0.6825144338399906 unique 0.0 multi 0.6739817217646396`

Am I doing something wrong? Could you help me with this, @ayushjain1144?

ayushjain1144 commented 1 year ago

Hi, I think this is because you are evaluating in the CLS setting where the benchmark just wants us to select the correct ground truth box (and there is no IoU involvement in benchmark evaluation).

You will only see a difference between IoU thresholds in the DET setup, so the line to change is the one in the `GroundingEvaluator` class, which is responsible for evaluating the DET setup.
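A small sketch of why the threshold has no effect in the CLS setting: the model picks one of the ground-truth boxes, so a prediction is either the target box or it is not. The function and variable names below are hypothetical and do not mirror the repository's evaluator.

```python
# Illustrative only: in a CLS-style evaluation the model chooses among
# ground-truth candidate boxes, so correctness is an index match, not an
# IoU test against a threshold.
def cls_accuracy(predicted_ids, target_ids):
    """Fraction of samples where the chosen candidate is the target object."""
    correct = sum(int(p == t) for p, t in zip(predicted_ids, target_ids))
    return correct / max(len(target_ids), 1)


predicted_ids = [3, 7, 1, 4]  # index of the box picked for each utterance
target_ids = [3, 2, 1, 4]     # index of the annotated target box

# No IoU threshold appears anywhere in the computation, so there is nothing
# for a 0.25 vs 0.5 setting to change.
print(cls_accuracy(predicted_ids, target_ids))  # 3 of 4 correct
```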

soham-joshi commented 1 year ago

Okay, thanks for the clarification! @ayushjain1144

mrsempress commented 3 months ago

Hi @ayushjain1144, I tried to reproduce the results with `sh train_test_det.sh`, but the unique result is always 0.0. In other papers, the reported unique results are not 0.0.

ayushjain1144 commented 3 months ago

Hi, are you passing the `--checkpoint_path` flag with the path to your downloaded checkpoint? (See the README.)

If you are, maybe share the exact script you are running and we can see what's wrong.

mrsempress commented 3 months ago

@ayushjain1144 I'm sorry. I read the README from top to bottom and only ran the command from the usage section, `sh train_test_det.sh`, without loading the pre-trained checkpoint. I will try again with the pre-trained weights. Another question: where does this pre-trained checkpoint come from? Can I train it myself? Also, by default the weights are loaded in resume mode, so training starts directly from epoch 28.

ayushjain1144 commented 3 months ago

We trained the weights for our experiments and released them in the "pre-trained weight" section.

Yes, you should be able to reproduce the same checkpoint using `train_test_det.sh`.

We start training from scratch, so you should not load the pre-trained weights if you are trying to reproduce the training. (It is expected behaviour that the model resumes training from the epoch of the checkpoint.)
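The difference between resuming from a checkpoint and only initializing from its weights can be sketched with plain dictionaries. This is a generic illustration, not the repository's checkpoint code: the keys `model`, `optimizer`, and `epoch`, both function names, and the convention that training continues at `epoch + 1` are assumptions.

```python
# Generic illustration with plain dicts (hypothetical keys and function names).
def resume_from(checkpoint):
    """Resume: restore weights, optimizer state, and the epoch counter."""
    return {
        "model": dict(checkpoint["model"]),
        "optimizer": dict(checkpoint["optimizer"]),
        # Assumed convention: continue with the epoch after the saved one.
        "start_epoch": checkpoint["epoch"] + 1,
    }


def init_from_weights(checkpoint):
    """Weight-only init: reuse the weights but start training at epoch 0."""
    return {
        "model": dict(checkpoint["model"]),
        "optimizer": {},  # fresh optimizer state
        "start_epoch": 0,
    }


# Made-up checkpoint contents, for illustration only.
checkpoint = {"model": {"w": 0.5}, "optimizer": {"lr": 1e-4}, "epoch": 10}

print(resume_from(checkpoint)["start_epoch"])        # continues after epoch 10
print(init_from_weights(checkpoint)["start_epoch"])  # starts over at epoch 0
```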

mrsempress commented 3 months ago

@ayushjain1144 Sorry, I think you misunderstood me.

When I used `sh train_test_det.sh` to reproduce your results, the unique result was always 0.0. You then suggested using `--checkpoint_path` to load your weights. But your weights were themselves obtained by training with `sh train_test_det.sh`, and the flag does not just load the weights; it resumes from the checkpoint. My understanding of loading weights is what is done with an ImageNet pre-trained model: it initializes ResNet or another backbone, and training still starts from epoch 0.
Back to the original question: unique is always 0.0. Since this is a resume operation, I don't think resuming from the weights would change unique, which is strange. If I have misunderstood anything, please tell me.

ayushjain1144 commented 3 months ago

Oh, I see, sorry for the confusion. unique will always be 0 for SR3D/NR3D because these benchmarks do not have a unique/multiple split (instead they have easy/hard/view-dep/view-indep). ScanRefer has unique/multiple, and you should see non-zero numbers for those. Refer to Tables 6, 7 and 8 in our paper: https://arxiv.org/pdf/2112.08879.pdf
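One way a subset score ends up as exactly 0.0 is when no sample falls into that subset and the evaluator guards the division. The sketch below assumes that kind of guard. It is an illustration with made-up names and data, not a copy of the repository's code, though it is consistent with the SR3D logs earlier in this thread, where multi equals the "last" contrastive accuracy and unique is 0.0.

```python
# Illustration (assumed behaviour, hypothetical names): per-subset accuracy
# with a division guard, so an empty subset reports 0.0 instead of failing.
def per_subset_accuracy(records, subsets):
    """records: list of (subset_name, is_correct) pairs."""
    hits = {name: 0 for name in subsets}
    totals = {name: 0 for name in subsets}
    for name, is_correct in records:
        totals[name] += 1
        hits[name] += int(is_correct)
    return {name: hits[name] / max(totals[name], 1) for name in subsets}


# SR3D/NR3D-style toy data: every sample is filed under "multi",
# and nothing is ever filed under "unique".
records = [("multi", True), ("multi", False), ("multi", True), ("multi", True)]
print(per_subset_accuracy(records, ["unique", "multi"]))
```

With this guard, a 0.0 for unique says nothing about model quality; it only means the subset is empty.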

mrsempress commented 3 months ago

Thanks for your reply, now I understand it.

nickgkan / butd_detr

Regarding the evaluation metrics #12