There are three different evaluation scripts available in the MDETR repo:
| Evaluation code | `eval_clevr.py` | `eval_gqa.py` | `eval_lvis.py` |
|---|---|---|---|
| Notes | Dumps the model's predictions on an arbitrary split of CLEVR/CoGenT/CLEVR-Humans | Evaluates the model's predictions on the GQA (visual question answering) dataset and dumps the results to a file | Performs object detection evaluation on the LVIS (Large Vocabulary Instance Segmentation) dataset |
The initial analysis suggests using `eval_lvis.py`, as it is intended for detection, whereas the other two target visual question answering and their evaluation metrics depend on the type of the answers.
- Create a dataset file in `./GazeMDETR/datasets`; it can follow the structure of `flickr.py` or `vg.py` (a minimal skeleton is sketched after this list).
- `--resume` from the model checkpoint used in the demo code.
- Using the `GazeMDETR_eval.py` file, try to calculate the IoU and recall (taking hints from `flickr_eval.py`).
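As a rough illustration of the first step, the sketch below shows what such a dataset file could look like, assuming a plain PyTorch `Dataset` that loads an image and returns it together with a target dictionary holding the ground-truth boxes and the caption. The class name, annotation layout, and field names are all hypothetical; the real file should mirror `flickr.py`/`vg.py` from the MDETR repo.

```python
import json
from pathlib import Path

import torch
from PIL import Image
from torch.utils.data import Dataset


class GazeMDETRDetection(Dataset):
    """Hypothetical dataset skeleton; annotation layout and field names are assumptions."""

    def __init__(self, img_folder, ann_file, transforms=None):
        self.img_folder = Path(img_folder)
        # assumed format: a list of {"image": ..., "caption": ..., "boxes": [[x1, y1, x2, y2], ...]}
        with open(ann_file) as f:
            self.annotations = json.load(f)
        self.transforms = transforms

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, idx):
        ann = self.annotations[idx]
        img = Image.open(self.img_folder / ann["image"]).convert("RGB")
        target = {
            "boxes": torch.as_tensor(ann["boxes"], dtype=torch.float32),
            "caption": ann["caption"],
        }
        if self.transforms is not None:
            img, target = self.transforms(img, target)
        return img, target
```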
Considering the information above, the recall metric will be used. However, since GazeMDETR is meant to output only one bounding box (the one with the highest confidence), only recall@1 will be considered, with an adjustable IoU threshold.
Since the detection label is taken from the given prompt, and the prompt categories we use mention only one object per sentence, there will be no cases with wrong labels. Moreover, since the model always provides at least one prediction, the case of having no bounding box at all is also eliminated. Therefore, the false negatives are exactly the cases with IoU below the IoU threshold.
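Under these assumptions, recall@1 reduces to the fraction of test samples whose single predicted box overlaps the ground-truth box with IoU at or above the threshold. A minimal sketch of that computation (function and variable names are hypothetical; boxes are assumed to be in `[x1, y1, x2, y2]` format):

```python
import numpy as np

def recall_at_1(pred_boxes, gt_boxes, iou_thresh=0.5):
    """Fraction of samples whose single predicted box matches the ground truth.

    pred_boxes, gt_boxes: arrays of shape (N, 4) in [x1, y1, x2, y2] format,
    one predicted box and one ground-truth box per sample.
    """
    pred_boxes = np.asarray(pred_boxes, dtype=float)
    gt_boxes = np.asarray(gt_boxes, dtype=float)

    # intersection rectangle per sample
    x1 = np.maximum(pred_boxes[:, 0], gt_boxes[:, 0])
    y1 = np.maximum(pred_boxes[:, 1], gt_boxes[:, 1])
    x2 = np.minimum(pred_boxes[:, 2], gt_boxes[:, 2])
    y2 = np.minimum(pred_boxes[:, 3], gt_boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)

    area_pred = (pred_boxes[:, 2] - pred_boxes[:, 0]) * (pred_boxes[:, 3] - pred_boxes[:, 1])
    area_gt = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    iou = inter / (area_pred + area_gt - inter)

    # a prediction counts as a true positive iff its IoU clears the threshold
    return float((iou >= iou_thresh).mean())
```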
From `flickr_eval.py`, the computation of recall@k can be extracted:
```python
# ious[i, j]: IoU between the i-th predicted box and the j-th ground-truth box
ious = box_iou(np.asarray(cur_boxes), np.asarray(target_boxes))
for k in self.topk:
    maxi = 0
    if k == -1:
        # k == -1 uses all predictions (upper bound)
        maxi = ious.max()
    else:
        assert k > 0
        # best IoU among the first k predictions (assumed sorted by confidence)
        maxi = ious[:k].max()
    if maxi >= self.iou_thresh:
        recall_tracker.add_positive(k, "all")
        for phrase_type in phrase["phrase_type"]:
            recall_tracker.add_positive(k, phrase_type)
    else:
        recall_tracker.add_negative(k, "all")
        for phrase_type in phrase["phrase_type"]:
            recall_tracker.add_negative(k, phrase_type)
```
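To make the behaviour of this loop concrete, the toy example below runs the same top-k check on made-up boxes, with `torchvision.ops.box_iou` standing in for the MDETR helper (all numbers are invented for illustration). Since GazeMDETR keeps only the single most confident box, the loop effectively collapses to one IoU comparison, i.e. recall@1.

```python
import torch
from torchvision.ops import box_iou

iou_thresh = 0.5
topk = [1, 5, -1]  # -1 means "use all predictions" (upper bound)

# predictions sorted by confidence (best first), and one ground-truth box
cur_boxes = torch.tensor([[12., 10., 48., 60.], [0., 0., 20., 20.], [30., 30., 90., 90.]])
target_boxes = torch.tensor([[10., 10., 50., 60.]])

ious = box_iou(cur_boxes, target_boxes)  # shape (num_preds, num_targets)
for k in topk:
    maxi = ious.max() if k == -1 else ious[:k].max()
    hit = bool(maxi >= iou_thresh)
    print(f"recall@{k}: {'positive' if hit else 'negative'} (best IoU = {float(maxi):.2f})")
```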
Since the demo code has already been modified to run on the test data, the evaluation is integrated into the demo code as well, with different flags covering the different use cases:
```python
import argparse

parser = argparse.ArgumentParser(description='Caption format selection and evaluation selection')
parser.add_argument('-cc', '--caption_category', type=str, choices=['A', 'B', 'C', 'D', 'E'], default='A', help='Specify a value (A, B, C, D, E) to determine the caption category. A:The, B:This is a, C:Look at the, D:Point at the, E:Pass the')
parser.add_argument('-cd', '--caption_details', type=int, choices=[1, 2, 3, 4], default=1, help='Specify a detail level as (1, 2, 3, 4) to determine the caption details. 1:pose+color+name+placement, 2:pose+name+placement, 3:color+name, 4:name')
# BooleanOptionalAction (Python 3.9+) avoids the type=bool pitfall where any non-empty string parses as True
parser.add_argument('-eval', '--evaluate', action=argparse.BooleanOptionalAction, default=True, help='Specify if you want to evaluate the output in terms of IoU')
parser.add_argument('-sf', '--save_figures', action=argparse.BooleanOptionalAction, default=True, help='Specify if you want to save the generated figures for heatmaps and final selections')
parser.add_argument('-vf', '--visualize_figures', action=argparse.BooleanOptionalAction, default=True, help='Specify if you want to visualize the generated figures for heatmaps and final selections')
parser.add_argument('-iou', '--iou_thresh', type=float, default=0.5, help='Specify the IoU threshold for the evaluations')
args = parser.parse_args()
```
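For clarity, the snippet below sketches how a prompt could be assembled from `caption_category` and `caption_details` according to the help strings above. This is a hypothetical reconstruction, not necessarily how the GazeMDETR demo builds its prompts, and the attribute names (`pose`, `color`, `name`, `placement`) are assumptions.

```python
# Hypothetical reconstruction of the prompt from the two caption flags;
# the actual GazeMDETR demo may assemble it differently.
CATEGORY_PREFIX = {"A": "The", "B": "This is a", "C": "Look at the", "D": "Point at the", "E": "Pass the"}

def build_caption(category, detail, pose, color, name, placement):
    prefix = CATEGORY_PREFIX[category]
    if detail == 1:    # pose + color + name + placement
        body = f"{pose} {color} {name} {placement}"
    elif detail == 2:  # pose + name + placement
        body = f"{pose} {name} {placement}"
    elif detail == 3:  # color + name
        body = f"{color} {name}"
    else:              # name only
        body = name
    return f"{prefix} {body}"

# e.g. build_caption(args.caption_category, args.caption_details,
#                    "standing", "red", "mug", "on the table")
```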
All the required functions are retrieved from `flickr_eval.py`, modified for the needs of GazeMDETR, and collected in `GazeMDETR_eval_util.py`.
Check the instructions on the MDETR repo to find the proper validation code.