Possible mistake in ava evaluation

gurkirt commented 2 years ago

STAR; Done!

Checklist

I have searched related issues but haven't found a relevant issue.
The bug has not been fixed in the latest version.

Done!

Describe the bug

This is not an error but an evaluation bug, which results in higher mAP than it should.

The current codebase uses the ground-truth validation file to build the dataset, for evaluation of AVA, which I think is wrong or at least according to the original Slowfast codebase.

More precisely as follows:

Current member function load_annotations of AVADataset use ground truth file to build self.video_info field, which is used to loop over the dataset. But, that could be wrong, it means it is using ground information (which key-frames contains ground truth) to run the validation-set prediction file generation. Ideally, as is in PySlowfast, we should use on;y proposals to build validation dataset, original FAIR proposal, as used by this codebase have 55466 keyframes, however, an ultimate number of keyframes depends upon the person_det_score_thr, if set to 0.9, as currently used, we will have 51635 keyframes, compared to ground truth 50252 keyframes and all proposal keyframes of 55466.

Finally, given the provided checkpoint for R50_based_config, I got 26.27 mAP given the current setup compared to 26.4 which could be because of environmental differences, I created a separate issue regarding it. After correcting for this ground truth bias, i.e. using proposals to build video_info in AVA_dataset class, I got 24.81 mAP, which is far lower than reported.

I believe I am correct here because R50 based model in PySlowfast also achieves 24.7 mAP. I thought you guys might have a better training setup in this codebase, as a result, you have a much better and faster (only 10 epochs) R50 based slowfast baseline. I would love to be wrong so that I have a better baseline to work with, but I am afraid I might be right, if I am right then we can not use the generated number from the current state of baseline to compare with other papers.

Looking forward to your response.

To correct the bias, I load the proposal first then call super().init() in AVA dataset. Then I made the following changes to the load_annotations function. The first part of the function is the same it the last part where video_info is being built that is using proposal info. Of course, changing the value of person_det_score_thr could result in better numbers.

    def load_annotations(self):

        self.logger.info("Start loading AVA annotations from "+ self.ann_file)

        video_infos = []
        records_dict_by_img = defaultdict(list)
        cc = 0
        with open(self.ann_file, 'r') as fin:
            for line in fin:
                line_split = line.strip().split(',')
                # cc += 1
                # if cc>300:
                #     break
                label = int(line_split[6])
                if self.custom_classes is not None:
                    if label not in self.custom_classes:
                        continue
                    label = self.custom_classes.index(label)

                video_id = line_split[0]
                timestamp = int(line_split[1])
                img_key = f'{video_id},{timestamp:04d}'

                entity_box = np.array(list(map(float, line_split[2:6])))
                entity_id = int(line_split[7])
                shot_info = (0, (self.timestamp_end - self.timestamp_start) *
                             self._FPS)

                video_info = dict(
                    video_id=video_id,
                    timestamp=timestamp,
                    entity_box=entity_box,
                    label=label,
                    entity_id=entity_id,
                    shot_info=shot_info)
                records_dict_by_img[img_key].append(video_info)

        self.logger.info('Done loading annoations: Loaded keyframes are : ' + str(len(records_dict_by_img)))
        if not self.test_mode or self.proposals is None:
            for img_key in records_dict_by_img:
                video_id, timestamp = img_key.split(',')
                bboxes, labels, entity_ids = self.parse_img_record(
                    records_dict_by_img[img_key])
                ann = dict(
                    gt_bboxes=bboxes, gt_labels=labels, entity_ids=entity_ids)
                frame_dir = video_id
                if self.data_prefix is not None:
                    frame_dir = osp.join(self.data_prefix, frame_dir)
                video_info = dict(
                    frame_dir=frame_dir,
                    video_id=video_id,
                    timestamp=int(timestamp),
                    img_key=img_key,
                    shot_info=shot_info,
                    fps=self._FPS,
                    ann=ann)
                video_infos.append(video_info)
        else:
            for img_key in self.proposals:
                video_id, timestamp = img_key.split(',')
                if img_key in self.proposals:
                    proposals = self.proposals[img_key]
                    nump = np.sum(proposals[:, 4]>=self.person_det_score_thr)
                    isproposals = True if nump>0 else False
                else:
                    isproposals = False

                if not isproposals:
                    continue

                if img_key in records_dict_by_img:
                    bboxes, labels, entity_ids = self.parse_img_record(records_dict_by_img[img_key])
                else:
                    bboxes = np.array([[0, 0, 1, 1]])
                    labels = np.zeros(1)
                    entity_ids = np.zeros(1)

                ann = dict(gt_bboxes=bboxes, gt_labels=labels, entity_ids=entity_ids)
                frame_dir = video_id
                if self.data_prefix is not None:
                    frame_dir = osp.join(self.data_prefix, frame_dir)
                video_info = dict(
                    frame_dir=frame_dir,
                    video_id=video_id,
                    timestamp=int(timestamp),
                    img_key=img_key,
                    shot_info=shot_info,
                    fps=self._FPS,
                    ann=ann)
                video_infos.append(video_info)

        return video_infos

kennymckormick commented 2 years ago

Thanks for pointing out this problem, we will first check the difference between SlowFast and our implementations, and fix the problem ASAP.

kennymckormick commented 2 years ago

Hey gurkirt,

After checking the AVA validation annotations, I think our evaluation protocol can be more reasonable. Some keys do not present in the validation annotations due to the missing annotations, for example, the key `9F2voT6QWvQ,1519' is not present in the validation set, but there is a person standing in that frame. Thus I think it's OK to ignore all those missing keys in both training and testing.

@gurkirt

gurkirt commented 2 years ago

I can understand but your logic which is absolutely right in some cases. But for comparison with existing works which uses FAIR proposals as starting point and Pyslowfast as a development choice, it is not fair to compare with given your evaluation setup.

I know this is not a nice solution but for a fair comparison with previous works, it is necessary to follow the flawed setup that PySlowfast uses, even though, that completely ignores the obvious flaws in annotations of the AVA dataset, which are apparent in validation setup, like you pointed out `9F2voT6QWvQ,1519'. This happens a lot in AVA. But, what can we do?

kennymckormick commented 2 years ago

Hi, gurkirt,

There are 4 cases in total for the AVA train validation: you can choose to ignore (or not ignore) the keyframes without annotations in training & validation (2x2). We ignore them in both training and validation, while SlowFast doesn't ignore them in both training and validation. A good solution might be: we can provide options to not ignore these keyframes when building the dataset, we will take time to develop this and provide the corresponding checkpoints.

open-mmlab / mmaction2

Possible mistake in ava evaluation #1268