luo3300612 / image-captioning-DLCT

Official pytorch implementation of paper "Dual-Level Collaborative Transformer for Image Captioning" (AAAI 2021).
BSD 3-Clause "New" or "Revised" License

Should ['%d_size' % image_id] be equal to the real image size? #11

Open hcl14 opened 3 years ago

hcl14 commented 3 years ago

For a 418x449 image I get [600, 644] written to the features, and for a 1042x480 image I get [1000, 461]. These cannot be values from other images in my dataset, since I don't have any images with those dimensions. Is it OK for these values to differ, or must they correspond to the real image dimensions?

hcl14 commented 3 years ago

UPD: I found that the image size changes at this point in dataset_mapper.py:

    def __call__(self, dataset_dict):
        dataset_dict = copy.deepcopy(dataset_dict)
        image = utils.read_image(dataset_dict["file_name"], format=self.img_format)
        utils.check_image_size(dataset_dict, image)

        sh1 = image.shape  # debug: shape before any transforms are applied

        if "annotations" not in dataset_dict:
            image, transforms = T.apply_transform_gens(
                ([self.crop_gen] if self.crop_gen else []) + self.tfm_gens, image
            )
        else:
            if self.crop_gen:
                crop_tfm = utils.gen_crop_transform_with_instance(
                    self.crop_gen.get_crop_size(image.shape[:2]),
                    image.shape[:2],
                    np.random.choice(dataset_dict["annotations"]),
                )
                image = crop_tfm.apply_image(image)
            image, transforms = T.apply_transform_gens(self.tfm_gens, image)
            if self.crop_gen:
                transforms = crop_tfm + transforms

        image_shape = image.shape[:2]
        print("ttttttttt", sh1, image_shape)

Output:

ttttttttt (480, 640, 3) (600, 800)
ttttttttt (418, 449, 3) (600, 644)
ttttttttt (1042, 480, 3) (1000, 461)
ttttttttt (667, 500, 3) (800, 600)
ttttttttt (700, 1050, 3) (600, 900)
ttttttttt (696, 1280, 3) (544, 1000)
ttttttttt (491, 704, 3) (600, 860)
ttttttttt (1848, 2040, 3) (600, 662)

What is this and can this be avoided?
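The printed shapes are consistent with a shortest-edge resize of the kind detectron2's `ResizeShortestEdge` performs: the shorter side is scaled to 600, and the whole image is shrunk again if the longer side would exceed 1000. The values 600 and 1000 are inferred from the numbers above, not read from this repository's config, so treat the following as a minimal sketch under that assumption. The helper name `resized_shape` is made up for illustration.

```python
def resized_shape(h, w, short_edge=600, max_size=1000):
    """Return the (height, width) produced by a shortest-edge resize.

    short_edge and max_size are assumptions inferred from the printed shapes.
    """
    scale = short_edge / min(h, w)
    if h < w:
        new_h, new_w = short_edge, scale * w
    else:
        new_h, new_w = scale * h, short_edge
    # Shrink further if the longer side would exceed the cap.
    if max(new_h, new_w) > max_size:
        shrink = max_size / max(new_h, new_w)
        new_h, new_w = new_h * shrink, new_w * shrink
    # Round to the nearest integer pixel count.
    return int(new_h + 0.5), int(new_w + 0.5)


# (original height, width) -> (height, width) printed in the output above
observed = {
    (480, 640): (600, 800),
    (418, 449): (600, 644),
    (1042, 480): (1000, 461),
    (667, 500): (800, 600),
    (700, 1050): (600, 900),
    (696, 1280): (544, 1000),
    (491, 704): (600, 860),
    (1848, 2040): (600, 662),
}

for (h, w), printed in observed.items():
    print((h, w), resized_shape(h, w), resized_shape(h, w) == printed)
```

In this sketch the output shape depends only on the input shape, so three data loaders built this way would agree. Whether the real pipeline is deterministic depends on how its short-edge length is sampled, which the sketch does not model.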

I see that the data loader is created twice in the code, and a third time for grid features. If these transforms involve any randomness, that could be a source of bugs, because the three data loaders would yield differently transformed images.

I can disable this code by replacing it with:

    image, transforms = image, []

With that change, image shapes are the same before and after the data loader. That still leaves the question: what is the purpose of this augmentation, and is it correct to remove it?
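If the resize is kept, one thing worth checking is that the stored size and any box coordinates stored with it refer to the same frame (both resized, or both original). Whether the downstream code relies on this is an assumption on my part, not something confirmed from the repository. As a rough sketch, boxes can be moved between the two frames by scaling each axis; `scale_boxes` and the sample box are hypothetical.

```python
def scale_boxes(boxes, from_hw, to_hw):
    """Rescale [x1, y1, x2, y2] boxes from one image frame to another.

    from_hw and to_hw are (height, width) pairs.
    """
    sy = to_hw[0] / from_hw[0]  # vertical scale factor
    sx = to_hw[1] / from_hw[1]  # horizontal scale factor
    return [[x1 * sx, y1 * sy, x2 * sx, y2 * sy] for x1, y1, x2, y2 in boxes]


# Shapes taken from the first line of the output above.
original_hw = (480, 640)
resized_hw = (600, 800)

# Hypothetical box expressed in the resized frame.
boxes_resized = [[100.0, 50.0, 400.0, 300.0]]
boxes_original = scale_boxes(boxes_resized, resized_hw, original_hw)
print(boxes_original)
```

Scaling the two axes separately matters because rounding to whole pixels can make the height and width ratios differ slightly.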