
assert num_buckets == self.num_buckets error #220

Open gudrb opened 5 months ago

gudrb commented 5 months ago

I am trying to fine-tune mini_deit_tiny_patch16_224 on another subtask with a different sequence length of 18 (number of patches) and dimension 192. When I run

    for blk in self.blocks:
        x = blk(x)

I get an error from irpe.py at line 574, "assert num_buckets == self.num_buckets": num_buckets is 50 but self.num_buckets is 49. Do you know why this problem happens and how I can fix it?

wkcn commented 5 months ago

Hi @gudrb , thanks for your attention to our work!

Does the class token exist in the fine-tuned model?

If the class token exists (use_cls_token=True), please set skip=1 in https://github.com/microsoft/Cream/blob/main/MiniViT/Mini-DeiT/mini_deit_models.py#L16. This means the class token is skipped when computing the relative positional encoding.

If not (use_cls_token=False), skip should be set to 0.
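For reference, a minimal sketch of that configuration, assuming the get_rpe_config call used in mini_deit_models.py (the argument values other than skip mirror the linked line and are shown only for illustration):

    # A sketch, not verbatim repo code: choose skip from whether a class token
    # exists, then build the iRPE config the way Mini-DeiT does.
    from irpe import get_rpe_config

    use_cls_token = False  # fine-tuned model without a class token

    rpe_config = get_rpe_config(
        ratio=1.9,
        method="product",
        mode='ctx',
        shared_head=True,
        skip=1 if use_cls_token else 0,  # skip=1 only when a class token is present
        rpe_on='k',
    )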

gudrb commented 5 months ago

Thank you for answering,

I am not using a class token, but I still tried the skip=1 option, and it gives a key error when I load the pretrained model:

    self.v = create_model('mini_deit_tiny_patch16_224', pretrained=False, num_classes=1000,
                          drop_rate=0, drop_path_rate=0.1, drop_block_rate=None)
    checkpoint = torch.load('./checkpoints/mini_deit_tiny_patch16_224.pth', map_location='cpu')
    self.v.load_state_dict(checkpoint['model'])

I tried using a random tensor and observed the blk function as follows:

    x = torch.randn((2, 196, 192), device=x.device)

    for blk in self.v.blocks:
        x = blk(x)
    x = self.v.norm(x)

I found that when I change the second dimension of x from 196 to another value N, I get the error (num_buckets is 50, and self.num_buckets is 49):

    File "/data/hyounggyu/Mini-DeiT/irpe.py", line 573, in _get_rp_bucket
        assert num_buckets == self.num_buckets

Is it possible to use a pretrained model that utilizes IRPE for a different sequence length, such as a varying number of patches?

wkcn commented 5 months ago

@gudrb

> Is it possible to use a pretrained model that utilizes IRPE for a different sequence length, such as a varying number of patches?

Yes. You need to pass the two arguments, width and height, to rpe_k, rpe_q, and rpe_v.

Example: https://github.com/microsoft/Cream/blob/main/iRPE/DETR-with-iRPE/models/rpe_attention/rpe_attention_function.py#L330

iRPE is a 2D relative position encoding. If height and width are not specified, they are both set to the square root of the sequence length, which leads to the wrong number of buckets when the patch grid is not square.

https://github.com/microsoft/Cream/blob/main/MiniViT/Mini-DeiT/irpe.py#L553
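Concretely, a sketch of the change inside the attention forward, assuming the forward(x, height, width) signature of the iRPE modules in irpe.py (H and W here are illustrative and must multiply to the number of patches, excluding any skipped class token):

    # Pass the real patch-grid shape so iRPE does not fall back to assuming a
    # square grid with side sqrt(sequence length).
    H, W = 9, 2  # e.g. a 9 x 2 spectrogram patch grid; H * W == num_patches

    # image relative position on keys; rpe_q and rpe_v take the same extra arguments
    if self.rpe_k is not None:
        attn += self.rpe_k(q, H, W)  # (x, height, width)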

gudrb commented 5 months ago

Now it is working. I modified the code of the MiniAttention class (https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_vision_transformer.py#L97) to manually handle my patch sequence (9 x 2) from a spectrogram image:

    # image relative position on keys
    if self.rpe_k is not None:
        attn += self.rpe_k(q, 9, 2)

I hope this is the correct way to use the MiniAttention class when fine-tuning on a task with a different sequence length.

Thank you.

gudrb commented 4 months ago

Do I need to crop or interpolate pretrained relative positional encoding parameters when the sequence length is changed?

When I use the pretrained Mini-DeiT with both absolute and relative positional encodings, I handle the absolute positional encoding as follows: if the modified sequence length is shorter than 14 I crop it, and if it is longer I interpolate it.

            # get the positional embedding from deit model,  reshape it to original 2D shape.
            new_pos_embed = self.v.pos_embed.detach().reshape(1, self.original_num_patches, self.original_embedding_dim).transpose(1, 2).reshape(1, self.original_embedding_dim, self.oringal_hw, self.oringal_hw)
            # cut (from middle) or interpolate the second dimension of the positional embedding
            if t_dim <= self.oringal_hw:
                new_pos_embed = new_pos_embed[:, :, :, int(self.oringal_hw / 2) - int(t_dim / 2): int(self.oringal_hw / 2) - int(t_dim / 2) + t_dim]
            else:
                new_pos_embed = torch.nn.functional.interpolate(new_pos_embed, size=(self.oringal_hw, t_dim), mode='bicubic')
            # cut (from middle) or interpolate the first dimension of the positional embedding
            if f_dim <= self.oringal_hw:
                new_pos_embed = new_pos_embed[:, :, int(self.oringal_hw / 2) - int(f_dim / 2): int(self.oringal_hw / 2) - int(f_dim / 2) + f_dim, :]
            else:
                new_pos_embed = torch.nn.functional.interpolate(new_pos_embed, size=(f_dim, t_dim), mode='bicubic')
            # flatten the positional embedding
            new_pos_embed = new_pos_embed.reshape(1, self.original_embedding_dim, num_patches).transpose(1,2)

wkcn commented 4 months ago

@gudrb No, you don't. Relative position encoding can be adapted to a different sequence length.
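In other words (a sketch; q_224, q_spec, bias_imagenet, and bias_spectro are illustrative names): the learned iRPE tables are indexed by bucketed relative offsets, so their size depends on the rpe_config rather than on the sequence length, and only the grid shape passed at forward time changes.

    # Same pretrained rpe_k weights reused for two different patch grids: no
    # cropping or interpolation of the relative tables is needed; only the
    # forward-time grid shape differs.
    bias_imagenet = self.rpe_k(q_224, 14, 14)   # original 224 x 224 input, 14 x 14 patches
    bias_spectro  = self.rpe_k(q_spec, 9, 2)    # fine-tuning spectrogram, 9 x 2 patches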