microsoft / XPretrain

Multi-modality pre-training

About LF-VILA code in PatchEmbed3D of video encoder #36

Open musicman217 opened 8 months ago

The padding in PatchEmbed3D does not look right to me, or maybe I made a mistake:

# padding
        _, _, D, H, W = x.size() 
        if H % self.patch_size[0] != 0: 
            x = F.pad(x, (0, 0, 0, self.patch_size[1] - H % self.patch_size[1]))
        if W % self.patch_size[1] != 0:
            x = F.pad(x, (0, 0, 0, 0, 0, self.patch_size[0] - D % self.patch_size[0]))

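Since patch_size=[1, 8, 8], where 8x8 is HxW in the implementation, shouldn't the padding be applied to the H and W dimensions? The conditions H % self.patch_size[0] != 0 and W % self.patch_size[1] != 0 confuse me: H is checked against the temporal patch size (index 0) and W against the height patch size (index 1), and the second branch pads D rather than W. Thanks a lot!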
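
For comparison, here is a minimal sketch of what I would expect index-consistent padding to compute. It assumes patch_size is ordered (D, H, W) and that x has shape (B, C, D, H, W). The helper name pad_amounts is hypothetical and not part of the repository. It only builds the tuple in the order that torch.nn.functional.pad expects, which starts from the last dimension: (W_left, W_right, H_top, H_bottom, D_front, D_back).

```python
def pad_amounts(size, patch_size):
    """Return an F.pad-style tuple for a (D, H, W) size.

    Hypothetical helper for illustration only.
    Assumes patch_size is ordered (pD, pH, pW), matching (D, H, W).
    """
    D, H, W = size
    pD, pH, pW = patch_size
    # (-n) % p is the amount needed to reach the next multiple of p
    # (it is 0 when n is already a multiple of p).
    return (0, (-W) % pW, 0, (-H) % pH, 0, (-D) % pD)


# Example: 4 frames of 30x33 with patch_size=[1, 8, 8]
print(pad_amounts((4, 30, 33), (1, 8, 8)))  # (0, 7, 0, 2, 0, 0)

# Possible use inside forward(), assuming x is (B, C, D, H, W):
#     x = F.pad(x, pad_amounts(x.shape[2:], self.patch_size))
```

With patch_size=[1, 8, 8], this pads W up to 40 and H up to 32 and leaves D unchanged. In the quoted code, by contrast, the first condition (H % 1 != 0) can never be true, so H is never padded.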