lucidrains / vit-pytorch

Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Pytorch
MIT License

Masked tokens #63

Closed zankner closed 3 years ago

zankner commented 3 years ago

Just a quick question: are the masks that get passed to the model at inference meant for masked tokens for self-supervision, or are they a different kind of mask? Thanks

lucidrains commented 3 years ago

@zankner Hi Zach! It's a different mask, for masking out attention to specific patches. It wouldn't matter at all if you always use the same-sized image, but if you somehow have different-sized images padded out to a full square, you can selectively mask out the padding patches.
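For illustration, a minimal sketch of the kind of mask being described: a boolean mask over patches, True for real content and False for padding. The helper name and the `mask` keyword are assumptions for the sketch, not necessarily the repo's exact API.

```python
import torch

def padding_patch_mask(orig_h, orig_w, padded_size, patch_size):
    # number of patches along each side of the padded square image
    n = padded_size // patch_size
    mask = torch.zeros(n, n, dtype=torch.bool)
    # keep any patch that overlaps the original image, assuming the image
    # sits in the top-left corner of the padded square
    mask[: (orig_h + patch_size - 1) // patch_size,
         : (orig_w + patch_size - 1) // patch_size] = True
    return mask.flatten()  # (n * n,), True = attend, False = padding

# e.g. a 180 x 150 image padded to a 224 x 224 square with 16 x 16 patches
mask = padding_patch_mask(180, 150, padded_size=224, patch_size=16)
# preds = model(img, mask=mask.unsqueeze(0))  # hypothetical call, per the mask argument discussed here
```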

lucidrains commented 3 years ago

@zankner I feel like I shouldn't have agreed to build it (some user requested it when the repo was still young). It's really not needed for the majority of use-cases (same-sized images), and it just makes the repo more complicated than it needs to be.

lucidrains commented 3 years ago

@zankner the masked training you are thinking of won't work with ViT anyway.

zankner commented 3 years ago

I might be wrong, but in the original paper didn't they perform masked token prediction for self-supervision?

zankner commented 3 years ago

"We employ the masked patch prediction objective for preliminary self-supervision experiments. To do so we corrupt 50% of patch embeddings by either replacing their embeddings with a learnable [mask] embedding (80%), a random other patch embedding (10%) or just keeping them as is (10%)."

lucidrains commented 3 years ago

@zankner I missed that section! Wow, so it can work, with predicting the 3-bit mean color of each patch being enough.
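For context, the target from the paper is the mean color of each corrupted patch quantized to 3 bits per channel, i.e. 8 × 8 × 8 = 512 classes. A hedged sketch of how such targets could be computed (the function name is made up for illustration):

```python
import torch

def mean_color_targets(img, patch_size):
    # img: (batch, 3, H, W) with values in [0, 1]
    patches = img.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    # patches: (batch, 3, H/p, W/p, p, p) -> mean color per patch
    mean_rgb = patches.mean(dim=(-1, -2)).flatten(2).transpose(1, 2)  # (batch, num_patches, 3)
    # quantize each channel to 3 bits (0..7) and combine into a single class id in 0..511
    quant = (mean_rgb * 7).round().long().clamp(0, 7)
    return quant[..., 0] * 64 + quant[..., 1] * 8 + quant[..., 2]  # (batch, num_patches)
```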

zankner commented 3 years ago

Yeah I think so. I made an implementation of it already on my own mock of the vision transformer. Would there be any interest in a PR for that?

lucidrains commented 3 years ago

@zankner I would gratefully accept! 💯

lucidrains commented 3 years ago

I think the latest self-supervised learning techniques will probably work better: https://github.com/lucidrains/vit-pytorch#self-supervised-training

zankner commented 3 years ago

That's probably true, but I think having the ability to do masked patch prediction at least lets people pursue different research directions or experiment with new things.

guanfuchen commented 3 years ago

@zankner I am also interested in the ability to do masked patch prediction with ViT. I have done lots of experiments with ViT using BYOL for good transfer performance, but I think ViT can be improved further using the tricks from BERT and GPT. So what is your plan for this feature?

zankner commented 3 years ago

@guanfuchen If people want it, I can start working on a PR. I don't have much free time, so would you be able to help with the PR at all? I already have it set up in my implementation of ViT, so the work would mostly be integrating it into this repo.

guanfuchen commented 3 years ago

@zankner Yes, you can share the implementation, and I will test and merge it.

zankner commented 3 years ago

@lucidrains @guanfuchen - Started a PR for masked patch prediction. It is currently a draft; there is still work to do, but I wanted to post it in case anyone has suggestions or optimizations.

lucidrains commented 3 years ago

@guanfuchen @zankner merged here: https://github.com/lucidrains/vit-pytorch#masked-patch-prediction