huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Masked Patch in ViT and ViLT #20548

Closed guanhdrmq closed 1 year ago

guanhdrmq commented 1 year ago

System Info

Hi,

I checked the ViT docs at this link: https://huggingface.co/transformers/v4.6.0/model_doc/vit.html

They say: "The best results are obtained with supervised pre-training, which is not the case in NLP. The authors also performed an experiment with a self-supervised pre-training objective, namely masked patched prediction (inspired by masked language modeling). With this approach, the smaller ViT-B/16 model achieves 79.9% accuracy on ImageNet, a significant improvement of 2% to training from scratch, but still 4% behind supervised pre-training."
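To make the quoted objective concrete, here is a minimal sketch of the masking step (my own illustration, not code from the docs; shapes assume ViT-B/16 at 224x224 resolution):

```python
import torch

# Masked patch prediction hides a random subset of patch embeddings and trains
# the model to predict the hidden patches, analogous to masked language modeling.
patches = torch.randn(1, 196, 768)   # (batch, num_patches, embed_dim) for ViT-B/16 at 224x224
mask = torch.rand(1, 196) < 0.15     # mask roughly 15% of the patches
mask_token = torch.zeros(768)        # a learned [MASK] embedding in practice
masked_patches = torch.where(mask.unsqueeze(-1), mask_token, patches)
```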

I'm not sure: is there a masking function for image patches in ViT? If not, could you add this function to ViT or ViLT? It would be greatly appreciated. Many thanks.

Who can help?

No response

Information

Tasks

Reproduction

https://github.com/lucidrains/vit-pytorch/issues/97

```python
from vit_pytorch.mpp import MPP

mpp_trainer = MPP(
    transformer=model,
    patch_size=32,
    dim=1024,
    mask_prob=0.15,          # probability of using a token in the masked prediction task
    random_patch_prob=0.30,  # probability of randomly replacing a token used for MPP
    replace_prob=0.50,       # probability of replacing a token used for MPP with the mask token
)
```
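A typical training loop with this trainer, following the vit-pytorch README, looks roughly like this (the batch of random images is a stand-in for real unlabelled data):

```python
import torch

# MPP wraps the ViT and returns the masked patch prediction loss for a batch
opt = torch.optim.Adam(mpp_trainer.parameters(), lr=3e-4)

for _ in range(100):
    images = torch.rand(8, 3, 256, 256)  # stand-in for real unlabelled images
    loss = mpp_trainer(images)
    opt.zero_grad()
    loss.backward()
    opt.step()
```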

Expected behavior

I hope you can add this masked patch function to ViT and ViLT.

NielsRogge commented 1 year ago

Hi,

That comment is actually outdated: self-supervised pre-training now beats supervised pre-training, with models like BEiT, MAE, and SimMIM.

All three are based on masking patches for ViT. We provide a ViTForMaskedImageModeling class exactly for this purpose. It also comes with a pre-training script, allowing you to pre-train a model for masked image modeling yourself on custom data: https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-pretraining.
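For reference, usage looks roughly like this (a sketch following the documented API; the checkpoint name and the random mask are placeholder choices):

```python
import torch
from transformers import ViTForMaskedImageModeling

model = ViTForMaskedImageModeling.from_pretrained("google/vit-base-patch16-224-in21k")

pixel_values = torch.randn(1, 3, 224, 224)  # a dummy batch of one image

# boolean mask over the (224 / 16)^2 = 196 patches; True marks a masked patch
num_patches = (model.config.image_size // model.config.patch_size) ** 2
bool_masked_pos = torch.randint(low=0, high=2, size=(1, num_patches)).bool()

outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
loss = outputs.loss                      # reconstruction loss on the masked patches
reconstruction = outputs.reconstruction  # reconstructed pixel values
```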

We should update that comment ;) Feel free to open a PR!

guanhdrmq commented 1 year ago

Thank you very much for your answer. Can I use ViTForMaskedImageModeling in ViLT as well? I appreciate your valuable answer.

guanhdrmq commented 1 year ago


Hi Niels Rogge,

Thanks for replying; I appreciate your valuable feedback.

One follow-up question: can you add ViTForMaskedImageModeling to ViLT as well? I'm not sure whether that is feasible.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

NielsRogge commented 1 year ago

Hi @guanhdrmq, ViLT has its own pre-training objectives, which are different from the masked image modeling objective of ViTForMaskedImageModeling. Hence, this would require a new ViltForPreTraining class that includes all the heads used during the pre-training of ViLT.
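For illustration only, such a class could combine the backbone with the ViLT pre-training heads along these lines (a hypothetical sketch: ViltForPreTraining does not exist in the library, the head definitions are simplified, and the word-patch alignment objective is omitted):

```python
import torch.nn as nn
from transformers import ViltConfig, ViltModel

# Hypothetical sketch only: ViLT pre-trains with masked language modeling (MLM),
# image-text matching (ITM), and word-patch alignment; the first two heads are
# sketched here with illustrative names.
class ViltForPreTraining(nn.Module):
    def __init__(self, config: ViltConfig):
        super().__init__()
        self.vilt = ViltModel(config)
        self.mlm_head = nn.Linear(config.hidden_size, config.vocab_size)  # per-token vocab logits
        self.itm_head = nn.Linear(config.hidden_size, 2)                  # matched vs. mismatched pair

    def forward(self, input_ids, pixel_values, attention_mask=None):
        outputs = self.vilt(input_ids=input_ids, pixel_values=pixel_values,
                            attention_mask=attention_mask)
        mlm_logits = self.mlm_head(outputs.last_hidden_state)  # predict masked text tokens
        itm_logits = self.itm_head(outputs.pooler_output)      # classify image-text match
        return mlm_logits, itm_logits
```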