Closed: guanhdrmq closed this issue 1 year ago
Hi,
This comment is actually outdated: self-supervised pre-training currently beats supervised pre-training, with models like BEiT, MAE, and SimMIM.
All three are based on masking patches for ViT. We do provide a ViTForMaskedImageModeling class exactly for this purpose. It also comes with a pre-training script, allowing you to pre-train a model for masked image modeling yourself on custom data: https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-pretraining.
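For illustration, here is a minimal sketch of masked image modeling with ViTForMaskedImageModeling, following the usage shown in the Transformers docs (the random boolean mask below is a simplified stand-in for a real masking strategy such as SimMIM's):

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, ViTForMaskedImageModeling

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
model = ViTForMaskedImageModeling.from_pretrained("google/vit-base-patch16-224-in21k")

pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Boolean mask over patches, shape (batch_size, num_patches); True = masked
num_patches = (model.config.image_size // model.config.patch_size) ** 2
bool_masked_pos = torch.randint(low=0, high=2, size=(1, num_patches)).bool()

outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
loss = outputs.loss                     # reconstruction loss on the masked patches
reconstructed = outputs.reconstruction  # reconstructed pixel values
```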
We should update that comment ;) feel free to open a PR
Thank you very much for your answer. Can I use ViTForMaskedImageModeling in ViLT as well? I appreciate your valuable answer.
Hi Niels Rogge,
Thanks for replying; I appreciate your valuable feedback.
One more question:
Can you add the ViTForMaskedImageModeling functionality to ViLT as well? Not sure if that is feasible.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi @guanhdrmq, ViLT has its own pre-training objectives, which are different from ViTForMaskedImageModeling. Hence this would require a new ViltForPreTraining class which includes all heads used during the pre-training of ViLT.
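For concreteness, here is a hypothetical sketch of what such a ViltForPreTraining class could look like, assuming simplified heads for ViLT's documented pre-training objectives of masked language modeling (MLM) and image-text matching (ITM). This class does not exist in transformers; the head names and layouts are illustrative only, and the real model would, for example, compute the MLM loss over text tokens only and use whole word masking:

```python
import torch.nn as nn
from transformers import ViltConfig, ViltModel

class ViltForPreTraining(nn.Module):
    """Hypothetical sketch, not part of transformers: wraps the ViLT
    backbone with simplified heads for its two main pre-training
    objectives, masked language modeling and image-text matching."""

    def __init__(self, config: ViltConfig):
        super().__init__()
        self.vilt = ViltModel(config)
        self.mlm_head = nn.Linear(config.hidden_size, config.vocab_size)  # token-level MLM logits
        self.itm_head = nn.Linear(config.hidden_size, 2)                  # binary match / no-match

    def forward(self, input_ids, pixel_values, attention_mask=None):
        outputs = self.vilt(
            input_ids=input_ids,
            pixel_values=pixel_values,
            attention_mask=attention_mask,
        )
        mlm_logits = self.mlm_head(outputs.last_hidden_state)  # per-token vocabulary scores
        itm_logits = self.itm_head(outputs.pooler_output)       # pooled classification
        return mlm_logits, itm_logits
```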
System Info
Hi,
I checked the ViT docs at this link: https://huggingface.co/transformers/v4.6.0/model_doc/vit.html
They say: "The best results are obtained with supervised pre-training, which is not the case in NLP. The authors also performed an experiment with a self-supervised pre-training objective, namely masked patched prediction (inspired by masked language modeling). With this approach, the smaller ViT-B/16 model achieves 79.9% accuracy on ImageNet, a significant improvement of 2% to training from scratch, but still 4% behind supervised pre-training."
I'm not sure: is there a masking function for image patches in ViT? If not, can you add this function to ViT or ViLT? It would be greatly appreciated. Many thanks.
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
https://github.com/lucidrains/vit-pytorch/issues/97
```python
from vit_pytorch.mpp import MPP

# `model` is a vit-pytorch ViT instance whose embedding dim matches `dim` below
mpp_trainer = MPP(
    transformer=model,
    patch_size=32,
    dim=1024,
    mask_prob=0.15,          # probability of using a token in the masked prediction task
    random_patch_prob=0.30,  # probability of randomly replacing a token being used for mpp
    replace_prob=0.50,       # probability of replacing a token being used for mpp with the mask token
)
```
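For reference, the vit-pytorch README drives this trainer roughly as follows (a sketch; the dummy `images` tensor is only an illustrative stand-in for real training data):

```python
import torch

images = torch.randn(8, 3, 256, 256)  # dummy batch of images

loss = mpp_trainer(images)  # masked patch prediction loss
loss.backward()
```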
Expected behavior
I hope this masked patch prediction function can be added to ViT and ViLT.