microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

Some ideas for developing Mask Language Modeling, Mask Image Modeling and Word-Patch Alignment for LayoutLMv3 #1076

Open 14H034160212 opened 1 year ago

14H034160212 commented 1 year ago

Hi, for anyone who is interested in the implementation of LayoutLMv3: Transformers has updated its code for mask image modeling, and that code is based on DEIT. You can adapt it to implement Mask Image Modeling for LayoutLMv3, and you can likewise adapt the RoBERTa code to implement mask language modeling. The word-patch alignment is still in progress on my side. Feel free to join the discussion. Here are the links: RoBERTa mask language modeling example

DEIT mask image modeling example

More ideas for developing word patch alignment

Other related issue links https://github.com/huggingface/transformers/issues/13235 https://github.com/microsoft/unilm/issues/772
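
To make the word-patch alignment idea more concrete, here is a minimal sketch of how the labels could be built. It follows my reading of the LayoutLMv3 paper: an unmasked text token is "aligned" when its image patches are also unmasked, "unaligned" otherwise, and masked text tokens are left out of the loss. Everything else is an assumption for illustration, not the official implementation: the function names `patches_for_box` and `wpa_labels` are hypothetical, boxes are assumed to be normalized to 0-1000, the 14x14 patch grid assumes a 224x224 image with 16x16 patches, the 1/0 label encoding is arbitrary, and treating a token as unaligned when any overlapping patch is masked is one possible interpretation.

```python
from typing import List, Sequence, Tuple

IGNORE_INDEX = -100  # common "ignore this position" label for cross-entropy losses

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), assumed normalized to [0, scale]


def patches_for_box(box: Box, grid: int = 14, scale: int = 1000) -> List[int]:
    """Return flat indices of the grid x grid patches overlapped by the box."""
    x0, y0, x1, y1 = (int(v) for v in box)
    # Map the top-left corner to a patch column/row, clamped to the grid.
    c0 = min(grid - 1, max(0, x0 * grid // scale))
    r0 = min(grid - 1, max(0, y0 * grid // scale))
    # Treat the right/bottom edge as exclusive so a box ending exactly on a
    # patch boundary does not spill into the next patch.
    c1 = min(grid - 1, max(c0, (max(x1, x0 + 1) - 1) * grid // scale))
    r1 = min(grid - 1, max(r0, (max(y1, y0 + 1) - 1) * grid // scale))
    return [r * grid + c for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]


def wpa_labels(
    boxes: Sequence[Box],
    text_masked: Sequence[bool],
    patch_masked: Sequence[bool],
    grid: int = 14,
    scale: int = 1000,
) -> List[int]:
    """Build one label per text token: 1 = aligned, 0 = unaligned, -100 = ignored."""
    labels = []
    for box, is_masked in zip(boxes, text_masked):
        if is_masked:
            # Masked text tokens are excluded from the WPA loss.
            labels.append(IGNORE_INDEX)
            continue
        covered = patches_for_box(box, grid, scale)
        # Assumption: a single masked patch under the word makes it unaligned.
        aligned = not any(patch_masked[p] for p in covered)
        labels.append(1 if aligned else 0)
    return labels
```

The resulting labels could then feed a small binary classification head on top of the text token outputs, with `-100` positions skipped by the loss. The `patch_masked` list would come from whatever mask image modeling masking you already apply.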

dariuszlee commented 1 year ago

Hi, I just want to add that https://github.com/dandelin/ViLT/blob/master/vilt/modules/vilt_module.py is available if you are looking for inspiration. 'objectives.compute_itm_wpa' is their implementation. I need to adapt this for a closed-source project, but I hope we can build something out here.

suresh1505 commented 1 year ago

I am using LayoutLMv3 for object detection, but I am not able to get input_ids, bbox, and attention_mask; I am only getting images. Can you help?