Open 14H034160212 opened 1 year ago
Hi, I just want to add that https://github.com/dandelin/ViLT/blob/master/vilt/modules/vilt_module.py is available if you are looking for inspiration. 'objectives.compute_itm_wpa' is their implementation. I need to adapt this for a closed-source project, but I hope we can build something out here.
I am using LayoutLMv3 for object detection, but I am not able to get input_ids, bbox, and attention_mask; I am only getting images. Can you help?
Hi, for anyone who is interested in the implementation of LayoutLMv3: Transformers has updated its code for masked image modeling, and that code is based on DEIT. You can build on it to implement masked image modeling for LayoutLMv3, and you can likewise build on the RoBERTa code to implement masked language modeling. For the word-patch alignment, my work is still in progress. Feel free to start a discussion. Here are the links:
RoBERTa mask language modeling example
DEIT mask image modeling example
More ideas for developing word patch alignment
Other related issue links:
https://github.com/huggingface/transformers/issues/13235
https://github.com/microsoft/unilm/issues/772
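To make the word-patch alignment discussion a bit more concrete, here is a minimal sketch of how the alignment labels could be built. It is not the official LayoutLMv3 code and not ViLT's `compute_itm_wpa`. It assumes the labelling rule described in the LayoutLMv3 paper: an unmasked text token is "aligned" only if the image patches it corresponds to are also unmasked, and masked text tokens are left out of the loss. The function names, the 1/0 label encoding, the choice of -100 as the ignored label, and the rule that a token corresponds to every patch its box overlaps are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical constant: label value skipped by the loss
# (matches the default ignore_index of PyTorch's CrossEntropyLoss).
IGNORE_INDEX = -100

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), normalized to 0..coord_max


def patch_indices_for_box(box: Box, grid: int = 14, coord_max: int = 1000) -> List[int]:
    """Flat indices of the patches on a grid x grid layout that the box overlaps.

    Assumption: x1/y1 are treated as exclusive, so a box that ends exactly
    on a patch boundary does not count the next patch.
    """
    x0, y0, x1, y1 = box
    col0 = min(x0 * grid // coord_max, grid - 1)
    row0 = min(y0 * grid // coord_max, grid - 1)
    col1 = min(max(x1 - 1, x0) * grid // coord_max, grid - 1)
    row1 = min(max(y1 - 1, y0) * grid // coord_max, grid - 1)
    return [r * grid + c
            for r in range(row0, row1 + 1)
            for c in range(col0, col1 + 1)]


def wpa_labels(boxes: Sequence[Box],
               text_masked: Sequence[bool],
               patch_masked: Sequence[bool],
               grid: int = 14,
               coord_max: int = 1000) -> List[int]:
    """Per-token labels: 1 = aligned, 0 = unaligned, IGNORE_INDEX = skipped."""
    labels = []
    for box, is_masked in zip(boxes, text_masked):
        if is_masked:
            # Masked text tokens do not contribute to the WPA loss.
            labels.append(IGNORE_INDEX)
            continue
        patches = patch_indices_for_box(box, grid, coord_max)
        # Unaligned as soon as any overlapped patch is masked (assumption).
        labels.append(0 if any(patch_masked[p] for p in patches) else 1)
    return labels


# Tiny example on a 2x2 patch grid where only patch 1 (top right) is masked.
example_labels = wpa_labels(
    boxes=[(0, 0, 400, 400), (600, 0, 900, 400), (400, 400, 600, 600), (0, 600, 400, 900)],
    text_masked=[False, False, False, True],
    patch_masked=[False, True, False, False],
    grid=2,
)
print(example_labels)
```

Special tokens and padding usually carry a dummy box such as (0, 0, 0, 0); in this sketch they would need to be flagged in `text_masked` so that they are ignored rather than matched to patch 0. The resulting labels could then feed a two-class head over the text token outputs, but that part is left open here.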