microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

Script for LayoutXLM pretraining #375

Closed Armandgrd closed 3 years ago

Armandgrd commented 3 years ago

Hello, I would like to continue pre-training LayoutXLM on my own domain data; however, I only found scripts for fine-tuning on particular tasks (FUNSD and XFUN). Would it be possible for you to provide the script that was used for the pre-training step? It would be incredibly useful. I am also interested in how you perform the pre-training tasks described in the paper (MMLM, TLM, XlCo).

By the way, thanks for your great work on these models.

wolfshow commented 3 years ago

@Armandgrd We have no bandwidth for a pre-training code release right now, but we plan to do this in the future.

Armandgrd commented 3 years ago

Hello, thanks for your answer @wolfshow ,

I rewrote the pre-training code anyway. It would still be helpful if you could provide your code as it is now, even in a non-working state; I just want to check whether I implemented something incorrectly.

Would it be possible to get the parameters of the pre-trained model? For now I am using the fine-tuned version, and the parameters of the MLM head are unfortunately initialized randomly, which is a shame IMO.
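To make the problem concrete, here is a minimal sketch of my setup, assuming the Hugging Face `transformers` LayoutLMv2 implementation (which LayoutXLM shares); the Hub checkpoint name is a placeholder for whichever weights you have, and the MLM head is my own addition, not part of any released checkpoint:

```python
import torch.nn as nn
from transformers import LayoutLMv2Model

# Encoder weights come from a released (fine-tuned) checkpoint.
# Checkpoint name is a placeholder; LayoutLMv2 also needs detectron2 installed.
encoder = LayoutLMv2Model.from_pretrained("microsoft/layoutxlm-base")

# No MLM head ships with the checkpoint, so this projection starts
# from random weights -- exactly the problem described above.
mlm_head = nn.Linear(encoder.config.hidden_size, encoder.config.vocab_size)
```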

I would also be happy to contribute 🙂 if there is anything you need help with that could free up some of your time.

valentinkoe commented 3 years ago

@Armandgrd would you be willing to share your version of the pre-training code?

Armandgrd commented 3 years ago

@valentindey Sure, have a look at this gist. Let me know if you need anything else. Hope it helps!

valentinkoe commented 3 years ago

@Armandgrd thanks a lot for sharing! It definitely helps. Do you happen to also have the code for preparing data for the different training tasks?

Armandgrd commented 3 years ago

@valentindey Unfortunately, I only implemented the part for Multilingual Masked Visual-Language Modeling, and the code for preparing the data is highly dependent on the format of your documents, so I am not sure it will help.

The high level is (a rough sketch of steps 2 and 3 follows the list):

  1. read document images
  2. detect words and boxes using pdfplumber (the original authors used PyMuPDF), including CLF, in a CustomDocDataset and data loader
  3. tokenize text, create masks
  4. fit and evaluate with a cross-entropy loss
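For what it's worth, here is a rough sketch of steps 2 and 3 under my own assumptions: pdfplumber for word/box extraction, the XLM-R tokenizer that LayoutXLM uses, and plain 15% BERT-style masking (simplified; I have no idea whether it matches the authors' exact recipe):

```python
import random
import pdfplumber
from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

def extract_words_and_boxes(pdf_path, page_no=0):
    """Step 2: words plus boxes normalized to the 0-1000 grid LayoutLM expects."""
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_no]
        w, h = page.width, page.height
        words, boxes = [], []
        for word in page.extract_words():
            words.append(word["text"])
            boxes.append([
                int(1000 * word["x0"] / w),
                int(1000 * word["top"] / h),
                int(1000 * word["x1"] / w),
                int(1000 * word["bottom"] / h),
            ])
    return words, boxes

def mask_tokens(input_ids, mask_prob=0.15):
    """Step 3: simplified MLM masking (always <mask>; full BERT masking
    also uses 10% random-token and 10% keep-original replacements)."""
    masked, labels = [], []
    for tok in input_ids:
        if tok not in tokenizer.all_special_ids and random.random() < mask_prob:
            masked.append(tokenizer.mask_token_id)
            labels.append(tok)   # predict the original token at this position
        else:
            masked.append(tok)
            labels.append(-100)  # ignored by CrossEntropyLoss
    return masked, labels
```

One thing to watch out for: the tokenizer splits words into subwords, so each word's box has to be repeated for every subword it produces.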

It would be great to have the authors' code for Text-Image Alignment and Matching, since that is the main innovation of their paper. I plan to implement them in the near future; my current understanding is sketched below.
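As a starting point, here is how I currently read those two objectives from the LayoutLMv2 paper, sketched with my own hypothetical module and names, not the authors' code: TIA is a per-token binary classification of whether the token's image region was covered, and TIM is a sequence-level binary classification of whether the image and the text belong together:

```python
import torch.nn as nn

class TiaTimHeads(nn.Module):
    """Hypothetical heads for Text-Image Alignment / Matching (names are mine)."""

    def __init__(self, hidden_size):
        super().__init__()
        # TIA: per-token "covered vs. not covered" classification.
        self.tia = nn.Linear(hidden_size, 2)
        # TIM: sequence-level "matched vs. not matched" on the [CLS] state.
        self.tim = nn.Linear(hidden_size, 2)

    def forward(self, sequence_output):
        tia_logits = self.tia(sequence_output)        # (batch, seq_len, 2)
        tim_logits = self.tim(sequence_output[:, 0])  # (batch, 2), [CLS] is token 0
        return tia_logits, tim_logits
```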