microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

questions about pre-training of markuplm #803

Open skygl opened 2 years ago

skygl commented 2 years ago

Model I am using: MarkupLM

I have some questions about pre-training of MarkupLM.

  1. Many webpages contain long text. How did you handle long pages during pre-training?
  2. How is a web page node preprocessed when its depth exceeds the maximum?
wolfshow commented 2 years ago

@skygl 1. For long documents, we use the same pre-processing as LayoutLM: we split the documents into blocks with a length of 512. 2. We just trim the deeper nodes.
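The two steps described in the answer can be sketched roughly as follows. This is an illustrative sketch, not the actual MarkupLM code; the names (`MAX_LEN`, `MAX_DEPTH`, `chunk_tokens`, `trim_xpath`) and the depth limit of 50 are assumptions.

```python
# Hedged sketch of the two pre-processing steps: block splitting and depth trimming.
# All names and the MAX_DEPTH value are illustrative, not from the MarkupLM source.

MAX_LEN = 512    # block length for long documents, as stated in the answer
MAX_DEPTH = 50   # assumed maximum XPath depth; the real value may differ

def chunk_tokens(token_ids):
    """Split a long token sequence into consecutive blocks of at most MAX_LEN tokens."""
    return [token_ids[i:i + MAX_LEN] for i in range(0, len(token_ids), MAX_LEN)]

def trim_xpath(tag_path):
    """Keep only the first MAX_DEPTH tags of a node's XPath; deeper tags are dropped."""
    return tag_path[:MAX_DEPTH]
```

For example, a document of 1030 tokens would become three blocks (512, 512, and 6 tokens), and a node nested 102 tags deep would have its XPath truncated to the first 50 tags.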