jpWang / LiLT

Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (ACL 2022)

How is "lilt-only-base" bin file is created #22

vibeeshan025 closed this issue 1 year ago

vibeeshan025 commented 1 year ago

Can you please provide more information about the "lilt-only-base" file and how the model was created?

Since the base file is only 22 MB, I would like to know what dataset, parameters, or logic was used to create it.

I am trying to figure out what possibilities are available and where I should start reading to learn how to create such models. Please point me to more references.

jpWang commented 1 year ago

Hi, you can read our original paper at https://aclanthology.org/2022.acl-long.534/. As explained there, LiLT-base+En-Roberta is pre-trained on English documents, and the provided "lilt-only-base" is exactly the pre-trained LiLT-base part. It can be combined with different textual models during fine-tuning to handle documents in different languages.
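
To make that combination step concrete, here is a minimal PyTorch sketch of the idea behind `gen_weight_roberta_like.py`: merge the LiLT-only state dict with a textual model's state dict into a single checkpoint that can then be fine-tuned. The file paths and the assumption that the two state dicts use non-overlapping weight names are illustrative; the repository's script defines the actual naming scheme.

```python
# Sketch only: combining "lilt-only-base" weights with a textual backbone
# (English roberta-base here). Paths and key layouts are assumptions for
# illustration, not the repository's exact configuration.
import torch

lilt_sd = torch.load("lilt-only-base/pytorch_model.bin", map_location="cpu")
text_sd = torch.load("roberta-base/pytorch_model.bin", map_location="cpu")

# The combined checkpoint is just the union of the two name-value mappings:
# layout-flow weights from the language-independent LiLT part plus
# text-flow weights from the chosen textual model.
merged = {}
merged.update(lilt_sd)
merged.update(text_sd)

torch.save(merged, "lilt-roberta-en-base/pytorch_model.bin")
```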

vibeeshan025 commented 1 year ago

I understand the usage, but I am very curious about how the "lilt-only-base" file itself is created. You mentioned the "pre-trained LiLT-base part"; how is that specific base part produced? We all know how roberta-en is created, and your provided "gen_weight_roberta_like.py" shows how the base and RoBERTa weights are combined.

What does the base part contain?

jpWang commented 1 year ago

PyTorch stores weights as name-value pairs in a dict-like format in 'pytorch_model.bin' files. To create "lilt-only-base", we simply select, by weight name, the name-value pairs belonging to the LiLT part from the pre-trained checkpoint.
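
As a rough illustration of that filtering step (a sketch, not the repository's actual extraction code; the checkpoint path and the `lilt.` key prefix are assumptions), the extraction amounts to keeping only the LiLT-branch entries of the state dict:

```python
# Sketch only: extracting the LiLT-only weights from a full pre-trained
# checkpoint. The input path and the "lilt." prefix are illustrative
# assumptions; the real checkpoint uses its own weight names.
import torch

full_sd = torch.load("lilt-roberta-en-base/pytorch_model.bin", map_location="cpu")

# Keep only the name-value pairs whose names mark them as LiLT (layout-flow) weights.
lilt_only = {name: value for name, value in full_sd.items() if name.startswith("lilt.")}

torch.save(lilt_only, "lilt-only-base/pytorch_model.bin")
```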