studio-ousia / luke

LUKE -- Language Understanding with Knowledge-based Embeddings

Pre-training procedures for Entity Disambiguation #157

Closed MrZilinXiao closed 1 year ago

MrZilinXiao commented 2 years ago

Hi @ikuyamada. Thanks for your great work. Would you mind providing the pretraining scripts or procedures used to train the checkpoints you provide here? A related issue is: https://github.com/studio-ousia/luke/issues/126

ikuyamada commented 1 year ago

@MrZilinXiao I am sorry for the delayed reply! I am working on this in this branch; it is still a work in progress, so I will let you know when the work is complete.

ikuyamada commented 1 year ago

I have completed the work, and the pretraining instructions are available here.

MrZilinXiao commented 1 year ago

Hi @ikuyamada. Thanks for your great contribution. I have seen your commits and want to check the following with you:

Comparing the commit you made to the README (https://github.com/studio-ousia/luke/commit/c346f2656c2f5084e773603274c4c17b2fcfdb28) with the pretraining procedures for LUKE (https://github.com/studio-ousia/luke/blob/master/pretraining.md), is the only difference between them the hyperparameters (epochs, etc.) and the create_candidate_data.py file, which includes only candidate entities in the entity vocabulary instead of the 500K most common entities used by LUKE? If there is something I missed, please let me know :)
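
For reference, my current understanding of the candidate-restricted vocabulary is sketched below. This is not the actual create_candidate_data.py; it is only a minimal illustration of the idea, and the function name, the special-token list, and the toy candidate lists are my own assumptions:

```python
# Minimal sketch (not the actual create_candidate_data.py): build an entity
# vocabulary that contains only the candidate entities produced by an alias
# table, instead of the ~500K most frequent entities used for LUKE pretraining.
from collections import OrderedDict

# Assumption: LUKE-style special entity tokens come first in the vocabulary.
SPECIAL_ENTITIES = ["[PAD]", "[UNK]", "[MASK]", "[MASK2]"]

def build_candidate_entity_vocab(mention_candidates):
    """mention_candidates: iterable of candidate-entity title lists, one per mention."""
    vocab = OrderedDict((title, idx) for idx, title in enumerate(SPECIAL_ENTITIES))
    for candidates in mention_candidates:
        for title in candidates:
            if title not in vocab:
                vocab[title] = len(vocab)
    return vocab

# Toy usage with made-up candidate lists:
vocab = build_candidate_entity_vocab([["Tokyo", "Tokyo_(album)"], ["Japan"]])
print(len(vocab))  # 4 special tokens + 3 candidate entities = 7
```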

ikuyamada commented 1 year ago

Hi @MrZilinXiao, Thanks for your reply!

In addition to the difference in the entity vocabulary, the main differences are as follows:

MrZilinXiao commented 1 year ago

Great, thanks for pointing that out. I wish you success in your future research career.

MrZilinXiao commented 1 year ago

Hi @ikuyamada. I created a PR to fix some mistakes in the instructions: https://github.com/studio-ousia/luke/pull/164.

MrZilinXiao commented 1 year ago

Hi @ikuyamada. Sorry to bother you again; we are following your work, and your experience could save us some effort :)

  1. The config.json in the weights you recently uploaded to Google Drive shows a bert-large-uncased model with entity_emb_size=1024. The GlobalED paper mentions decomposing the entity embedding matrix into the product of two smaller matrices. However, https://github.com/studio-ousia/luke/blob/a3e51de0661537fc87d6a80c4a90c0a27763d364/luke/model.py#L53 indicates that when entity_emb_size == hidden_size (which is what this config.json implies), no dense projection layer is used (see the sketch after this list). Were the uploaded weights trained in the same way the paper describes?
  2. If not, would you mind sharing the performance difference between base vs. large, and between a decomposed embedding table vs. a single full-size one? No extra experiments are needed; your general experience would be enough.
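
For context, item 1 refers to the conditional projection in luke/model.py. Below is a rough paraphrase of that logic, not the exact source; I am assuming the class and attribute names follow the config keys:

```python
# Rough paraphrase of the logic referenced in luke/model.py (not the exact source):
# entity embeddings are stored at entity_emb_size and projected up to hidden_size
# only when the two sizes differ, so entity_emb_size == hidden_size (as in the
# uploaded config.json, 1024 == 1024) means no dense projection layer is created.
import torch.nn as nn

class EntityEmbeddingsSketch(nn.Module):
    def __init__(self, entity_vocab_size, entity_emb_size, hidden_size):
        super().__init__()
        self.entity_embeddings = nn.Embedding(entity_vocab_size, entity_emb_size, padding_idx=0)
        if entity_emb_size != hidden_size:
            # Decomposed case: a V x d_e table followed by a d_e x H projection.
            self.entity_embedding_dense = nn.Linear(entity_emb_size, hidden_size, bias=False)

    def forward(self, entity_ids):
        embeddings = self.entity_embeddings(entity_ids)
        if hasattr(self, "entity_embedding_dense"):
            embeddings = self.entity_embedding_dense(embeddings)
        return embeddings
```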

ikuyamada commented 1 year ago

Hi @MrZilinXiao,

Thank you for your continued interest in LUKE.

> The GlobalED paper mentioned decomposing entity embedding into two smaller matrices multiplied.

Unlike the LUKE model, we do not decompose the entity embeddings in our entity disambiguation model. I do not think our entity disambiguation paper mentions decomposing the entity embeddings.
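
As a side note for anyone reading along, the practical motivation for the decomposition discussed above is the size of the entity embedding table. The numbers below are illustrative assumptions (not taken from this thread), just to show the parameter-count arithmetic:

```python
# Back-of-the-envelope parameter counts for a full vs. a decomposed entity
# embedding table. Sizes are assumed for illustration only.
V = 500_000      # entity vocabulary size (assumed)
H = 1024         # transformer hidden size (bert-large-style)
d_e = 256        # reduced entity embedding size when decomposing (assumed)

full_table = V * H              # single V x H table
decomposed = V * d_e + d_e * H  # V x d_e table plus d_e x H projection

print(f"full:       {full_table:,} parameters")   # 512,000,000
print(f"decomposed: {decomposed:,} parameters")   # 128,262,144
```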