studio-ousia / luke

LUKE -- Language Understanding with Knowledge-based Embeddings
Apache License 2.0

What is the pre-training corpus in LUKE? #112

Closed lshowway closed 2 years ago

lshowway commented 2 years ago

Thanks for your work. According to the paper, the pretraining corpus is Wikipedia with entity annotations. Is this corpus the same as the one used for NTEE? If not, could you provide a link or other details so I can learn more about it?

Thanks.

ikuyamada commented 2 years ago

Hi @lshowway,

The pretraining corpus is created from a Wikipedia dump, which can be downloaded from the Wikimedia dumps site. The dump file is preprocessed using the build_wikipedia_pretraining_dataset command, and the dump_db_file it consumes can be built with the build-dump-db command of Wikipedia2Vec. Since the NTEE model was also trained on a Wikipedia dump, the pretraining corpus of NTEE is the same as that of LUKE.
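
For anyone reconstructing the pipeline, the steps might look roughly like the sketch below. Note this is a minimal sketch, not copied from the repo docs: the dump URL is the canonical Wikimedia location (the original hyperlink was lost from this page), the wikipedia2vec build-dump-db invocation follows the Wikipedia2Vec CLI, and the LUKE entry point (luke/cli.py) plus its arguments are assumptions inferred from the command name above.

```bash
# 1. Download an English Wikipedia dump (canonical Wikimedia location;
#    choose a specific dump date if you want reproducibility).
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

# 2. Build the dump DB with Wikipedia2Vec's build-dump-db command.
wikipedia2vec build-dump-db \
    enwiki-latest-pages-articles.xml.bz2 \
    enwiki.db

# 3. Preprocess the dump DB into the pretraining dataset. The entry
#    point and arguments below are ASSUMPTIONS based on the command
#    name mentioned above -- check the repo's pretraining instructions
#    for the exact invocation.
ENTITY_VOCAB=entity_vocab.jsonl    # assumed filename
OUTPUT_DIR=pretraining_dataset/
python luke/cli.py build_wikipedia_pretraining_dataset \
    enwiki.db "$ENTITY_VOCAB" "$OUTPUT_DIR"
```

Step 2 has to run before step 3, since build_wikipedia_pretraining_dataset takes the dump_db_file as input.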