WEXEA is an exhaustive Wikipedia entity annotation system that creates a text corpus based on Wikipedia with exhaustive annotations of entity mentions, i.e. it links all mentions of entities to their corresponding articles.
WEXEA runs through several stages of article annotation; the final articles can be found in the 'final_articles' folder in the output directory. Each article is stored in a subfolder named after the first 3 letters of its title (lowercase), and sentences are split so that each line contains one sentence. Annotations follow the Wikipedia conventions, except that the type of the annotation is appended at the end.
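For illustration, a minimal sketch of how one of these final articles could be loaded; only the folder layout is described above, so the '.txt' extension and the exact file name are assumptions:

```python
import os

def read_final_article(output_dir, title):
    """Load one annotated article, one sentence per line.

    Assumes files are stored as
    '<output_dir>/final_articles/<first 3 letters of title, lowercase>/<title>.txt';
    the extension and exact file name are assumptions.
    """
    subfolder = title[:3].lower()
    path = os.path.join(output_dir, "final_articles", subfolder, title + ".txt")
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

# Example (hypothetical output directory):
# sentences = read_final_article("/data/wexea_output", "Canada")
```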
WEXEA for...
These datasets can be used as-is. Each archive contains a single file with all articles concatenated. The articles themselves contain original as well as new annotations of the following formats:
The annotation type of format 1 can be ignored (the type "annotation" corresponds to original annotations; all others are new). Annotations of format 2 are CoreNLP annotations without a corresponding Wikipedia article.
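A small parsing sketch, assuming annotations of format 1 are Wikipedia-style links with the mention and type as extra fields (e.g. [[Article|mention|type]]); the exact field layout should be checked against the data:

```python
import re

# Assumed shape: [[article|mention|type]]; adjust the pattern if the
# fields in the actual data are ordered differently.
ANNOTATION_RE = re.compile(r"\[\[([^\[\]|]+)\|([^\[\]|]+)\|([^\[\]|]+)\]\]")

def extract_annotations(sentence):
    """Yield (article, mention, annotation_type) triples from one sentence."""
    for m in ANNOTATION_RE.finditer(sentence):
        yield m.group(1), m.group(2), m.group(3)

example = "[[Edmonton|Edmonton|annotation]] lies in [[Alberta|Alberta|annotation]]."
for article, mention, ann_type in extract_annotations(example):
    print(article, mention, ann_type)
```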
Download CoreNLP (including models for languages other than English) from https://stanfordnlp.github.io/CoreNLP/index.html
Start server:
java -mx16g -cp "<path to corenlp files>" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000 -threads 6
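Once the server is running, a quick sanity check from Python (the example sentence is arbitrary):

```python
import json
import requests

# Send one sentence to the CoreNLP server started above and print NER tags.
props = {"annotators": "tokenize,ssplit,ner", "outputFormat": "json"}
resp = requests.post(
    "http://localhost:9000/",
    params={"properties": json.dumps(props)},
    data="Edmonton is the capital of Alberta.".encode("utf-8"),
)
resp.raise_for_status()
for token in resp.json()["sentences"][0]["tokens"]:
    print(token["word"], token["ner"])
```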
The entity linker, including its models, is taken from https://github.com/nitishgupta/neural-el.
Download the resources from that repository and adjust the path to the resources folder in src/entity_linker/configs/config.ini.
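To locate the entry that needs to be changed (the actual section and key names inside config.ini are not listed here), the file can be inspected with configparser:

```python
import configparser

CONFIG_PATH = "src/entity_linker/configs/config.ini"

config = configparser.ConfigParser()
config.read(CONFIG_PATH)

# Print every section/key so the entry pointing at the resources folder
# can be found and edited; the key names are whatever the file contains.
for section in config.sections():
    for key, value in config[section].items():
        print(f"[{section}] {key} = {value}")
```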
server.py starts a server and opens a website that can be used to visualize an article with Wikipedia links (blue) and unknown entities (green).
The files we used for evaluation (see Michael Strobl's PhD thesis) can be found in the data folder.
32GB of RAM are required (it may work with 16GB, but this was not tested), and it should take around 2-3 days to finish with a full English Wikipedia dump (less for other languages).
Time consumption was measured on a Ryzen 7 2700X with 64GB of memory, with data read from and written to a hard drive. Runtimes are lower for languages other than English.
Create all necessary dictionaries.
Remove most Wiki markup and irrelevant articles (e.g. lists or stubs), extract aliases, and separate disambiguation pages.
The number of processes can be set to speed up parsing of all articles; however, each process consumes around 7.5GB of memory.
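Purely as an illustration of this trade-off (not the project's actual parser code), a fixed-size worker pool over article files could look like the sketch below; with ~7.5GB per process, four workers need roughly 30GB of memory:

```python
from multiprocessing import Pool

def parse_article(path):
    # Placeholder for the markup-stripping work of this stage.
    return path

if __name__ == "__main__":
    num_processes = 4                     # ~7.5GB per process -> roughly 30GB total
    article_paths = ["a.txt", "b.txt"]    # illustrative input only
    with Pool(processes=num_processes) as pool:
        results = pool.map(parse_article, article_paths)
        print(results)
```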
Run CoreNLP NER and find other entities based on alias/redirect dictionaries.
Run co-reference resolution and EL.
Please cite the following papers:
Original WEXEA publication:
Strobl, Michael, Amine Trabelsi, and Osmar R. Zaïane. "WEXEA: Wikipedia exhaustive entity annotation." Proceedings of the Twelfth Language Resources and Evaluation Conference. 2020.
Updated version (from which the linked datasets above are derived):
Strobl, Michael, Amine Trabelsi, and Osmar R. Zaiane. "Enhanced Entity Annotations for Multilingual Corpora." Proceedings of the Thirteenth Language Resources and Evaluation Conference. 2022.