sgottsch / eventkg

The EventKG is a novel multilingual resource incorporating event-centric information extracted from several large-scale knowledge graphs such as Wikidata, DBpedia and YAGO, as well as less structured sources such as the Wikipedia Current Events Portal and Wikipedia event lists in five languages. The EventKG is an extensible event-centric resource modeled in RDF. It relies on Open Data and best practices to make event data spread across different sources available through a common representation and reusable for a variety of novel algorithms and real-world applications.
http://eventkg.l3s.uni-hannover.de/
46 stars 11 forks source link

EventKG

eventkg.l3s.uni-hannover.de

The EventKG is a multilingual resource incorporating event-centric information extracted from several large-scale knowledge graphs such as Wikidata, DBpedia and YAGO, as well as less structured sources such as Wikipedia Current Events and Wikipedia event lists in five languages. The EventKG is an extensible event-centric resource modeled in RDF. It relies on Open Data and best practices to make event data spread across different sources available through a common representation and reusable for a variety of novel algorithms and real-world applications.

Public SPARQL Endpoint

If you just want to run queries on the current version of EventKG, you can simply use our public SPARQL endpoint.

You can find a tutorial about writing SPARQL queries for EventKG here.

Configuration

Create a configuration file like the following to state where to store your EventKG version, and the languages and dumps to be used for extraction:

data_folder .../eventkg/data/
languages   en,fr,de,it,ru,pt,es,nl,pl,no,ru,hr,sl,bg,da
wikipedia   20220601
wikidata    20220601
dbpedia 2022.03.01

Currently, the 15 languages English (en), French (fr), German (de), Italian (it), Russian (ru), Portuguese (pt), Spanish (es), Dutch (nl), Polish (pl), Norwegian (no), Romanian (ru), Croatian (hr), Slovene (sl), Bulgarian (bg) and Danish (da) are supported. Timestamps of current Wikipedia dumps can be found on https://dumps.wikimedia.org/enwiki. Usually, the dump dates are consistent between languages. The chosen dump needs to say "Dump complete" on the dump's website. Wikidata dumps are listed on https://dumps.wikimedia.org/wikidatawiki/entities/. There is one dump for each language. DBpedia is dumped for all languages at once. The newest dump is listed, for example, on the top of https://databus.dbpedia.org/marvin/mappings/instance-types.

Run the extraction

The EventKG extraction pipeline consists of several steps described in the following. Consider that some of these step require some time and resources (e.g. for the data download, for processing the big Wikidata dump file, and for processing the Wikipedia XML files).

  1. Export the Pipeline class (de.l3s.eventkg.pipeline.Pipeline) as executable jar (Pipeline.jar).

  2. Start the data download using:

java -jar Pipeline.jar path_to_config_file.txt 1

  1. Run the first steps of extraction:

java -jar Pipeline.jar path_to_config_file.txt 2,3

  1. Export the Dumper class (de.l3s.eventkg.source.wikipedia.mwdumper.Dumper) as Jar (Dumper.jar). Run the extraction from the Wikipedia dump files for each language by running the following command (here for Portuguese, replace pt with other languages if needed). GNU parallel is required.

nohup parallel -j9 "bzip2 -dc {} | java -jar -Xmx6G -Xss40m Dumper.jar path_to_config_file.txt pt" :::: data/raw_data/wikipedia/pt/dump_file_list.txt 2> log_dumper.txt

  1. Start the final steps of extraction:

java -jar Pipeline.jar path_to_config_file.txt 4,5,6,7,8

  1. The resulting .nq files can be found in the folder data/output.

An example script to execute the whole pipeline is given in pipeline.sh.

Manual Configuration

EventKG extracts information from several reference sources and fits them into the EventKG schema. Therefore, several expressions needs to be defined manually. This includes mappings from source-specific property labels to the EventKG schema and language-specific temporal expressions, as explained below. If the reference sources get updated or a new language is included in EventKG, manual changes are necessary for these files.

Mappings

Several relations are directly mapped to properties defined in the EventKG schema, e.g. sem:hasSubEvent and sem:hasPlace. These mappings are defined in the following source-specific files.

File name Content Description Example
wikidata/property_names_locations.tsv Wikidata properties denoting location relations (e.g. FIFA World Cup 2018, sem:hasPlace, Russia). Each line contains a pair (property id | property label). P36 | capital
wikidata/temporal_property_names.tsv Wikidata properties denoting time relations. Each line contains a triple (property id | property label | s/e/b), where "s" are start time, "e" end times and "b" both start and end times. P569 | date of birth | s
wikidata/properties_sublocations.tsv Wikidata properties denoting sub and parent location relations (e.g. Paris, so:containedInPlace, France). Each line contains a triple (property id | property label | p/s), where "s" are sub relations and "p" are parent relations. P1376 | capital of | s
wikidata/event_blacklist_classes.tsv Wikidata classes whose instances may not be identified as events. Q1914636 | activity
yago/time_properties.tsv YAGO properties denoting time relations. Each line contains a pair (property id | s/e/b), where "s" are start time, "e" end times and "b" both start and end times. \<happenedOnDate> | b
dbpedia/part_of_properties.tsv DBpedia properties denoting "part of" relations. \<isPartOf>

Terms

For each language, a list of terms is needed that is used when extracting data from Wikipedia.

Examples:

Name Meaning Examples (en)
forbiddenLinks Links that are ignored Wikipedia:Citation_needed
forbiddenNameSpaces Link namespaces that are ignored Talk, User
talkSuffix Suffix of talk/discussion pages talk
talkPrefix Prefix of talk/discussion pages
forbiddenInternalLinks WT:, H:
tableOfContents Section title of the table of contents in the Wikipedia pages Contents
seeAlsoLinkClasses CSS class of "see also" links in Wikipedia hatnote
titlesNotToTranslate Section titles in the Wikipedia pages that are ignored See also, References
fileLabels Prefix of file links File
categoryLabel Prefix of category pages Category
imageLabels Prefix of image links Image
listPrefixes Prefix of Wikipedia pages that are lists Listsof
categoryPrefixes Prefix of category pages Category:
eventsLabels Section titles in Wikipedia event pages that denote list of textual events Events
monthNames List of month names (starting from January, one weekday per line. Alternatives separated by ";") January
weekdayNames List of weekday names (starting from Monday, one weekday per line. Alternatives separated by ";") Monday
eventCategoryRegexes Regexes that match Wikipedia categories with event pages (e.g. "Category:Political_events") .+events$

Event Date Expressions

For each language, a list of time expressions is needed that is used when extracting textual events from Wikipedia event lists.

Examples:

Name Meaning
predefined regexes A set of placeholders which are given in the code and can be re-used. No need to change this
new regexes New placeholders that can be re-used later on.
dayTitle Regex for Wikipedia page titles that represent a specific day. For example "^@regexMonth1@ @regexDay1@$" to find "March 15" for the Wikipedia article https://en.wikipedia.org/wiki/March_15. Other example: en: January 22, de: 22. Januar, fr: 22 janvier, pt: 22 de janeiro, ru: 22 Ñ��½�²�°Ñ€Ñ�
yearTitlePatterns Regexes for Wikipedia page titles that represent a specific year. For example "^(?[0-9]{@digitsInYear@}) in .*$" to find "2007 in philosophy" for the Wikipedia article https://en.wikipedia.org/wiki/2007_in_philosophy.
datePatterns A list of regexes to extract date expressions from event texts.
dateLinkResolvers Sometimes, dates are given as links, which are resolved using these regexes. The "" group denotes the anchor text. For example to find 474 BC in "*[[474 BC]] – [[Roman consul]] ...".

License

This project is licensed under the terms of the MIT license (see LICENSE.txt).