gazetteer-collection

Overview

This repository contains a set of gazetteers used to improve the performance of a neural named entity recognition system by adding input features that indicate whether a word is part of a name. The system is described in two papers:

The gazetteer files were generated by searching Wikidata via SPARQL queries sent to the public query server to retrieve both canonical names (e.g., Johns Hopkins University) and aliases (e.g., JHU, Johns Hopkins, Hopkins) in each of the languages studied. The first step was to construct a mapping from our project’s 15 target types to Wikidata’s fine-grained type system. Our types included four common core types (person, organization, geopolitical entity (GPE), location) and eleven additional types (airport, chemical, commercial organization, computer hardware/software, event, facility, government building, money, political organization, title, vehicle, weapon).

The mapping for some types was simple: person corresponds to Wikidata’s Q5 and vehicle to Q42889. Others had a complex mapping that eliminated Wikidata subtypes that seemed too specialized (e.g., lunar craters and ice rumples from Wikidata’s geographic object) or allowed us to retrieve more entity names given the public server’s one-minute query timeout.
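As an illustration of the retrieval step described above, here is a minimal Python sketch that builds a SPARQL query for the canonical names (`rdfs:label`) and aliases (`skos:altLabel`) of entities under a given Wikidata type and sends it to the public query service. It is not the script used to build these gazetteers: the function names, the `LIMIT` value, the User-Agent string, and the use of `wdt:P31/wdt:P279*` to cover subtypes are all assumptions made for the example.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata SPARQL endpoint.
WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_name_query(type_qid: str, lang: str, limit: int = 1000) -> str:
    """Build a query for labels and aliases of instances of a Wikidata type.

    Hypothetical helper: the actual type mappings were more complex for
    some types. The LIMIT keeps the query small enough to have a chance of
    finishing within the public server's timeout.
    """
    return f"""
SELECT ?item ?label ?alias WHERE {{
  ?item wdt:P31/wdt:P279* wd:{type_qid} .
  ?item rdfs:label ?label .
  FILTER(LANG(?label) = "{lang}")
  OPTIONAL {{
    ?item skos:altLabel ?alias .
    FILTER(LANG(?alias) = "{lang}")
  }}
}}
LIMIT {limit}
""".strip()


def fetch_names(type_qid: str, lang: str, limit: int = 1000):
    """Run the query and return (canonical_names, aliases) as two sets."""
    query = build_name_query(type_qid, lang, limit)
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        WIKIDATA_SPARQL_ENDPOINT + "?" + params,
        # Placeholder User-Agent; replace with one identifying your client.
        headers={"User-Agent": "gazetteer-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        bindings = json.load(response)["results"]["bindings"]

    canonical, aliases = set(), set()
    for row in bindings:
        canonical.add(row["label"]["value"])
        if "alias" in row:  # the alias is optional in the query
            aliases.add(row["alias"]["value"])
    return canonical, aliases
```

For example, `fetch_names("Q5", "en")` would request English names for the person type and `fetch_names("Q42889", "ru")` Russian names for vehicles. Canonical names and aliases are returned separately, in line with the way the gazetteer files are kept apart.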

The initial name lists were filtered by type-dependent regular expressions to delete names we thought to be unhelpful (e.g., Francis of Assisi as a person, because historical figures are unlikely to be mentioned in our targeted genres), remove Wikipedia artifacts (e.g., parentheticals), and eliminate punctuation, names that were too short or too long, and duplicate names. Although one could say that these changes bias the gazetteers, there is no reason not to engineer a gazetteer in the way that is most helpful for the data. Wikidata is still being used in an automated way, since we are relying on available labels.
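The filtering steps above can be sketched in Python as follows. The patterns and thresholds are hypothetical and chosen only to make the example concrete; the regular expressions and length limits actually used are not given here. In particular, the `person` exclusion pattern (names containing "of") and the `min_len`/`max_len` defaults are assumptions.

```python
import re

# Wikipedia-style parenthetical, e.g. "Paris (Texas)".
PARENTHETICAL = re.compile(r"\s*\([^)]*\)")
# Anything that is neither a word character nor whitespace.
# Note that this also removes hyphens and apostrophes.
PUNCTUATION = re.compile(r"[^\w\s]")

# Hypothetical type-dependent exclusion patterns (assumption for this sketch).
EXCLUDE_BY_TYPE = {
    "person": re.compile(r"\bof\b", re.IGNORECASE),  # e.g. "Francis of Assisi"
}


def clean_names(names, entity_type, min_len=2, max_len=60):
    """Filter and normalize raw names for one entity type, preserving order."""
    exclude = EXCLUDE_BY_TYPE.get(entity_type)
    seen = set()
    cleaned = []
    for name in names:
        name = PARENTHETICAL.sub("", name)       # drop parentheticals
        if exclude and exclude.search(name):     # type-dependent filter
            continue
        name = PUNCTUATION.sub(" ", name)        # replace punctuation with spaces
        name = " ".join(name.split())            # collapse whitespace
        if not (min_len <= len(name) <= max_len):
            continue                             # too short or too long
        if name in seen:
            continue                             # duplicate
        seen.add(name)
        cleaned.append(name)
    return cleaned


raw = ["Francis of Assisi", "Paris (Texas)", "Paris", "J. Smith", "X"]
print(clean_names(raw, "person"))  # ['Paris', 'J Smith']
```

Replacing punctuation with a space rather than deleting it outright is a design choice of this sketch, so that a name such as "J.Smith" still yields two tokens.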

We produced additional lists for Russian using a custom script that generates type-sensitive inflected and familiar forms of canonical names and aliases. For an extreme example, the Russian name for the person Vladimir Vladimirovich Putin (Владимир Владимирович Путин) produces more than 100 variations. The result is a collection of 96 gazetteer files with more than 16M entity names: 4.2M for English, 2.1M for Russian, and 584K for Chinese, with an additional 8.7M Russian names produced by our morphological scripts. We kept the gazetteers for canonical names, aliases, and inflected forms separate to facilitate experimentation.

Content

For more information

For more information, contact the authors as