kermitt2 / entity-fishing

A machine learning tool for fishing entities
http://nerd.readthedocs.io/
Apache License 2.0

Disambiguation in French: Charles Ier (Charlemagne) #29

Open lfoppiano opened 7 years ago

lfoppiano commented 7 years ago

I'm noting this here so it isn't forgotten. Here is an example to be checked, taken from the page for Charlemagne:

Charlemagne, du latin Carolus Magnus, ou Charles Ier dit « le Grand », né le 2 avril 742 (voire 747 ou 748)2, mort le 28 janvier 814 à Aix-la-Chapelle, est un roi des Francs et empereur. Il appartient à la dynastie des Carolingiens, à laquelle il a donné son nom.\nFils de Pépin le Bref, il est roi des Francs à partir de 768, devient par conquête roi des Lombards en 774 et est couronné empereur à Rome par le pape Léon III le 25 décembre 800, relevant une dignité disparue depuis la chute de l'Empire romain d'Occident en 476.\nRoi guerrier, il agrandit notablement son royaume par une série de campagnes militaires, en particulier contre les Saxons païens dont la soumission fut difficile et violente (772-804), mais aussi contre les Lombards en Italie et les musulmans d'Al-Andalus.

The token Charles Ier is disambiguated as Charles Ier (empereur d'Autriche).

When searching for it in the term lookup, there is no confidence score and the ID does not point to the right Wikipedia page (although the Wikidata ID works fine):

[Screenshot, 2017-09-01 at 17:21:59]

Something to be checked

kermitt2 commented 6 years ago

This happens because the entry comes from a disambiguation page and does not occur anywhere else, neither as an anchor nor as a title, hence the missing prior probability (the value shown here is not a confidence score). So this was not really an error, but rather a lack of usable data in Wikipedia, which leaves this lexical entry unused.

This was fixed in branch 0.0.3 by setting a default prior in these disambiguation cases.
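
A minimal sketch of the idea, not the actual entity-fishing code: the prior of an entity for a mention is estimated from how often the mention is used as an anchor or title for that entity, and entities known only through a disambiguation page get a fallback value. The function name, the data layout and the value of `DEFAULT_PRIOR` are assumptions made for illustration; the thread does not state the value used in 0.0.3.

```python
from collections import Counter

# Hypothetical fallback value; the real default is not given in the thread.
DEFAULT_PRIOR = 0.01


def prior_probabilities(anchor_targets, disambiguation_targets,
                        default_prior=DEFAULT_PRIOR):
    """Return {entity: prior} for a single mention.

    anchor_targets: entities the mention points to when it is used as an
        anchor or a title (one list item per occurrence).
    disambiguation_targets: entities listed for the mention on a
        disambiguation page.
    """
    counts = Counter(anchor_targets)
    total = sum(counts.values())
    priors = {entity: n / total for entity, n in counts.items()}
    # Entities seen only on a disambiguation page have no usage counts,
    # so they receive the default prior instead of having none at all.
    for entity in disambiguation_targets:
        priors.setdefault(entity, default_prior)
    return priors
```

With no anchor or title occurrences at all, every entity from the disambiguation page simply ends up with the default prior.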

lfoppiano commented 6 years ago

I'm now re-checking with the new version.

I'm using this query:

{
    "text": "Charlemagne, du latin Carolus Magnus, ou Charles Ier dit « le Grand », né le 2 avril 742 (voire 747 ou 748)2, mort le 28 janvier 814 à Aix-la-Chapelle, est un roi des Francs et empereur. Il appartient à la dynastie des Carolingiens, à laquelle il a donné son nom.\nFils de Pépin le Bref, il est roi des Francs à partir de 768, devient par conquête roi des Lombards en 774 et est couronné empereur à Rome par le pape Léon III le 25 décembre 800, relevant une dignité disparue depuis la chute de l'Empire romain d'Occident en 476.\nRoi guerrier, il agrandit notablement son royaume par une série de campagnes militaires, en particulier contre les Saxons païens dont la soumission fut difficile et violente (772-804), mais aussi contre les Lombards en Italie et les musulmans d'Al-Andalus.",
    "shortText": "",
    "termVector": [],
    "language": {
        "lang": "fr"
    },
    "entities": [],
    "mentions": [
        "ner",
        "wikipedia"
    ],
    "nbest": false,
    "sentence": false,
    "customisation": "generic"
}

Now Charles Ier is disambiguated as Charles Ier (roi d'Angleterre), but the most interesting result is Carolus Magnus, which is disambiguated as the board game:

[Screenshot, 2017-12-08 at 12:38:10]

This is somewhat related, but odd, since the term lookup returns the right entry:

[Screenshot, 2017-12-08 at 12:45:12]

kermitt2 commented 6 years ago

You can't use the old French model with the new version: the features are different. The new disambiguation models for French have to be created first.

lfoppiano commented 6 years ago

With the latest model, Charles Ier is still disambiguated as Charles Ier (roi d'Angleterre), but Carolus Magnus is no longer taken into consideration (which is better, I think).

[Screenshot, 2018-01-04 at 19:52:53]

lfoppiano commented 6 years ago

It seems that Charles Ier doesn't have a specific page or a reference to the Charlemagne page, so it's probably more difficult to find it as a candidate entity for Charlemagne.

Any thoughts?

kermitt2 commented 6 years ago

You're mixing different things :)

Regarding Charles Ier, the problem is indeed that in the French Wikipedia it is not a mention "realizing" the entity Charlemagne. There are plenty of other kings in Wikipedia that are referred to by the mention Charles Ier. Interestingly, the variant Charles I is used as an anchor leading to the Charlemagne Wikipedia page, but its conditional probability is so low (0.005) that in practice it won't be considered anyway.

The solution is possibly to also exploit the Wikidata labels. This is not done now because we don't have statistical information about them to perform the disambiguation. For French, Charles Ier is a label introduced for Q3044. The question of how to use these labels in the disambiguation process without statistical information remains open, however. Maybe good priors? Label propagation?

Regarding mentions that appear or disappear across different queries, that's another issue. They are usually very close to the threshold and, I suppose, sensitive to the random seed, so we can keep track of that in issue #51.

lfoppiano commented 6 years ago

@kermitt2 all correct. I was thinking of something simpler (or maybe the same thing under a different name?): we could extend the candidate matching with the "also known as" information in Wikidata, to increase the chance of matching a Wikipedia article when different forms are used (e.g. Charles Ier -> Wikidata:Q3044 -> Wikipedia:Charlemagne).
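
To make the proposal concrete, here is a rough sketch of the lookup chain mention -> Wikidata ID -> Wikipedia page. It is only an illustration: the three dictionaries and the function name are hypothetical stand-ins for the real indexes, and the example values are the ones mentioned in this thread.

```python
def expand_candidates(mention, anchor_index, wikidata_aliases,
                      wikidata_to_wikipedia):
    """Return anchor-based candidates plus those reached via Wikidata labels."""
    candidates = list(anchor_index.get(mention, []))
    for qid in wikidata_aliases.get(mention, []):
        page = wikidata_to_wikipedia.get(qid)
        # Append only pages that resolve and are not already candidates.
        if page is not None and page not in candidates:
            candidates.append(page)
    return candidates


# Hypothetical toy indexes built from the entities named in this issue.
anchor_index = {
    "Charles Ier": ["Charles Ier (roi d'Angleterre)",
                    "Charles Ier (empereur d'Autriche)"],
}
wikidata_aliases = {"Charles Ier": ["Q3044"]}
wikidata_to_wikipedia = {"Q3044": "Charlemagne"}

expanded = expand_candidates("Charles Ier", anchor_index,
                             wikidata_aliases, wikidata_to_wikipedia)
print(expanded)
```

Note that the added candidate carries no usage statistics, so the ranking step would still need some prior for it.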

@tantikristanti your comment would be better placed in issue #51 ;-)

kermitt2 commented 6 years ago

@lfoppiano usually the problem is that there are too many entity candidates for a given mention... If we add more entity candidates for a given mention without statistical grounding, we end up on average with an ambiguity explosion, endless runtime, and much lower accuracy... The labels in Wikidata are numerous, ranging from the most common to the very, very rare, without any usage information, so we can't use such a simple approach. Currently we limit the entity candidates for a given mention to the top 5 most probable ones to manage this problem. Increasing the number of candidates results in a significant decrease in accuracy.
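
As an illustration of that pruning step, a small sketch (not the project's implementation) that keeps the k most probable candidates for a mention. The function name and the `{entity: prior}` input format are assumptions; only the limit of 5 comes from the comment above.

```python
import heapq


def top_candidates(priors, k=5):
    """Return the k (entity, prior) pairs with the highest prior, best first."""
    return heapq.nlargest(k, priors.items(), key=lambda item: item[1])
```

Under this kind of cut-off, a candidate added from a Wikidata label with a flat or very low prior would typically fall outside the top 5 whenever the mention already has five better-attested candidates, which is why the labels alone are not enough without usage information.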