Open lfoppiano opened 7 years ago
It's because it comes from a disambiguation page and does not occur elsewhere, either as an anchor or as a title, hence the lack of prior probability (it's not a confidence score here). So this was not really an error, but rather a lack of usable data in Wikipedia, which leads to this unused lexical entry.
This was fixed in branch 0.0.3 by setting a default prior in these disambiguation cases.
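To illustrate the idea of a fallback prior, here is a minimal sketch. It is not the actual fix from the branch: the function name, the `DEFAULT_PRIOR` value, and the count-based formula are assumptions made for illustration only.

```python
# Hypothetical value: the thread does not say which default prior was chosen.
DEFAULT_PRIOR = 0.01


def prior_probability(entity_link_count, mention_link_count, default=DEFAULT_PRIOR):
    """Estimate P(entity | mention) from link counts.

    entity_link_count: times the mention was used as an anchor/title for this entity
    mention_link_count: times the mention was used as an anchor/title for any entity
    Falls back to `default` when there is no usage data at all, which is
    the situation described for entries that only come from a disambiguation page.
    """
    if mention_link_count <= 0:
        return default
    return entity_link_count / mention_link_count
```

With no usage data, `prior_probability(0, 0)` returns the default instead of leaving the entry without any prior.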
I'm now re-checking with the new version.
I'm using this query:
{
    "text": "Charlemagne, du latin Carolus Magnus, ou Charles Ier dit « le Grand », né le 2 avril 742 (voire 747 ou 748)2, mort le 28 janvier 814 à Aix-la-Chapelle, est un roi des Francs et empereur. Il appartient à la dynastie des Carolingiens, à laquelle il a donné son nom.\nFils de Pépin le Bref, il est roi des Francs à partir de 768, devient par conquête roi des Lombards en 774 et est couronné empereur à Rome par le pape Léon III le 25 décembre 800, relevant une dignité disparue depuis la chute de l'Empire romain d'Occident en 476.\nRoi guerrier, il agrandit notablement son royaume par une série de campagnes militaires, en particulier contre les Saxons païens dont la soumission fut difficile et violente (772-804), mais aussi contre les Lombards en Italie et les musulmans d'Al-Andalus.",
    "shortText": "",
    "termVector": [],
    "language": {
        "lang": "fr"
    },
    "entities": [],
    "mentions": [
        "ner",
        "wikipedia"
    ],
    "nbest": false,
    "sentence": false,
    "customisation": "generic"
}
Now Charles Ier is disambiguated as Charles Ier (roi d'Angleterre), but the most interesting result is Carolus Magnus, which is disambiguated as the board game. That is somehow related, but bizarre, since the term lookup pulls out the right entry.
You can't use the old French model with the new version, features are different. The new disambiguation models for French have to be created first.
With the latest model, Charles Ier is still disambiguated as Charles Ier (roi d'Angleterre), but Carolus Magnus is not taken into consideration (which is better, I think).
It seems that Charles Ier doesn't have a specific page or a reference to the Charlemagne page, so it's probably more difficult to find it as a candidate entity for Charlemagne.
Any thoughts?
You're mixing different things :)
Regarding Charles Ier, the problem is indeed that in the French Wikipedia it is not a mention "realizing" the entity Charlemagne. There are plenty of other kings in Wikipedia that are referred to with the mention Charles Ier. Interestingly, the variant Charles I is used as an anchor leading to the Charlemagne Wikipedia page, but the conditional probability is so low (0.005) that in practice it won't be considered anyway.
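As a rough illustration of why such an anchor is dropped, here is a sketch that computes the conditional probability of each target page given an anchor text, then applies a cutoff. The link counts are invented to reproduce the 0.005 figure, and the `MIN_PROB` threshold is a hypothetical value, not the one used by the tool.

```python
from collections import Counter


def conditional_probabilities(link_targets):
    """Return P(entity | anchor) for every entity the anchor text links to."""
    counts = Counter(link_targets)
    total = sum(counts.values())
    return {entity: n / total for entity, n in counts.items()}


# Invented counts for the anchor "Charles I": 200 links in total, 1 to Charlemagne.
links = (
    ["Charles Ier (roi d'Angleterre)"] * 150
    + ["Charles Ier (empereur d'Autriche)"] * 49
    + ["Charlemagne"] * 1
)
probs = conditional_probabilities(links)

MIN_PROB = 0.01  # hypothetical cutoff
kept = {entity: p for entity, p in probs.items() if p >= MIN_PROB}
```

Here `probs["Charlemagne"]` is 1/200 = 0.005, so Charlemagne falls below the cutoff and is absent from `kept`.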
The solution is possibly to also exploit the Wikidata labels. This is not done now because we don't have statistical information about them to perform the disambiguation. For French, Charles Ier is a label introduced for Q3044. The question of how to use these labels in the disambiguation process without statistical information remains open, however. Maybe good priors? Label propagation?
Regarding mentions that appear or not depending on the query, that's another issue. They are usually very close to the threshold and, I suppose, sensitive to the random seed, so we can keep track of that in issue #51.
@kermitt2 all correct. I was thinking of something maybe simpler (or maybe the same thing under a different name?): we could extend the candidate matching using the "also known as" information in Wikidata, to increase the chance of matching a Wikipedia article when different forms are used (e.g. Charles Ier -> Wikidata:Q3044 -> Wikipedia:Charlemagne).
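A minimal sketch of that lookup chain, assuming two in-memory tables: one mapping Wikidata ids to their per-language labels/aliases, and one mapping ids to Wikipedia page titles. The table names and the function are hypothetical, and the data is a toy sample containing only the Q3044 example from this thread.

```python
# Toy data, for illustration only.
WIKIDATA_ALIASES = {
    "Q3044": {"fr": ["Charles Ier"]},
}
WIKIPEDIA_TITLES = {
    "Q3044": {"fr": "Charlemagne"},
}


def candidates_from_aliases(mention, lang):
    """Return (wikidata_id, wikipedia_title) pairs whose aliases match the mention."""
    matches = []
    for wikidata_id, aliases_by_lang in WIKIDATA_ALIASES.items():
        if mention in aliases_by_lang.get(lang, []):
            title = WIKIPEDIA_TITLES.get(wikidata_id, {}).get(lang)
            if title is not None:
                matches.append((wikidata_id, title))
    return matches
```

Note that candidates found this way carry no usage statistics, so they would still need some prior before they could be ranked against anchor-based candidates.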
@tantikristanti your comment should be better moved to task #51 ;-)
@lfoppiano usually the problem is that there are too many entity candidates for a given mention... If we add more entity candidates for a given mention without statistical grounding, we end up on average with an ambiguity explosion, endless runtime, much lower accuracy... The labels in Wikidata are numerous, from the most common to the very, very rare, without any usage information, so we can't use such a simple approach. Currently we limit the entity candidates for a given mention to the top-5 most probable ones to manage this problem. Increasing the number of candidates results in a significant accuracy decrease.
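The top-5 pruning can be sketched as follows. This is an illustration of the general technique, not the project's code: the function name, the `(entity, prior)` tuple layout, and the sample priors are all assumptions.

```python
import heapq


def top_candidates(candidates, k=5):
    """Keep the k candidates with the highest prior, most probable first.

    candidates: iterable of (entity, prior_probability) pairs.
    """
    return heapq.nlargest(k, candidates, key=lambda c: c[1])


# Hypothetical candidates for one mention, with invented priors.
candidates = [
    ("E1", 0.40), ("E2", 0.25), ("E3", 0.15), ("E4", 0.10),
    ("E5", 0.05), ("E6", 0.03), ("E7", 0.02),
]
pruned = top_candidates(candidates)
```

Only E1 to E5 survive. Candidates without any usable prior would have to be ranked arbitrarily, which is one way to see why adding labels without usage information does not fit this scheme directly.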
I'm writing this here so as not to forget. Here is an example to be checked, from the Charlemagne page: the token Charles Ier is disambiguated as Charles Ier (empereur d'Autriche).
When searching for it in the term lookup, there is no confidence and the id does not point to the right Wikipedia page (although the Wikidata id works fine). Something to be checked.