Closed lfoppiano closed 6 years ago
In addition, even without wikipediaExternalRef and wikidataId, the entity should also be "forced" as a user-specified mention (the mention has to be present in the result, disambiguated or not).
I think this is actually necessary if we want to plug entity-fishing into GERBIL, as the evaluations are made only on entity disambiguation, with the mentions given in advance.
Here the mentions are marked as coming from a user only when the wikipediaExternalRef is present.
A few questions:
What is the approach: a. more constraints:
OR
b. fewer constraints: the user can input anything, a mention or an entity, with or without disambiguation information
- are they to be kept as long as they are in the input list of entities?
I assume you are talking about the mention? I would say the mention should be kept in the final output even if no entity is found for it; in that case the entity is left "NIL".
Do you mean: what if the mention description is not consistent with the input text (wrong offsets)? Or do you mean something else?
About a) -> no offset means not a mention, so an entity that is "just here to help" should be passed through a customization, not as a provided annotation
-> for me, a mention description provided by the user has to be consistent with the input text (valid offset, valid span); otherwise it is discarded
b) The foreseen design and requirements are as follows: to pass an entity that is not an annotation, the user should use the customization; to pass a mention with or without a disambiguated entity, the user should use user annotations (and the annotation description has to be valid with respect to the input text)
OK, so to summarise:
we consider all entries from users, and they have to be present in the output (whether or not they contain wikipedia/wikidata ids)
entities provided have to be consistent with the text; alternatively, use customisation.
In this case it's a neat solution; however, when processing a long text in chunks (paragraphs or groups of sentences), it would require supplying the whole text for each query, instead of supplying only the part to be processed together with the previously obtained annotations.
I'm fine to keep it as it is now, just to be sure we are on the same page.
We can send only a part to be processed (and not the whole text) and provide the previously obtained entities in the customization as context. That was actually the idea for processing a long text: the client takes care of the segmentation (in paragraphs) and provides a context for each segment via the customization.
OK, then point number 2 cannot be applied as it is...
We could then say: mentions have to be consistent with the text, whereas entities can also be "there to help".
After a later discussion and some thought, we can summarise:
when a mention or an entity is not correctly aligned with the text (wrong offset), it is ignored
Regarding the example in the first comment, the entity is ignored because it is outside the boundaries of the text (a WARN is displayed in the log) and is removed from the response.
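The alignment rule above can be sketched in a few lines of Python. This is only an illustration, not the entity-fishing implementation: the helper names is_aligned and keep_aligned are hypothetical, and the check assumes offsets are character positions with an exclusive end, which is consistent with the sample output (0-7 for Austria).

```python
from __future__ import annotations

from typing import Any

TEXT = ("Austria invaded and fought the Serbian army at the Battle of Cer "
        "and Battle of Kolubara beginning on 12 August.")


def is_aligned(text: str, start: int, end: int) -> bool:
    """Return True when [start, end) is a non-empty span inside the text."""
    return 0 <= start < end <= len(text)


def keep_aligned(text: str, annotations: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Drop user annotations whose offsets are missing or out of bounds."""
    return [
        a for a in annotations
        if is_aligned(text, a.get("offsetStart", -1), a.get("offsetEnd", -1))
    ]


# The two user-provided annotations used in the test cases of this issue.
user_entities = [
    {"rawName": "German Army", "offsetStart": 31, "offsetEnd": 43,
     "wikipediaExternalRef": 11702744, "wikidataId": "Q701923"},
    {"rawName": "German Army", "offsetStart": 1107, "offsetEnd": 1118,
     "wikipediaExternalRef": 11702744, "wikidataId": "Q701923"},
]

kept = keep_aligned(TEXT, user_entities)
```

With these inputs, kept holds only the annotation at 31-43; the one at 1107-1118 lies outside the text and is dropped.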
The test for this issue was done as follows:
{
"text": "Austria invaded and fought the Serbian army at the Battle of Cer and Battle of Kolubara beginning on 12 August. ",
"shortText": "",
"termVector": [],
"language": {
"lang": "en"
},
"entities": [],
"mentions": [
"ner",
"wikipedia"
],
"nbest": false,
"sentence": false,
"customisation": "generic"
}
a. the mention is -> name + offsets in the text. Example: August
b. the entity is -> name + ids (wikipedia or wikidata) + offsets. Example: Austria, Serbian army, Battle of Cer, Battle of Kolubara
"entities": [
{
"rawName": "Austria",
"type": "LOCATION",
"offsetStart": 0,
"offsetEnd": 7,
"nerd_score": 1,
"nerd_selection_score": 0.761,
"wikipediaExternalRef": 26964606,
"wikidataId": "Q40",
"domains": [
"Atomic_Physic",
"Engineering",
"Administration",
"Geology",
"Oceanography",
"Earth"
]
},
{
"rawName": "Serbian army",
"offsetStart": 31,
"offsetEnd": 43,
"nerd_score": 0.7854,
"nerd_selection_score": 0.6443,
"wikipediaExternalRef": 10072531,
"wikidataId": "Q1209256",
"domains": [
"Military"
]
},
{
"rawName": "Battle of Cer",
"offsetStart": 51,
"offsetEnd": 64,
"nerd_score": 0.7854,
"nerd_selection_score": 0.6357,
"wikipediaExternalRef": 1614762,
"wikidataId": "Q697748",
"domains": [
"Military"
]
},
{
"rawName": "Battle of Kolubara",
"offsetStart": 69,
"offsetEnd": 87,
"nerd_score": 0.7854,
"nerd_selection_score": 0.6312,
"wikipediaExternalRef": 2167279,
"wikidataId": "Q682699",
"domains": [
"Military"
]
},
{
"rawName": "August",
"type": "PERIOD",
"offsetStart": 104,
"offsetEnd": 110,
"nerd_score": 0.8,
"nerd_selection_score": 0
}
]
}
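The distinction between a mention and an entity used in this test can be written down as a small classifier. This is a sketch under assumptions: the function classify and the label "context" (for an annotation without offsets, which per the discussion belongs in the customisation) are hypothetical names, not part of the entity-fishing API.

```python
def classify(annotation: dict) -> str:
    """Label an annotation following the definitions used in this issue.

    mention -> name + offsets
    entity  -> name + ids (wikipedia or wikidata) + offsets
    context -> no offsets (hypothetical label: meant for the customisation)
    """
    has_offsets = "offsetStart" in annotation and "offsetEnd" in annotation
    has_id = "wikipediaExternalRef" in annotation or "wikidataId" in annotation
    if not has_offsets:
        return "context"
    return "entity" if has_id else "mention"


august = {"rawName": "August", "type": "PERIOD",
          "offsetStart": 104, "offsetEnd": 110}
austria = {"rawName": "Austria", "offsetStart": 0, "offsetEnd": 7,
           "wikipediaExternalRef": 26964606, "wikidataId": "Q40"}
no_offsets = {"rawName": "German Army", "wikidataId": "Q701923"}

labels = [classify(a) for a in (august, austria, no_offsets)]
```

Here August is classified as a mention and Austria as an entity, matching the examples given for a. and b.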
Let's take an example in which the entity German army is forced onto the mention Serbian army in the text. For this test, the disambiguation of German army, with the wikipedia page 11702744 and wikidata id Q701923, was chosen for the mention Serbian army.
a. with correct offsets offsetStart: 31 and offsetEnd: 43. The result must be a new mention and/or entity that is forced by the user.
{
"text": "Austria invaded and fought the Serbian army at the Battle of Cer and Battle of Kolubara beginning on 12 August.",
"language": {
"lang": "en"
},
"entities": [
{
"rawName": "German Army",
"offsetStart": 31,
"offsetEnd": 43,
"wikipediaExternalRef": 11702744,
"wikidataId": "Q701923"
}
]
}
The result in JSON format shows that Serbian army in the text has been turned into a mention and an entity forced by the user, which has become German army.
{
"rawName": "German Army",
"offsetStart": 31,
"offsetEnd": 43,
"nerd_score": 1,
"nerd_selection_score": 0,
"wikipediaExternalRef": 11702744,
"wikidataId": "Q701923"
}
b. with incorrect offsets offsetStart: 1107 and offsetEnd: 1118 (the offset 1107-1118 is out of boundaries). The entity forced by the user with incorrect offsets should be ignored.
{
"text": "Austria invaded and fought the Serbian army at the Battle of Cer and Battle of Kolubara beginning on 12 August.",
"language": {
"lang": "en"
},
"entities": [
{
"rawName": "German Army",
"offsetStart": 1107,
"offsetEnd": 1118,
"wikipediaExternalRef": 11702744,
"wikidataId": "Q701923"
}
]
}
The result in JSON format shows that the Serbian army mention and entity stay as they are.
{
"rawName": "Serbian army",
"offsetStart": 31,
"offsetEnd": 43,
"nerd_score": 0.7854,
"nerd_selection_score": 0.6443,
"wikipediaExternalRef": 10072531,
"wikidataId": "Q1209256",
"domains": [
"Military"
]
}
Result: Pass
This issue is closed because all test cases passed.
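The behaviour observed in test case a (a user annotation replacing the system result on the same span) and the requirement that a user mention without ids stays in the output can be pictured with the following sketch. It is an assumption-laden illustration, not the actual entity-fishing code: apply_user_annotations is a hypothetical helper, it only handles exact span matches (overlapping spans are not considered), and it does not recompute any scores.

```python
def apply_user_annotations(system: list, user: list) -> list:
    """Let user annotations override system results that share the same span.

    User annotations on spans the system did not produce are appended as-is,
    so a user mention without ids stays in the output with no entity attached.
    """
    forced = {(u["offsetStart"], u["offsetEnd"]): u for u in user}
    merged = []
    for s in system:
        span = (s["offsetStart"], s["offsetEnd"])
        # Use the user annotation when one exists for this exact span.
        merged.append(forced.pop(span, s))
    merged.extend(forced.values())
    return sorted(merged, key=lambda a: a["offsetStart"])


system_entities = [
    {"rawName": "Austria", "offsetStart": 0, "offsetEnd": 7,
     "wikidataId": "Q40"},
    {"rawName": "Serbian army", "offsetStart": 31, "offsetEnd": 43,
     "wikidataId": "Q1209256"},
]
user_annotations = [
    {"rawName": "German Army", "offsetStart": 31, "offsetEnd": 43,
     "wikipediaExternalRef": 11702744, "wikidataId": "Q701923"},
    {"rawName": "August", "offsetStart": 104, "offsetEnd": 110},
]

result = apply_user_annotations(system_entities, user_annotations)
names = [a["rawName"] for a in result]
```

In this sketch the entry at 31-43 carries the user-provided ids, and the August mention is kept even though it has no wikipedia or wikidata id.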
As per the subject, this is a continuation of #20.
Example (when the wikipediaRefId is removed, the entity is not taken into consideration):