kermitt2 / entity-fishing

A machine learning tool for fishing entities
http://nerd.readthedocs.io/
Apache License 2.0

Entities supplied to the query shall be taken into consideration even when only the wikidata id is present #23

Closed lfoppiano closed 6 years ago

lfoppiano commented 7 years ago

as subject, as continuation of #20

Example (when the wikipediaExternalRef is removed, the entity is not taken into consideration):

   {
       "text": "Austria invaded and fought the Serbian army at the Battle of Cer and Battle of Kolubara beginning on 12 August.",
       "language": {
           "lang": "en"
       },
       "entities": [
            {
               "rawName": "German Army",
               "offsetStart": 1107,
               "offsetEnd": 1118,
               "wikipediaExternalRef": 11702744,
               "wikidataId": "Q701923"
            }
       ]
   }

kermitt2 commented 7 years ago

In addition, even without wikipediaExternalRef and wikidataId, the entity should also be "forced" as a user-specified mention (the mention has to be present in the result, disambiguated or not). I think this is actually necessary if we want to plug entity-fishing into GERBIL, as the evaluations are made only on entity disambiguation, with the mentions given in advance.

lfoppiano commented 6 years ago

Here the mentions are marked as coming from a user only when the wikipediaExternalRef is present.

A few questions:

  1. Are they to be kept as long as they are in the input list of entities?
  2. What if the mention (not disambiguated) is not in the input text, as in the example above?

What is the approach:

a. more constraints:

OR

b. fewer constraints: the user can input anything, a mention or an entity, with or without disambiguation information

kermitt2 commented 6 years ago

  1. Are they to be kept as long as they are in the input list of entities?

I assume you are talking about the mention? I would say the mention should be kept in the final output, even if no entity is found for the mention - the entity is left "NIL"

  2. What if the mention (not disambiguated) is not in the input text, as in the example above?

Do you mean: what if the mention description is not consistent with the input text (wrong offsets)? Or do you mean something else?

About a) -> no offset means not a mention, thus an entity that is "just here to help" should be passed through a customization, not as a provided annotation

-> for me, the mention description provided by the user has to be consistent with the input text (valid offset, valid span), otherwise it is discarded

b) The foreseen design and requirements are as follows: to pass an entity that is not an annotation, the user should use the customization; to pass a mention with or without a disambiguated entity, they should use user annotations (and the annotation description has to be valid with respect to the input text)

lfoppiano commented 6 years ago

ok so to summarise:

  1. we consider all entries from users, and they have to be present in the output (whether or not they contain wikipedia/wikidata ids)

  2. entities provided have to be consistent with the text; otherwise, use customisation.

In this case it's a neat solution. However, when processing a long text in chunks (paragraphs or groups of sentences), it would require supplying the whole text with each query, instead of supplying only the part to be processed together with the previously obtained annotations.

I'm fine to keep it as it is now, just to be sure we are on the same page.

kermitt2 commented 6 years ago

We can send only a part to be processed (and not the whole text) and provide the previously obtained entities in the customization as context - that was actually the idea for processing a long text, the client takes care of the segmentation (in paragraphs) and provides a context for the segment via the customization.
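
The client-side segmentation described here could look roughly like the sketch below. It is an illustration only: the helper names `split_paragraphs`, `build_segment_query` and `to_document_offsets` are hypothetical, paragraphs are assumed to be separated by blank lines, and the customisation name `my-context` is a placeholder for a customisation that the client is assumed to have set up with the previously obtained entities. Only query fields already shown in this thread are used.

```python
from typing import Dict, List, Tuple


def split_paragraphs(text: str) -> List[Tuple[int, str]]:
    """Return (start_offset, paragraph) pairs, assuming blank-line separators."""
    segments = []
    pos = 0
    for part in text.split("\n\n"):
        if part.strip():
            segments.append((pos, part))
        pos += len(part) + 2  # advance past the paragraph and its "\n\n" separator
    return segments


def build_segment_query(paragraph: str, lang: str, customisation: str) -> Dict:
    """Build a query for one segment; the context is carried by the customisation."""
    return {
        "text": paragraph,
        "language": {"lang": lang},
        "entities": [],
        # Hypothetical customisation name, assumed to hold the context entities.
        "customisation": customisation,
    }


def to_document_offsets(entity: Dict, segment_start: int) -> Dict:
    """Shift segment-relative offsets back to offsets in the whole document."""
    shifted = dict(entity)
    shifted["offsetStart"] = entity["offsetStart"] + segment_start
    shifted["offsetEnd"] = entity["offsetEnd"] + segment_start
    return shifted


document = "Austria invaded.\n\nSerbia resisted."
queries = [
    (start, build_segment_query(paragraph, "en", "my-context"))
    for start, paragraph in split_paragraphs(document)
]
```

Each query then contains only its own paragraph, so any offsets returned for it are relative to that paragraph; `to_document_offsets` maps them back to positions in the full text.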

lfoppiano commented 6 years ago

OK, then point 2 cannot be applied as it is...

We could then say: mentions have to be consistent with the text, whereas entities can also be "there to help".

lfoppiano commented 6 years ago

After further discussion and thought, we can summarise:

when a mention or an entity is not correctly aligned with the text (wrong offset) it's ignored

lfoppiano commented 6 years ago

Regarding the example in the first comment, the entity is ignored because it lies outside the boundaries of the text (a WARN is displayed in the log) and is removed from the response.
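
The rule can be illustrated with a minimal sketch. This is not the entity-fishing implementation; the function names `is_aligned` and `filter_user_entities` are hypothetical, and the check is assumed to cover only the offset boundaries (it does not compare `rawName` with the text at that span).

```python
import logging
from typing import Dict, List

logger = logging.getLogger(__name__)


def is_aligned(text: str, entity: Dict) -> bool:
    """True when the entity has integer offsets describing a non-empty span inside text."""
    start = entity.get("offsetStart")
    end = entity.get("offsetEnd")
    if not isinstance(start, int) or not isinstance(end, int):
        return False
    return 0 <= start < end <= len(text)


def filter_user_entities(text: str, entities: List[Dict]) -> List[Dict]:
    """Keep aligned entries; log a warning for each entry that is dropped."""
    kept = []
    for entity in entities:
        if is_aligned(text, entity):
            kept.append(entity)
        else:
            logger.warning("Ignoring user entity outside the text: %s", entity.get("rawName"))
    return kept


text = ("Austria invaded and fought the Serbian army at the Battle of Cer "
        "and Battle of Kolubara beginning on 12 August.")
user_entities = [
    # Entry from the first comment: its offsets exceed the length of the text.
    {"rawName": "German Army", "offsetStart": 1107, "offsetEnd": 1118,
     "wikipediaExternalRef": 11702744, "wikidataId": "Q701923"},
    # Entry whose offsets point at "Serbian army" in the text.
    {"rawName": "Serbian army", "offsetStart": 31, "offsetEnd": 43},
]
kept = filter_user_entities(text, user_entities)
```

With the text from the first comment, the entry at offsets 1107 to 1118 is dropped, while the entry at offsets 31 to 43 is kept even though it carries no ids.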

tantikristanti commented 6 years ago

The test for this issue was done as follows:

  1. Test case: no entity is supplied by the user, in order to see the list of mentions and entities returned by the service. The query uses the default format, as follows:
    {
    "text": "Austria invaded and fought the Serbian army at the Battle of Cer and Battle of Kolubara beginning on 12 August. ",
    "shortText": "",
    "termVector": [],
    "language": {
        "lang": "en"
    },
    "entities": [],
    "mentions": [
        "ner",
        "wikipedia"
    ],
    "nbest": false,
    "sentence": false,
    "customisation": "generic"
    }

    a. a mention is -> name + offsets in the text. Example: August
    b. an entity is -> name + ids (wikipedia or wikidata) + offsets. Example: Austria, Serbian army, Battle of Cer, Battle of Kolubara

    "entities": [
        {
            "rawName": "Austria",
            "type": "LOCATION",
            "offsetStart": 0,
            "offsetEnd": 7,
            "nerd_score": 1,
            "nerd_selection_score": 0.761,
            "wikipediaExternalRef": 26964606,
            "wikidataId": "Q40",
            "domains": [
                "Atomic_Physic",
                "Engineering",
                "Administration",
                "Geology",
                "Oceanography",
                "Earth"
            ]
        },
        {
            "rawName": "Serbian army",
            "offsetStart": 31,
            "offsetEnd": 43,
            "nerd_score": 0.7854,
            "nerd_selection_score": 0.6443,
            "wikipediaExternalRef": 10072531,
            "wikidataId": "Q1209256",
            "domains": [
                "Military"
            ]
        },
        {
            "rawName": "Battle of Cer",
            "offsetStart": 51,
            "offsetEnd": 64,
            "nerd_score": 0.7854,
            "nerd_selection_score": 0.6357,
            "wikipediaExternalRef": 1614762,
            "wikidataId": "Q697748",
            "domains": [
                "Military"
            ]
        },
        {
            "rawName": "Battle of Kolubara",
            "offsetStart": 69,
            "offsetEnd": 87,
            "nerd_score": 0.7854,
            "nerd_selection_score": 0.6312,
            "wikipediaExternalRef": 2167279,
            "wikidataId": "Q682699",
            "domains": [
                "Military"
            ]
        },
        {
            "rawName": "August",
            "type": "PERIOD",
            "offsetStart": 104,
            "offsetEnd": 110,
            "nerd_score": 0.8,
            "nerd_selection_score": 0
        }
    ]
    }
  2. Test case: an entity is supplied by the user

As an example, let's force the German Army entity onto the "Serbian army" mention in the text. For this, the disambiguation of German Army (wikipedia page 11702744, wikidata id Q701923) is supplied for the mention Serbian army.

a. with correct offsets

b. with incorrect offsets
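
The distinction drawn in the first test case between a mention (name and offsets only) and an entity (name, offsets and at least one id) can be expressed as a short sketch over the response entries. The function names are hypothetical, and the sample entries are trimmed copies of two items from the response shown above.

```python
from typing import Dict, List, Tuple


def is_disambiguated(entry: Dict) -> bool:
    """An entry counts as an entity when it carries a wikipedia or wikidata id."""
    return "wikipediaExternalRef" in entry or "wikidataId" in entry


def split_mentions_entities(entries: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Separate plain mentions from disambiguated entities, preserving order."""
    mentions = [e for e in entries if not is_disambiguated(e)]
    entities = [e for e in entries if is_disambiguated(e)]
    return mentions, entities


response_entries = [
    {"rawName": "Austria", "offsetStart": 0, "offsetEnd": 7,
     "wikipediaExternalRef": 26964606, "wikidataId": "Q40"},
    {"rawName": "August", "offsetStart": 104, "offsetEnd": 110},
]
mentions, entities = split_mentions_entities(response_entries)
```

Here "August" ends up in `mentions` and "Austria" in `entities`, matching the examples given for cases a and b of the first test case.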

tantikristanti commented 6 years ago

This issue is closed because all test cases pass.