kermitt2 / grobid-ner

A Named-Entity Recogniser based on Grobid.
https://grobid-ner.readthedocs.io
Apache License 2.0

Annotation consistency #32

Closed wigdan closed 7 years ago

wigdan commented 7 years ago

To be consistent, what would be the correct annotation to adopt for NPs that combine a "NATIONAL" (such as Australian, Indonesian), a "TITLE" (such as Prime Minister, President) and a "PERSON" (such as Malcolm Turnbull, Joko Widodo)?

Below are two examples from the corpus:

 <ENAMEX type="NATIONAL">Australian</ENAMEX> <ENAMEX type="TITLE">Prime Minister</ENAMEX>
 <ENAMEX type="PERSON">Malcolm Turnbull</ENAMEX>

<ENAMEX type="NATIONAL">Indonesian president</ENAMEX>
 <ENAMEX type="PERSON">Joko Widodo</ENAMEX>

Should we stress "TITLE" and annotate titles such as Prime Minister, as in the first example, together with "NATIONAL" for Australian, or stress "NATIONAL" and thus annotate only Indonesian as "NATIONAL" in Indonesian president?

kermitt2 commented 7 years ago

So following the longest entity match principle, it would be:

<ENAMEX type="PERSON">Australian Prime Minister Malcolm Turnbull</ENAMEX>

<ENAMEX type="PERSON">Indonesian president Joko Widodo</ENAMEX>

The overall type is given by the entity which is the semantic head of the NP.

I know this principle is not exciting for your examples, but otherwise there would not be clear criteria for driving "flat" annotations.

Just to be clear: the best solution would be to annotate the entity and its sub-entity components exhaustively and recursively. However, we could not train a sequence labeller on this, and it would be a pain for annotators. The longest entity matching principle lets us decide what to annotate in a consistent manner, makes the NER module easy to train and to apply for further disambiguation, and is robust with respect to agglutinations.

In addition, it will always be possible to reapply a NER trained a bit differently to analyse the sub-entities of a "multi-word expression" entity - if necessary.
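A minimal sketch of what "flat" longest-match annotation means in practice, assuming entities are represented as token spans: given a set of nested candidate spans, only the outermost one survives. The `Span` type and the `keep_longest` function are hypothetical names used for illustration, not part of grobid-ner. Note that the label of the outer span (here PERSON, the semantic head) still has to be supplied by the annotator; the code does not infer it.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int  # index of the first token (inclusive)
    end: int    # index after the last token (exclusive)
    label: str


def keep_longest(spans: List[Span]) -> List[Span]:
    """Keep only the spans that are not nested inside another span."""
    # Sort by start, longest first, so a container always precedes its contents.
    ordered = sorted(spans, key=lambda s: (s.start, -(s.end - s.start)))
    kept: List[Span] = []
    for span in ordered:
        if kept and span.end <= kept[-1].end:
            continue  # nested inside the last kept span: drop it
        kept.append(span)
    return kept


# Tokens: Australian(0) Prime(1) Minister(2) Malcolm(3) Turnbull(4)
candidates = [
    Span(0, 1, "NATIONAL"),
    Span(1, 3, "TITLE"),
    Span(3, 5, "PERSON"),
    Span(0, 5, "PERSON"),  # whole NP, typed by its semantic head
]
flat = keep_longest(candidates)
```

Here `flat` contains only the span covering the whole NP; the dropped sub-entities could still be recovered later by re-applying a NER inside that span, as suggested above.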

everzeni commented 7 years ago

What about cases like German-occupied Poland, in which occupied is not a NE? Do we annotate two separate entities?

<ENAMEX type="NATIONAL">German</ENAMEX>-occupied <ENAMEX type="LOCATION">Poland</ENAMEX>

In the same spirit: Republican presidential candidate Donald Trump (presidential candidate is not a NE).

kermitt2 commented 7 years ago

I think to be consistent indeed (still based on the longest match principle), we should annotate two different entities in each of your two examples, as you did for the first one.

everzeni commented 7 years ago

Ok thanks!

What about coordinations, for example:

Though the vast majority of the Jews affected and killed during Holocaust were of Ashkenazi descent, Sephardi and Mizrahi Jews suffered greatly as well.

1) <ENAMEX type="PERSON_TYPE">Sephardi</ENAMEX> and <ENAMEX type="PERSON_TYPE">Mizrahi Jews</ENAMEX>
or
2) <ENAMEX type="PERSON_TYPE">Sephardi and Mizrahi Jews</ENAMEX>
?
kermitt2 commented 7 years ago

Ah yes this is a point to raise!

I would go with the second choice which is simpler and more consistent with the above-mentioned stuff I think - the idea that we consider the "global" entity only.

For the next step of disambiguation, coordinations are anyway a special case.

everzeni commented 7 years ago

I have a doubt in the following case:

During (...) the war some 900 Jews (...) passed through the Banjica concentration camp

1) <ENAMEX type="PERSON_TYPE">900 Jews</ENAMEX>
or
2) <ENAMEX type="MEASURE">900</ENAMEX> <ENAMEX type="PERSON_TYPE">Jews</ENAMEX>
?
lfoppiano commented 7 years ago

I would say the second one.

kermitt2 commented 7 years ago

Largest entity mention principle -> 1)

lfoppiano commented 7 years ago

well, in this case 900 isn't the quantity?

kermitt2 commented 7 years ago

semantic head is "Jews", so the whole NP would be PERSON_TYPE - I am applying the annotation principle blindly :D

lfoppiano commented 7 years ago

I still think that, for this specific case (MEASURE), the quantity is not a characteristic of the specific entity (unlike, for example, the name of the prime minister or the fact that he is Australian).

Intuitively, it seems that any quantification/measure next to a NE should be annotated separately.

See previously annotated examples:

was O
a   O
genocide    O
in  O
which   O
Adolf   B-PERSON
Hitler  PERSON
'   O
s   O
[...]
and O
its O
collaborators   O
killed  O
about   O
six B-MEASURE
million MEASURE
Jews    B-PERSON_TYPE
.   O

or

outnumbered O   O
the O   O
British ORGANISATION    military_organization/N1
Army    ORGANISATION    military_organization/N1
at  O   O
the O   O
beginning   O   O
of  O   O
the O   O
war O   O
;   O   O
about   O   O
1   MEASURE rational_number/N1
.   MEASURE rational_number/N1
3   MEASURE rational_number/N1
million MEASURE rational_number/N1
Indian  NATIONAL    jurisdictional_cultural_adjective/J1
soldiers    O   O
and O   O
labourers   O   O
served  O   O
in  O   O
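For readers comparing the two notations used in this thread, here is a rough sketch of how inline ENAMEX markup could be turned into the token/label rows of the first listing above (first token of an entity prefixed with `B-`, following tokens carrying the bare label). It assumes whitespace tokenisation and non-nested, well-formed tags; the function name `enamex_to_labels` is hypothetical and this is not the actual grobid-ner conversion code.

```python
import re
from typing import List, Tuple

# Matches one flat (non-nested) ENAMEX element and captures its type and text.
ENAMEX = re.compile(r'<ENAMEX type="([A-Z_]+)">(.*?)</ENAMEX>', re.S)


def enamex_to_labels(text: str) -> List[Tuple[str, str]]:
    """Convert inline ENAMEX markup to (token, label) rows."""
    rows: List[Tuple[str, str]] = []
    pos = 0
    for match in ENAMEX.finditer(text):
        # Everything before the entity is outside any entity.
        rows += [(tok, "O") for tok in text[pos:match.start()].split()]
        label = match.group(1)
        for i, tok in enumerate(match.group(2).split()):
            rows.append((tok, "B-" + label if i == 0 else label))
        pos = match.end()
    rows += [(tok, "O") for tok in text[pos:].split()]
    return rows


rows = enamex_to_labels(
    'killed about <ENAMEX type="MEASURE">six million</ENAMEX> '
    '<ENAMEX type="PERSON_TYPE">Jews</ENAMEX> .'
)
```

With the input above, `rows` reproduces the end of the first listing: `six` and `million` as `B-MEASURE` / `MEASURE`, and `Jews` as `B-PERSON_TYPE`.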
kermitt2 commented 7 years ago

"900 Jews", "British Jews", "Jews", it's all semantically a group of persons, it's the definition of the semantic head. The quantity 900 is a modifier, similar as a determiner or an adjective. I think it's hard to motivate an exception in the largest entity principle for MEASURE, because how can we define which attributes are characteristic or not of an entity?

What about a case like the "President of the 500 senators"?

lfoppiano commented 7 years ago

I think that the class MEASURE is actually the exception here (at least the only one I can think of now). When a MEASURE appears as a modifier/quantifier, it should not be annotated using the largest entity match.

For example: the 2 Kill Bill movies had been a success, or the two presidents of the republic.

In the example you mentioned, 500 is not only a number, it is also part of the named entity (like the assembly of the 10), which could be considered an 'institution' in its own right, so I think in this case it should not be annotated as MEASURE.

kermitt2 commented 7 years ago

But why? What is the motivation/reason to consider quantities as exception in the largest entity principle?

Other example then: "the doctor of the two presidents of the republic"

lfoppiano commented 7 years ago

The reason is that it's a modifier of the quantity, not a characteristic of the entity. Anyway, let's drop this discussion and not consider this exception, even though I'm not convinced :-)

everzeni commented 7 years ago

Following our discussion (on May 2), we decided to make an exception to the longest entity match rule: we will annotate MEASURE separately only when it is at the beginning of the NP, for example:

<ENAMEX type="MEASURE">45</ENAMEX> <ENAMEX type="PERSON">presidents of the United States</ENAMEX>
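The exception can be illustrated with a small sketch that splits a MEASURE off only when it opens the NP. The quantity test (`NUMBER_WORDS`, `NUMERIC`) is a deliberately tiny, hypothetical heuristic, and `split_leading_measure` is an illustrative name; the label of the remaining NP is assumed to be given by the annotator.

```python
import re
from typing import List, Tuple

# Hypothetical, intentionally small notion of "looks like a quantity".
NUMBER_WORDS = {
    "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "hundred", "thousand", "million", "billion",
}
NUMERIC = re.compile(r"^\d+([.,]\d+)*$")


def is_quantity(token: str) -> bool:
    return bool(NUMERIC.match(token)) or token.lower() in NUMBER_WORDS


def split_leading_measure(tokens: List[str], head_label: str) -> List[Tuple[str, str]]:
    """Split off a MEASURE only if it is at the very beginning of the NP."""
    n = 0
    while n < len(tokens) and is_quantity(tokens[n]):
        n += 1
    if n == 0 or n == len(tokens):
        # No leading quantity (or nothing but a quantity): keep one entity.
        return [(" ".join(tokens), head_label)]
    return [(" ".join(tokens[:n]), "MEASURE"), (" ".join(tokens[n:]), head_label)]


leading = split_leading_measure("45 presidents of the United States".split(), "PERSON")
inner = split_leading_measure("President of the 500 senators".split(), "PERSON")
```

In this sketch `leading` is split into a MEASURE and a PERSON, while `inner` stays a single PERSON entity because the quantity is not at the beginning of the NP.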
ebenaissa commented 7 years ago

Given the following paragraph:

Meanwhile, <ENAMEX type="PERSON">Nigel Farage</ENAMEX>, <ENAMEX type="TITLE">leader</ENAMEX>
of the anti-<ENAMEX type="INSTITUTION">EU UKIP</ENAMEX> stood down after his party's
long-term ambition had been accomplished.

I have three questions:

1) Following the largest entity rule, should we annotate "leader of the anti-EU UKIP" as PERSON, like "president of the US"?
2) If we annotate "leader of the anti-EU UKIP" as PERSON, should we group it with the preceding entity "Nigel Farage" (including the comma)?
3) If we annotate them separately, with EU UKIP as INSTITUTION, is the word "leader" alone a TITLE entity?

lfoppiano commented 7 years ago

@ebenaissa following the largest entity match I would annotate the entire entity:

Meanwhile, <ENAMEX type="PERSON">Nigel Farage, leader of the anti-EU UKIP</ENAMEX> stood down after his party's long-term ambition had been accomplished.

However, I'm wondering whether the comma between Nigel Farage and the leader of... should be considered a strong motivation to split the entity. If the answer is yes, then it would be something like:

Meanwhile, <ENAMEX type="PERSON">Nigel Farage</ENAMEX>, <ENAMEX type="TITLE">leader of the anti-EU UKIP</ENAMEX> stood down after his party's long-term ambition had been accomplished.

I have annotated leader of the anti-EU UKIP as TITLE because we discussed the idea of giving TITLE priority over PERSON (see President of the US), as it is a more specific type (to be decided in #33, I think).

Regarding your third point, EU is an institution, but UKIP is an ORGANISATION I think (a group of people)

lfoppiano commented 7 years ago

After a short discussion, we agreed that the comma is not a motivation for splitting the entity.

The annotation is then:

Meanwhile, <ENAMEX type="PERSON">Nigel Farage, leader of the anti-EU UKIP</ENAMEX> stood down after his party's long-term ambition had been accomplished.

The same sentence could be inverted and annotated in the same way:

Meanwhile, <ENAMEX type="PERSON">the leader of the anti-EU UKIP, Nigel Farage</ENAMEX> stood down after his party's long-term ambition had been accomplished.

or (without comma):

Meanwhile, <ENAMEX type="PERSON">the leader of the anti-EU UKIP Nigel Farage</ENAMEX> stood down after his party's long-term ambition had been accomplished.

@kermitt2 what do you think?

kermitt2 commented 7 years ago

I think the comma here has the same role as a function word: it introduces an apposition. So I would use the normal longest entity principle, just as with "of" in President of the US, which does not split the entity.

lfoppiano commented 7 years ago

OK, so you are saying not to split the entity 👍

everzeni commented 7 years ago

I add here the example from issue #36, for reference

While O attending O the O May B-EVENT 2012 EVENT NATO EVENT summit EVENT meeting EVENT

everzeni commented 7 years ago

For the documentation of the longest entity match, we should also mention this.

wigdan commented 7 years ago

To continue with the coordination issue raised above:

Though the vast majority of the Jews affected and killed during Holocaust were of Ashkenazi descent, Sephardi and Mizrahi Jews suffered greatly as well.

Where we opted for this annotation:

<ENAMEX type="PERSON_TYPE">Sephardi and Mizrahi Jews</ENAMEX>

what would be the correct annotation for this structure:

<ENAMEX type="MEASURE">first</ENAMEX> time a party other than the <ENAMEX type="PERSON_TYPE">Conservatives</ENAMEX> or <ENAMEX type="PERSON_TYPE">Labour</ENAMEX> had topped a nationwide poll in <ENAMEX type="PERIOD">108 years</ENAMEX>

OR

<ENAMEX type="MEASURE">first</ENAMEX> time a party other than the <ENAMEX type="PERSON_TYPE">Conservatives or Labour</ENAMEX> had topped a nationwide poll in <ENAMEX type="PERIOD">108 years</ENAMEX>

lfoppiano commented 7 years ago

Your question was already addressed in a comment above.

everzeni commented 7 years ago

But I don't think it's exactly the same problem, in the comment above we had a coordination of two entities (Sephardi and Mizrahi) both related to another one (Jews). In the issue Wigdan raises, we have two well separated entities (Conservatives / Labour).

I don't think we should annotate them together. Imagine we had "New-York and Paris are big cities": would we annotate them as one entity?

wigdan commented 7 years ago

i agree, that's why i introduced the issue as "raised up"... so, if i understand well, we annotate the coordination between "Conservative or Labour" as PERSON_TYPE. That was my intuition, but i wanted to double check.

wigdan commented 7 years ago

that was a response to Luca!

lfoppiano commented 7 years ago

Don't worry I understood. I think @everzeni is right.

wigdan commented 7 years ago

i think they talk about "Party" (first time a party other than the Conservative or Labour)

<ENAMEX type="MEASURE">first</ENAMEX> time a party other than the <ENAMEX type="PERSON_TYPE">Conservatives or Labour</ENAMEX> had topped a nationwide poll in <ENAMEX type="PERIOD">108 years</ENAMEX>

I thought then the structure means (Conservative or Labour Party), it thus joins the analysis of (Sephardi and Mizrahi Jews), where the head of the NP is Jews, Sephardi and Mizrahi come as modifiers.

everzeni commented 7 years ago

I agree that if we had Conservative or Labour party we would annotate it as one entity. But I would say it's different here because:

kermitt2 commented 7 years ago

Personally, I would apply the largest entity principle strictly and uniformly, so that the NER can later be cascaded inside the entity in a uniform way, so:

<ENAMEX type="LOCATION">New-York and Paris</ENAMEX> are big cities.
... a party other than the <ENAMEX type="PERSON_TYPE">Conservatives or Labour</ENAMEX> had topped a nationwide poll...

This avoids relatively difficult analysis about what is an independent entity or not.
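A minimal sketch of this strict reading, assuming entities are token spans and that a coordination is two same-typed entities separated by exactly one coordinator token. The `COORDINATORS` set and the `merge_coordinations` function are hypothetical illustrations, not grobid-ner code.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start token, end token exclusive, label)

# Assumption: only these two coordinators are considered.
COORDINATORS = {"and", "or"}


def merge_coordinations(tokens: List[str], spans: List[Span]) -> List[Span]:
    """Merge same-typed entities joined by a single coordinator into one span."""
    merged: List[Span] = []
    for start, end, label in sorted(spans):
        if merged:
            prev_start, prev_end, prev_label = merged[-1]
            gap = tokens[prev_end:start]  # tokens between the two entities
            if (prev_label == label and len(gap) == 1
                    and gap[0].lower() in COORDINATORS):
                merged[-1] = (prev_start, end, label)  # extend previous span
                continue
        merged.append((start, end, label))
    return merged


tokens = "New-York and Paris are big cities".split()
result = merge_coordinations(tokens, [(0, 1, "LOCATION"), (2, 3, "LOCATION")])
```

Here `result` is a single LOCATION span covering "New-York and Paris". Entities with different labels, or separated by anything other than one coordinator, are left untouched, which keeps the rule mechanical and avoids judging whether the coordinated entities are independent.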

lfoppiano commented 7 years ago

From my point of view, it looks fine.

everzeni commented 7 years ago

I reopen this issue briefly, because it seems to us that we were wrong before about:

<ENAMEX type="NATIONAL">German</ENAMEX>-occupied <ENAMEX type="LOCATION">Poland</ENAMEX>

It seems to us now that the whole thing should be annotated as one LOCATION, because it refers to one territory with clear delimitations (set by Germany after it invaded Poland). Is it ok with you?

kermitt2 commented 7 years ago

Absolutely @everzeni !