wigdan closed this 7 years ago
So following the longest entity match principle, it would be:
<ENAMEX type="PERSON">Australian Prime Minister Malcolm Turnbull</ENAMEX>
<ENAMEX type="PERSON">Indonesian president Joko Widodo</ENAMEX>
The overall type is given by the entity which is the semantic head of the NP.
I know this principle is not exciting for your examples, but otherwise there would not be clear criteria for driving "flat" annotations.
Just to be clear: the best solution would be to annotate the entity and its sub-entity components exhaustively/recursively. However, we could not train a sequence labeller on this and it would be a pain for annotators. The longest entity match principle lets us decide what to annotate in a consistent manner, makes the NER module easy to train and to apply for further disambiguation, and is robust with respect to agglutinations.
In addition, it will always be possible to reapply a NER trained a bit differently to analyse the sub-entities of a "multi-word expression" entity - if necessary.
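To make the longest entity match principle concrete, here is a minimal sketch in Python. It assumes candidate entities are given as token spans `(start, end, type)` with an exclusive end; the function name `keep_longest` and this span representation are illustrative assumptions, not part of the project's tooling.

```python
from typing import List, Tuple

# (start token index, end token index exclusive, entity type)
Span = Tuple[int, int, str]

def keep_longest(spans: List[Span]) -> List[Span]:
    """Drop every candidate span that is contained in a strictly longer one."""
    kept = []
    for start, end, etype in spans:
        contained = any(
            o_start <= start and end <= o_end and (o_end - o_start) > (end - start)
            for o_start, o_end, _ in spans
        )
        if not contained:
            kept.append((start, end, etype))
    return sorted(kept)

# Tokens: Australian(0) Prime(1) Minister(2) Malcolm(3) Turnbull(4)
candidates = [
    (0, 1, "NATIONAL"),
    (1, 3, "TITLE"),
    (3, 5, "PERSON"),
    (0, 5, "PERSON"),  # whole NP, typed by its semantic head
]
flat = keep_longest(candidates)
```

Only the outermost span survives, which is what keeps the annotation "flat" and usable for a sequence labeller.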
What about cases like "German-occupied Poland", in which "occupied" is not a NE? Do we annotate two separate entities?
<ENAMEX type="NATIONAL">German</ENAMEX>-occupied <ENAMEX type="LOCATION">Poland</ENAMEX>
In the same spirit: "Republican presidential candidate Donald Trump" ("presidential candidate" is not a NE)
I think to be consistent indeed (still based on the longest match principle), we should annotate two different entities in each of your two examples, as you did for the first one.
Ok thanks!
What about coordinations, for example:
Though the vast majority of the Jews affected and killed during Holocaust were of Ashkenazi descent, Sephardi and Mizrahi Jews suffered greatly as well.
1) <ENAMEX type="PERSON_TYPE">Sephardi</ENAMEX> and <ENAMEX type="PERSON_TYPE">Mizrahi Jews</ENAMEX>
or
2) <ENAMEX type="PERSON_TYPE">Sephardi and Mizrahi Jews</ENAMEX>
?
Ah yes this is a point to raise!
I would go with the second choice which is simpler and more consistent with the above-mentioned stuff I think - the idea that we consider the "global" entity only.
For the next step of disambiguation, coordinations are anyway a special case.
I have a doubt in the following case:
During (...) the war some 900 Jews (...) passed through the Banjica concentration camp
1) <ENAMEX type="PERSON_TYPE">900 Jews</ENAMEX>
or
2) <ENAMEX type="MEASURE">900</ENAMEX> <ENAMEX type="PERSON_TYPE">Jews</ENAMEX>
?
I would say the second one.
Largest entity mention principle -> 1)
well, in this case 900 isn't the quantity?
semantic head is "Jews", so the whole NP would be PERSON_TYPE
- I am applying the annotation principle blindly :D
I still think that, for this specific case (MEASURE), the quantity is not a characteristic of the specific entity (unlike, for example, the name of the prime minister or the fact that he is Australian).
Intuitively, it seems that any quantification/measure next to a NE should be annotated separately.
See previously annotated examples:
was O
a O
genocide O
in O
which O
Adolf B-PERSON
Hitler PERSON
' O
s O
[...]
and O
its O
collaborators O
killed O
about O
six B-MEASURE
million MEASURE
Jews B-PERSON_TYPE
. O
or
outnumbered O O
the O O
British ORGANISATION military_organization/N1
Army ORGANISATION military_organization/N1
at O O
the O O
beginning O O
of O O
the O O
war O O
; O O
about O O
1 MEASURE rational_number/N1
. MEASURE rational_number/N1
3 MEASURE rational_number/N1
million MEASURE rational_number/N1
Indian NATIONAL jurisdictional_cultural_adjective/J1
soldiers O O
and O O
labourers O O
served O O
in O O
"900 Jews", "British Jews", "Jews": it's all semantically a group of persons, that's the definition of the semantic head. The quantity 900 is a modifier, similar to a determiner or an adjective. I think it's hard to motivate an exception to the largest entity principle for MEASURE, because how can we define which attributes are characteristic of an entity or not?
What about a case like "President of the 500 senators"?
I think that the class MEASURE is actually the exception to this rule (at least the only one I can think of now). When MEASURE appears as a modifier/quantifier, it should not fall under the largest entity matching.
For example "the 2 Kill Bill movies had been a success" or "the two presidents of the republic".
In the example you mentioned, 500 doesn't mean only the number; it's also the named entity (like "the assembly of the 10"), which could be considered an 'institution' on its own, so I think in this case it is not to be annotated as MEASURE.
But why? What is the motivation/reason to consider quantities as exception in the largest entity principle?
Other example then: "the doctor of the two presidents of the republic"
The reason is that it's a modifier of the quantity, not a characteristic of the entity. Anyway, let's drop this discussion and not consider this exception, even though I'm not convinced :-)
Following our discussion (on May 2), we decided to make an exception to the longest entity match rule: we will annotate MEASURE separately only when it is at the beginning of the NP, for example:
<ENAMEX type="MEASURE">45</ENAMEX> <ENAMEX type="PERSON">presidents of the United States</ENAMEX>
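The MEASURE exception can be sketched as a small post-processing step on an NP that has already been matched as one entity. This is only an illustration under stated assumptions: `looks_like_measure`, `NUMBER_WORDS` and `split_leading_measure` are hypothetical helpers, the number-word list is deliberately tiny, and the head type is supplied by the caller rather than computed.

```python
from typing import List, Tuple

# Illustrative only: a real annotation pass would need a fuller list.
NUMBER_WORDS = {"two", "six", "million"}

def looks_like_measure(token: str) -> bool:
    """Crude test: digits (with at most one dot) or a known number word."""
    return token.replace(".", "", 1).isdigit() or token.lower() in NUMBER_WORDS

def split_leading_measure(tokens: List[str], head_type: str) -> List[Tuple[str, str]]:
    """Split off a MEASURE only if it opens the NP; otherwise keep one entity."""
    i = 0
    while i < len(tokens) and looks_like_measure(tokens[i]):
        i += 1
    result = []
    if i > 0:
        result.append((" ".join(tokens[:i]), "MEASURE"))
    if i < len(tokens):
        result.append((" ".join(tokens[i:]), head_type))
    return result

leading = split_leading_measure("45 presidents of the United States".split(), "PERSON")
# A quantity inside the NP is not split off; the type here is passed in by the caller.
inner = split_leading_measure("President of the 500 senators".split(), "PERSON")
```

With this rule, "900 Jews" yields a MEASURE followed by a PERSON_TYPE, while a number that appears later in the NP stays inside the longest match.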
Regarding the following paragraph:
Meanwhile, <ENAMEX type="PERSON">Nigel Farage</ENAMEX>, <ENAMEX type="TITLE">leader</ENAMEX>
of the anti-<ENAMEX type="INSTITUTION">EU UKIP</ENAMEX> stood down after his party's
long-term ambition had been accomplished.
I have three questions: 1) Following the largest entity rule, should we annotate "leader of the anti-EU UKIP" as PERSON, like "president of the US"? 2) If we annotate "leader of the anti-EU UKIP" as PERSON, should we group it with the preceding entity "Nigel Farage" (with the comma)? 3) And if we annotate separately, with EU UKIP as INSTITUTION, is the word "leader" alone a TITLE entity?
@ebenaissa following the largest entity match I would annotate the entire entity:
Meanwhile, <ENAMEX type="PERSON">Nigel Farage, leader of the anti-EU UKIP</ENAMEX> stood down after his party's long-term ambition had been accomplished.
However, I'm wondering whether having the comma between "Nigel Farage" and "the leader of..." has to be considered a strong motivation to split the entity.
If the answer is yes, then it would be something like:
Meanwhile, <ENAMEX type="PERSON">Nigel Farage</ENAMEX>, <ENAMEX type="TITLE">leader of the anti-EU UKIP</ENAMEX> stood down after his party's long-term ambition had been accomplished.
I have annotated "leader of the anti-EU UKIP" as TITLE, as we've discussed the idea of giving TITLE priority over PERSON (see "President of the US"), since it's a specificity of it (to be decided in #33, I think).
Regarding your third point, EU is an institution, but UKIP is an ORGANISATION, I think (a group of people).
After a short discussion we agreed that the comma is not a motivation for splitting the entity.
The annotation is then:
Meanwhile, <ENAMEX type="PERSON">Nigel Farage, leader of the anti-EU UKIP</ENAMEX> stood down after his party's long-term ambition had been accomplished.
The same sentence could be inverted and annotated in the same way:
Meanwhile, <ENAMEX type="PERSON">the leader of the anti-EU UKIP, Nigel Farage</ENAMEX> stood down after his party's long-term ambition had been accomplished.
or (without comma):
Meanwhile, <ENAMEX type="PERSON">the leader of the anti-EU UKIP Nigel Farage</ENAMEX> stood down after his party's long-term ambition had been accomplished.
@kermitt2 what do you think?
I think the comma here has the same role as a function word: it introduces an apposition. So I would use the normal longest entity principle, similarly to "of" in "President of the US", which does not split the entity.
OK, so you are saying not to split the entity 👍
I add here the example from issue #36, for reference
While O
attending O
the O
May B-EVENT
2012 EVENT
NATO EVENT
summit EVENT
meeting EVENT
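The token/label examples in this thread mark the first token of an entity with a `B-` prefix and the following tokens with the bare type. A minimal sketch of producing that layout from entity spans is given below; the `(start, end, type)` span format and the name `spans_to_labels` are assumptions made for illustration.

```python
from typing import List, Tuple

def spans_to_labels(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Label tokens as in the examples: B-TYPE on the first token, TYPE after, O elsewhere."""
    labels = ["O"] * len(tokens)
    for start, end, etype in spans:  # end is exclusive
        labels[start] = "B-" + etype
        for i in range(start + 1, end):
            labels[i] = etype
    return labels

tokens = "While attending the May 2012 NATO summit meeting".split()
labels = spans_to_labels(tokens, [(3, 8, "EVENT")])

for token, label in zip(tokens, labels):
    print(token, label)
```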
To continue with the coordination issue raised above:
Though the vast majority of the Jews affected and killed during Holocaust were of Ashkenazi descent, Sephardi and Mizrahi Jews suffered greatly as well.
Where we opted for this annotation:
<ENAMEX type="PERSON_TYPE">Sephardi and Mizrahi Jews</ENAMEX>
What would be the correct annotation for this structure:
<ENAMEX type="MEASURE">first</ENAMEX> time a party other than the <ENAMEX type="PERSON_TYPE">Conservatives</ENAMEX> or <ENAMEX type="PERSON_TYPE">Labour</ENAMEX> had topped a nationwide poll in <ENAMEX type="PERIOD">108 years</ENAMEX>
OR
<ENAMEX type="MEASURE">first</ENAMEX> time a party other than the <ENAMEX type="PERSON_TYPE">Conservatives or Labour</ENAMEX> had topped a nationwide poll in <ENAMEX type="PERIOD">108 years</ENAMEX>
To answer your question, there was a comment regarding it above.
But I don't think it's exactly the same problem: in the comment above we had a coordination of two entities ("Sephardi" and "Mizrahi") both related to another one ("Jews"). In the issue Wigdan raises, we have two well-separated entities ("Conservatives" / "Labour").
I don't think we should annotate them together. Imagine if we had "New-York and Paris are big cities": would we annotate them as one entity?
I agree, that's why I introduced the issue as "raised up"... So, if I understand well, we annotate the coordination "Conservative or Labour" as PERSON_TYPE. That was my intuition, but I wanted to double check.
That was a response to Luca!
Don't worry I understood. I think @everzeni is right.
I think they talk about "Party" (first time a party other than the Conservative or Labour)
<ENAMEX type="MEASURE">first</ENAMEX> time a party other than the <ENAMEX type="PERSON_TYPE">Conservatives or Labour</ENAMEX> had topped a nationwide poll in <ENAMEX type="PERIOD">108 years</ENAMEX>
I thought the structure then means (Conservative or Labour Party); it thus matches the analysis of (Sephardi and Mizrahi Jews), where the head of the NP is Jews and Sephardi and Mizrahi come as modifiers.
I agree that if we had "Conservative or Labour party" we would annotate it as one entity. But I would say it's different here because:
- "party" is far away from "Conservatives or Labour"
- "Conservatives" = Conservative party / "Labour" = Labour party
- "Sephardi and Mizrahi Jews": "Sephardi" alone would have been only half an entity
Personally I would apply the largest entity principle strictly and in a uniform manner, to allow uniform cascading of the NER within the entity, so:
<ENAMEX type="LOCATION">New-York and Paris</ENAMEX> are big cities.
... a party other than the <ENAMEX type="PERSON_TYPE">Conservatives or Labour</ENAMEX> had topped a nationwide poll...
This avoids relatively difficult analysis about what is an independent entity or not.
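Applying the principle uniformly to coordinations can be sketched as merging two same-type entities that are separated only by a coordinator. This is a minimal illustration, not project code: `merge_coordinated`, the `COORDINATORS` set and the `(start, end, type)` span format are assumptions.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end exclusive, entity type)

COORDINATORS = {"and", "or"}  # assumption: only these two are handled

def merge_coordinated(tokens: List[str], spans: List[Span]) -> List[Span]:
    """Merge adjacent same-type spans separated by exactly one coordinator token."""
    merged: List[Span] = []
    for start, end, etype in sorted(spans):
        if merged:
            p_start, p_end, p_type = merged[-1]
            gap = tokens[p_end:start]
            if p_type == etype and len(gap) == 1 and gap[0].lower() in COORDINATORS:
                merged[-1] = (p_start, end, etype)
                continue
        merged.append((start, end, etype))
    return merged

tokens = "New-York and Paris are big cities".split()
result = merge_coordinated(tokens, [(0, 1, "LOCATION"), (2, 3, "LOCATION")])
```

No judgement about whether the conjuncts are independent entities is needed, which is the point made above about avoiding that analysis.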
From my point of view it looks fine
I reopen this issue briefly, because it seems to us that we were wrong before about:
<ENAMEX type="NATIONAL">German</ENAMEX>-occupied <ENAMEX type="LOCATION">Poland</ENAMEX>
It seems to us now that the whole thing should be annotated as one LOCATION, because it refers to one territory with clear delimitations (set by Germany after it invaded Poland). Is it ok with you?
Absolutely @everzeni !
In order to be consistent, what would be the correct annotation to adopt for NPs including "NATIONAL" such as Australian, Indonesian, "TITLE" such as Prime Minister, President, and "PERSON" such as Malcolm Turnbull, Joko Widodo?
Below are 2 examples from the corpus:
Should we stress "TITLE" and annotate titles such as Prime Minister, as in the first example, and also "NATIONAL" for Australian, OR stress "NATIONAL" and thus annotate only Indonesian as "NATIONAL" in Indonesian president?