kermitt2 / grobid-ner

A Named-Entity Recogniser based on Grobid.
https://grobid-ner.readthedocs.io
Apache License 2.0

UNKNOWN class #39

Closed (ebenaissa closed 7 years ago)

ebenaissa commented 7 years ago

Consider the following sentence:

Although the exact origin of the plague is unknown, a young boy from a village in China is identified as the plague's official patient zero.

I was wondering about "patient zero", but also about diseases and other sequences that do not really belong to any of the other classes yet are referenced on Wikipedia (the Wikipedia page for patient zero: https://en.wikipedia.org/wiki/Index_case).

Should those examples be annotated as UNKNOWN? And if we don't annotate those examples at all, what kind of sequence can be labeled as UNKNOWN?

kermitt2 commented 7 years ago

First question: no. All these specific but common concepts are already enumerated in Wikipedia, so they will be caught by the entity disambiguation step. That is out of the scope of NER!

Named entity classes correspond more to particular classes of entities that cannot be enumerated exhaustively in advance.

Also for specialist terminology like biomedical stuff, we typically use specialized NER working on different resources and features... like... grobid-bio ;)

kermitt2 commented 7 years ago

Second question: UNKNOWN would be for proper names not covered by the other classes. It was introduced as a sort of safety net. I can imagine things like the name of a god, of a mythical creature, maybe the name of a conference series, ...

But I agree that it is very similar to CONCEPT and we could challenge it.

f-siham commented 7 years ago

I was wondering whether the entity « Horizon 2020 » belongs to the UNKNOWN class, since it is a research project. I also think that it sits between EVENT and ORGANISATION.

lfoppiano commented 7 years ago

From Wikipedia: "Horizon 2020 is a funding programme created by the European Union/European Commission to support and foster research in the European Research Area (ERA)."

How can we define a funding program?

I think it is neither an EVENT nor an ORGANISATION. I would be tempted to annotate it as CONCEPT or ARTIFACT.

(I think that if we don't find a class, then UNKNOWN would be the most appropriate one.)

What do the others think?

everzeni commented 7 years ago

I don't think it fits in any class, but for me it's a mix of EVENT, ORGANISATION/INSTITUTION and LEGAL... definitely not CONCEPT nor ARTIFACT :no_mouth:

Other examples come to mind, like the Marshall Plan, or things like research projects (Parthenos, Parsiti, names of ANRs, etc.).

wigdan commented 7 years ago

For CONCEPT, we could try to eliminate it by saying that the final-suffix rule (-ism), as in Communism or Zionism, doesn't apply here.

wigdan commented 7 years ago

Does CREATION apply only to names of movies, songs, etc., and only to the artistic domain? If not, could we consider CREATION?

lfoppiano commented 7 years ago

OK, we need to make a decision. Let's annotate it as UNKNOWN, then.

kermitt2 commented 7 years ago

I think this example fits the original purpose of UNKNOWN well (indeed, like the Marshall Plan, Parthenos, ...). It's not CREATION, which is reserved for artistic matter. It's not CONCEPT, because a funding program is not an idea. It's not ARTIFACT, because that presupposes some sort of item, an embedding (even a mental work like software is embedded into an item, e.g. a computer-embedded invention). It's not an EVENT (it includes many events, and it is more than that).
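
The elimination reasoning above can be sketched in a few lines of Python. This is only an illustration of the argument, not something grobid-ner provides: the `REJECTED` table and the `choose_class` helper are hypothetical names, and the reasons are paraphrased from this comment.

```python
# Hypothetical helper, not part of grobid-ner: it only restates the
# elimination argument for "Horizon 2020" as data.
REJECTED = {
    "CREATION": "reserved for artistic matter",
    "CONCEPT": "a funding program is not an idea",
    "ARTIFACT": "presupposes an item in which the work is embedded",
    "EVENT": "it includes many events and is more than that",
}


def choose_class(candidates, rejected=REJECTED, fallback="UNKNOWN"):
    """Return the first candidate class that was not ruled out, else the fallback."""
    for label in candidates:
        if label not in rejected:
            return label
    return fallback


# Every class proposed in the thread has been ruled out, so UNKNOWN remains.
print(choose_class(["EVENT", "CONCEPT", "ARTIFACT", "CREATION"]))  # UNKNOWN
```

UNKNOWN acts here as the safety net described earlier in the thread: it is only chosen once every more specific class has been rejected with a stated reason.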

everzeni commented 7 years ago

We have a few unresolved questions about the following entities:

  1. Final Solution, Final Solution to the Jewish Question: the Nazi plan to exterminate the Jews
  2. Jewish Question
  3. Antisemitism Yellowbadge logo, Yellow badge
  4. Aktion T4 euthanasia programme, Aktion T4 (a mass murder programme)

UNKNOWN?

kermitt2 commented 7 years ago

My guess :D

  1. CONCEPT, this is an idea
  2. CONCEPT, this is an idea
  3. UNKNOWN
  4. UNKNOWN, it's more than just an idea

everzeni commented 7 years ago

I just annotated:

the terms "<ENAMEX type="CONCEPT">war crimes</ENAMEX>" and "<ENAMEX type="CONCEPT">crimes against humanity</ENAMEX>" were indeed correct labels for what happened.

but I have doubts. How does it look?
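
To review annotations like the one above, a small Python sketch can list the ENAMEX spans found in an annotated string. It assumes the inline form `<ENAMEX type="...">...</ENAMEX>` shown in this comment is the only markup present; `ENAMEX_RE` and `list_entities` are hypothetical names, not part of grobid-ner.

```python
import re

# Matches <ENAMEX type="CLASS">text</ENAMEX>; DOTALL lets the entity text
# span a line break inside the tag.
ENAMEX_RE = re.compile(r'<ENAMEX type="([^"]+)">(.*?)</ENAMEX>', re.DOTALL)


def list_entities(annotated):
    """Return (class, text) pairs, collapsing whitespace inside the entity text."""
    return [(cls, " ".join(text.split()))
            for cls, text in ENAMEX_RE.findall(annotated)]


sample = ('the terms "<ENAMEX type="CONCEPT">war crimes</ENAMEX>" and '
          '"<ENAMEX type="CONCEPT">\ncrimes against humanity</ENAMEX>" '
          'were indeed correct labels for what happened.')

print(list_entities(sample))
# [('CONCEPT', 'war crimes'), ('CONCEPT', 'crimes against humanity')]
```

A listing like this makes it easier to spot spans that are common expressions and not named entities.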

kermitt2 commented 7 years ago

I would say these are common expressions, not named entities, which are the ones that are hard to enumerate. So I would not annotate war crimes and crimes against humanity. In any case, these concepts will be caught by the disambiguation part of NERD using Wikipedia common knowledge.