Unexpected category in czech-cnec2.0-200831 model

ufal / nametag

NameTag: Named Entity Tagger

Mozilla Public License 2.0


Unexpected category in czech-cnec2.0-200831 model #13

Closed: matyaskopp closed this issue 3 years ago

matyaskopp commented 3 years ago

For this sentence:

Dobrý den, dámy a pánové, já bych si dovolil ještě navrhnout jednu změnu v pevném zařazení, a to konkrétně v bodu 68, sněmovní tisk 51, Výroční zprávy a účetní závěrky zdravotních pojišťoven za rok 2012, a to na pátek 14. 2. po bloku třetích čtení.

nametag returns an unexpected category, C - Bibliography container. This category is not defined in https://ufal.mff.cuni.cz/~strakova/cnec2.0/ne-type-hierarchy.pdf

foxik commented 3 years ago

The referenced image describes only the entity types. Apart from them, the hierarchy also includes container NEs, which are described in Chapter 2 of the technical report (Ševčíková et al., 2007) referenced in the CNEC 1/CNEC 2 description.

The container NEs are:

P for (complex) person names,
T for temporal expressions,
A for addresses,
C for bibliographic items.
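As a minimal sketch, the four container tags listed above can be kept in a lookup table so that downstream code does not treat them as unknown categories. The names `CONTAINER_TAGS` and `describe_tag` are hypothetical and not part of NameTag; only the four tags and their meanings come from this comment.

```python
# Container NE tags as listed in the comment above (hypothetical helper,
# not part of the NameTag API).
CONTAINER_TAGS = {
    "P": "(complex) person names",
    "T": "temporal expressions",
    "A": "addresses",
    "C": "bibliographic items",
}


def describe_tag(tag: str) -> str:
    """Return a short description for a tag, flagging container NEs."""
    if tag in CONTAINER_TAGS:
        return f"container NE: {CONTAINER_TAGS[tag]}"
    # Anything else is assumed to be one of the entity types from the
    # hierarchy image; this sketch does not validate it further.
    return f"entity type: {tag}"


if __name__ == "__main__":
    for tag in ("C", "P", "gu"):
        print(tag, "->", describe_tag(tag))
```

With such a table, the category C from the original report would be recognized as a container rather than reported as undefined.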

I agree the documentation on the web could mention them directly instead of referencing the report.

matyaskopp commented 3 years ago

The missing container types in the description are confusing, and so is the fact that "super-types" are distinguished from containers by a capital letter...

So, if you can, please rephrase this sentence: "The corpus uses 46 named entity types, which can be nested." (https://ufal.mff.cuni.cz/nametag/2/models) Yes, it says that CNEC contains exactly 46 types that can be nested, which is true, and it does not exclude the possibility of other types. But a careless reader (me) expects the sentence to cover all possible entities.

stranak commented 3 years ago

I agree it would be better to clarify the model description:

The corpus uses 46 named entity types, which can be nested.

I would change it to something like this (adapting the text of the paper in the references):

The corpus uses 46 atomic named entity types, which can be embedded (e.g., a river name can be part of the name of a city, as in <gu Ústí nad <gh Labem>>). There are also 4 so-called container NEs: two or more NEs are parts of a container NE (e.g., two NEs, a first name and a surname, together form a person name container NE such as in <P >). The 4 container NEs are marked with a capital one-letter tag: P for (complex) person names, T for temporal expressions, A for addresses, and C for bibliographic items.
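To illustrate the embedding described above, here is a rough sketch of a parser for the angle-bracket notation used in the example. It assumes the plain form `<tag text>` with nesting, and it applies the rule from the proposed text that a capital one-letter tag marks a container. The function name `parse_entities` and the person name in the usage example are made up for illustration; this is not NameTag's own parser or output format.

```python
from typing import List, Tuple


def parse_entities(annotated: str) -> List[Tuple[str, str, bool]]:
    """Return (tag, text, is_container) for each entity, innermost first."""
    entities: List[Tuple[str, str, bool]] = []
    stack: List[Tuple[str, List[str]]] = []  # open entities: (tag, text pieces)
    i, n = 0, len(annotated)
    while i < n:
        ch = annotated[i]
        if ch == "<":
            # The tag runs up to the next whitespace or angle bracket.
            j = i + 1
            while j < n and not annotated[j].isspace() and annotated[j] not in "<>":
                j += 1
            stack.append((annotated[i + 1:j], []))
            # Skip the single space separating the tag from its text.
            if j < n and annotated[j] == " ":
                j += 1
            i = j
        elif ch == ">":
            if not stack:
                raise ValueError("unbalanced '>' in annotation")
            tag, pieces = stack.pop()
            text = "".join(pieces)
            # Rule from the text: containers are capital one-letter tags.
            is_container = len(tag) == 1 and tag.isupper()
            entities.append((tag, text, is_container))
            if stack:
                # An embedded entity's text is also part of its parent.
                stack[-1][1].append(text)
            i += 1
        else:
            if stack:
                stack[-1][1].append(ch)
            i += 1
    if stack:
        raise ValueError("unclosed '<' in annotation")
    return entities


if __name__ == "__main__":
    print(parse_entities("<gu Ústí nad <gh Labem>>"))
    # Invented person name, for illustration only.
    print(parse_entities("<P <pf Jan> <ps Novák>>"))
```

The inner entity is reported before the outer one because an entity is only emitted once its closing bracket is seen.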

strakova commented 3 years ago

Thanks for your suggestions and sorry for the confusion. I have explained the existence of the NE containers in both the CNEC and NameTag2 documentation.