proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.
http://proycon.github.io/folia/
GNU General Public License v3.0

missing set in annotation declaration #54

Closed kosloot closed 5 years ago

kosloot commented 6 years ago

Both the C++ and the Python implementation seem to accept annotation declarations without a set. For example, from example.xml: <token-annotation annotator="ilktok" annotatortype="auto" />

The documentation states: The set attribute is mandatory

with a footnote: Technically, it can be omitted, but then the set defaults to “undefined”. This is allowed for flexibility and less explicit usage of FoLiA in limited settings, but not recommended!
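To see how common such declarations are, a document can be scanned for them. The following is a minimal sketch using only the Python standard library, not the FoLiA library itself. It assumes that declarations are elements whose name ends in `-annotation`; the function name, the sample document (shown without its XML namespace for brevity), and the set name `hypothetical-pos-set` are made up for illustration.

```python
import xml.etree.ElementTree as ET


def declarations_without_set(xml_text):
    """Return the names of declaration elements that lack a 'set' attribute."""
    root = ET.fromstring(xml_text)
    missing = []
    for elem in root.iter():
        # Strip a possible "{namespace}" prefix so namespaced documents work too.
        name = elem.tag.rsplit("}", 1)[-1]
        if name.endswith("-annotation") and "set" not in elem.attrib:
            missing.append(name)
    return missing


# Hypothetical sample; only the token-annotation line is taken from the issue.
SAMPLE = """<FoLiA>
  <metadata>
    <annotations>
      <token-annotation annotator="ilktok" annotatortype="auto" />
      <pos-annotation set="hypothetical-pos-set" annotator="x" annotatortype="auto" />
    </annotations>
  </metadata>
</FoLiA>"""

print(declarations_without_set(SAMPLE))
```

Running this over a corpus would give a rough idea of whether set-less declarations really only occur in test files.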

I think this is too lax, and set names should be mandatory unconditionally. For instance, we run into trouble when a module wants to add another token-annotation: by definition there is then no default set anymore, but it is rather complicated or impossible to assign a set to the already existing tokens to distinguish them from the newly added ones.

AFAIK, these nameless declarations are quite rare, probably only in test files? We could investigate this, but NOT allowing this is important.

proycon commented 6 years ago

Agreed that this is too lax and we should probably remove this behaviour. It is used in practice though, so it would be a change from FoLiA v1.6 onward, as we can't demand this from older versions due to backward compatibility. On a related note: I think we should also be strict in demanding declarations for structural elements (token, paragraph, sentence). These are now optional, but if you really want the declarations to be meaningful, they had better be strict too.

For the lazy users we can provide a tool that automatically generates some ad-hoc declarations.
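One part of such a tool could be filling in a placeholder set name wherever a declaration lacks one. The sketch below is only an illustration under stated assumptions, not an existing FoLiA tool: the function name `add_adhoc_sets`, the placeholder `adhoc-set`, and the namespace-free sample document are all hypothetical, and declarations are again assumed to be elements whose name ends in `-annotation`.

```python
import xml.etree.ElementTree as ET


def add_adhoc_sets(xml_text, placeholder="adhoc-set"):
    """Give every set-less declaration a placeholder set name.

    Returns the rewritten XML and the number of declarations changed.
    """
    root = ET.fromstring(xml_text)
    changed = 0
    for elem in root.iter():
        # Ignore a possible "{namespace}" prefix on the tag.
        name = elem.tag.rsplit("}", 1)[-1]
        if name.endswith("-annotation") and "set" not in elem.attrib:
            elem.set("set", placeholder)  # hypothetical ad-hoc set name
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


# Hypothetical sample; only the token-annotation line is taken from the issue.
SAMPLE = """<FoLiA>
  <metadata>
    <annotations>
      <token-annotation annotator="ilktok" annotatortype="auto" />
    </annotations>
  </metadata>
</FoLiA>"""

upgraded, count = add_adhoc_sets(SAMPLE)
print(count)
print(upgraded)
```

As long as a document has only one declaration per annotation type, that declaration acts as the default, so the annotations themselves would not need to be touched in this simple case.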

kosloot commented 6 years ago

yes, let's do this:

I am still a bit reluctant about making declarations mandatory, but in the long run it might be needed, so let's start requiring this too. A conversion script that adds some default declarations might be more difficult, though.

proycon commented 6 years ago

Just for clarity: of course the libraries should still remain capable of parsing pre-1.6 documents (with the missing setnames and all). The upgrade script is not a replacement for that but just an additional tool.

proycon commented 5 years ago

We do have something extra to consider: for certain annotation types the set is optional (this applies to a lot of structure elements), or, in rare cases, perhaps not present at all. In such cases a declaration without a set is permitted.

I also want to enforce that if there is a set, then there must be a class on the annotations (and obviously if there is no set, there can't be a class on annotations).
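The rule proposed here can be written down as a small check. This is a sketch of the logic only, not code from either library; the function name `check_set_class` and the example values are hypothetical.

```python
def check_set_class(declared_set, annotation_class):
    """Return an error message if set and class are inconsistent, else None.

    declared_set:     set name from the declaration, or None if it has no set
    annotation_class: class on the annotation, or None if it has no class
    """
    if declared_set is not None and annotation_class is None:
        return "a set is declared, so the annotation must have a class"
    if declared_set is None and annotation_class is not None:
        return "no set is declared, so the annotation cannot have a class"
    return None


# Hypothetical set and class names, for illustration only.
print(check_set_class("hypothetical-pos-set", "N"))   # consistent
print(check_set_class("hypothetical-pos-set", None))  # class missing
print(check_set_class(None, "N"))                     # class without set
print(check_set_class(None, None))                    # consistent
```

Set and class must either both be present or both be absent, which is what makes a set-less declaration for structure elements compatible with the stricter rule.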

proycon commented 5 years ago

In summary, for FoLiA 2.0: