proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.
http://proycon.github.io/folia/
GNU General Public License v3.0

[proposal] Adding an "explicit" XML serialisation form to facilitate the job of parsers #84

Closed proycon closed 4 years ago

proycon commented 4 years ago

Some of the logic in FoLiA is opaque when looking at the XML structure and is instead handled specifically by the libraries. This concerns, for example, the handling of default values, most of which are resolved by taking the declaration section into account.

Making these implicit makes sense, as not doing so leads to a great deal of redundancy, and thus to even bigger FoLiA files and a larger memory footprint. It does, however, come with a downside: parsers need to implement this logic if they want to fully parse a document.
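To illustrate the kind of logic a simple parser would otherwise have to implement, here is a minimal sketch of resolving a default set from the declarations. It assumes a heavily simplified, namespace-free FoLiA-like document, and it assumes that a set acts as the default only when it is the sole declaration for that annotation type. The function name `make_explicit` and the example set URL are hypothetical, not part of any FoLiA library.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free stand-in for a FoLiA document (assumption).
DOC = """
<FoLiA>
  <metadata>
    <annotations>
      <pos-annotation set="https://example.org/pos-set"/>
    </annotations>
  </metadata>
  <text>
    <w id="w.1"><t>fish</t><pos class="N"/></w>
  </text>
</FoLiA>
"""

def make_explicit(root):
    """Write the implicit default set onto every annotation that lacks one."""
    # Collect the declared sets per annotation type, e.g. "pos" -> [set, ...].
    declared = {}
    for decl in root.find("metadata/annotations"):
        if decl.tag.endswith("-annotation") and decl.get("set"):
            tag = decl.tag[: -len("-annotation")]
            declared.setdefault(tag, []).append(decl.get("set"))
    # Assumption: a set is the default only if it is the single declaration.
    defaults = {tag: sets[0] for tag, sets in declared.items() if len(sets) == 1}
    for elem in root.iter():
        if elem.tag in defaults and elem.get("set") is None:
            elem.set("set", defaults[elem.tag])
    return root

root = make_explicit(ET.fromstring(DOC))
pos = root.find(".//pos")
print(pos.attrib)
```

A parser reading the result no longer has to consult the declaration section to know which set the `<pos>` element belongs to.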

I propose we implement an alternative "explicit" XML serialisation form (which is just a small superset of the normal form) that makes all these implicit details explicit in the XML to facilitate the job for simpler parsers. It is important to note that both serialisations describe the exact same model and can always be converted from one to the other. The folia validators (foliavalidator, folialint) could perform this conversion if requested.

One other possibility I'm still on the fence about is explicitly linking from word tokens to their span annotations, using for instance a <spanref id="entity.1"> in the scope of every <w> that is part of that entity. This would be the reverse of the <wref>.
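A rough sketch of what that reverse linking could look like, assuming a simplified, namespace-free sentence and the `<spanref>` element exactly as floated above (it is only a proposal here, not an existing FoLiA element). The helper name `add_spanrefs` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free sentence with one entity span (assumption).
DOC = """
<s id="s.1">
  <w id="w.1"><t>John</t></w>
  <w id="w.2"><t>Smith</t></w>
  <w id="w.3"><t>sleeps</t></w>
  <entities>
    <entity id="entity.1" class="per">
      <wref id="w.1"/>
      <wref id="w.2"/>
    </entity>
  </entities>
</s>
"""

def add_spanrefs(root):
    """Add a <spanref> to each <w> for every span annotation that references it."""
    words = {w.get("id"): w for w in root.iter("w")}
    # Materialise the list first, since the tree is modified inside the loop.
    for entity in list(root.iter("entity")):
        for wref in entity.findall("wref"):
            word = words.get(wref.get("id"))
            if word is not None:
                ET.SubElement(word, "spanref", id=entity.get("id"))
    return root

root = add_spanrefs(ET.fromstring(DOC))
refs = {
    w.get("id"): [r.get("id") for r in w.findall("spanref")]
    for w in root.iter("w")
}
print(refs)
```

With this in place, a parser walking the tokens can find the spans a word takes part in without first building its own index from the `<wref>` elements.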

The aim of this explicit form is to help parsers, especially those not implementing the full FoLiA logic. The default serialisation remains the 'normal' one. Parsers that cannot deal with a document in normal form should themselves invoke foliavalidator/folialint to convert it to explicit form before parsing it. I'd rather not see this burden shifted to end-users.

kosloot commented 4 years ago

As such, this may be an excellent idea. I assume that documents with form="explicit" or typegroup=".." attributes are no longer validated correctly by folialint?

I suggest adding an issue about this to libfolia.

proycon commented 4 years ago

I already implemented it in libfolia (the commit references this issue) ;) Still pending release though.

kosloot commented 4 years ago

Ah, I overlooked that. So you took a shortcut by just ignoring the new attributes?

A more solid solution would of course be to add an explicit mode to libfolia/folialint and have those in the output too, when desired. But not a real showstopper.

proycon commented 4 years ago

Yes, the new attributes by definition don't convey any information that needs to be parsed by the libraries, so I can just ignore them. The most important thing is that documents in explicit mode validate properly.

Adding the possibility for explicit serialisation in libfolia is an option, yes, but not really a priority or requirement.