[proposal] Adding an "explicit" XML serialisation form to facilitate the job of parsers

proycon commented 4 years ago

Some of the logic in FoLiA is opaque when looking at the XML structure and is instead handled specifically by the libraries. This concerns, for example, the handling of default values, most of which are resolved by taking the declaration section into account:

- the default set for a particular annotation is obtained by looking at the declarations.
- likewise for the default processor or, in FoLiA v1, the default annotator/annotatortype.
- the default class for text content <t> is current.
- the set of a span layer is implicit from the span annotations within.
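
To make the burden on parsers concrete, here is a minimal sketch of the lookups a parser has to do for the first and third points. It uses a simplified, namespace-free FoLiA-like fragment, and the helper names (`default_sets`, `resolve_set`, `resolve_text_class`) are made up for illustration; they are not part of any FoLiA library.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free FoLiA-like fragment (an assumption for brevity;
# real documents use an XML namespace and carry more metadata).
DOC = """
<FoLiA version="2.0">
  <metadata>
    <annotations>
      <pos-annotation set="https://example.org/pos-set"/>
    </annotations>
  </metadata>
  <text>
    <s>
      <w><t>house</t><pos class="N"/></w>
    </s>
  </text>
</FoLiA>
"""


def default_sets(root):
    """Map annotation type -> set, for types declared exactly once."""
    declared = {}
    for decl in root.findall("./metadata/annotations/*"):
        if decl.tag.endswith("-annotation") and "set" in decl.attrib:
            annotation_type = decl.tag[: -len("-annotation")]
            declared.setdefault(annotation_type, []).append(decl.get("set"))
    # Assumption: a default only exists when the declaration is unambiguous.
    return {t: sets[0] for t, sets in declared.items() if len(sets) == 1}


def resolve_set(element, defaults):
    """Return the element's own set if present, else the declared default."""
    return element.get("set", defaults.get(element.tag))


def resolve_text_class(t_element):
    """Text content without a class attribute is in the 'current' class."""
    return t_element.get("class", "current")


root = ET.fromstring(DOC)
defaults = default_sets(root)
pos = root.find(".//pos")
t = root.find(".//t")
print(resolve_set(pos, defaults))
print(resolve_text_class(t))
```

Neither the `<pos>` nor the `<t>` element carries the information itself; both values come from logic outside the element.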

Keeping these implicit makes sense, as not doing so leads to a great amount of redundancy, with even bigger FoLiA files and a larger memory footprint. It does, however, come with a downside: parsers need to implement this logic if they want to fully parse a document.

I propose we implement an alternative "explicit" XML serialisation form (which is just a small superset of the normal form) that makes all these implicit details explicit in the XML to facilitate the job for simpler parsers. It is important to note that both serialisations describe the exact same model and can always be converted from one to the other. The folia validators (foliavalidator, folialint) could perform this conversion if requested.

The explicit form is declared by the attribute form="explicit" in the FoLiA tag. When form is absent or not set to explicit, behaviour is unchanged.
Defaults are made explicit:
- All text-content elements explicitly declare their class (so <t> will become <t class="current">)
- All annotations that carry a set have a set attribute; sets never refer to aliases.
- All annotations associated with a processor have a processor attribute.
- Layers carry a set attribute if the span elements within carry a set.
Certain FoLiA internals are made explicit:
- All annotation elements get a typegroup attribute that makes explicit what kind of annotation element we are dealing with. Values are: structure, inline, span, higherorder, markup, layer. So <w> becomes <w typegroup="structure">, <pos> becomes <pos typegroup="inline">. This allows for example xpath expressions like: give me the deepest structural ancestor.
- Non-authoritativeness is expressed explicitly (this reverses #56, but only for explicit form of course)
IDs are mandatory on all annotation elements: the library will auto-generate IDs where needed, and these IDs will adhere to some kind of convention that marks them as autogenerated (two leading and trailing underscores, for example?) so they can be stripped again if needed. (Idea discarded for explicit mode because it would add extra information not present in the normal form.)
Serialise predefined features/subsets explicitly using 'feat' elements; do not use the attribute shortcut.
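
As a rough illustration of the conversion from normal to explicit form, the sketch below adds the form, class, set and typegroup attributes to a simplified, namespace-free fragment. The `TYPEGROUPS` table and the `make_explicit` function are hypothetical and deliberately incomplete; this is not how foliavalidator or any FoLiA library implements it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, deliberately incomplete mapping from element name to typegroup;
# a real converter would have to cover every FoLiA element.
TYPEGROUPS = {
    "s": "structure",
    "w": "structure",
    "pos": "inline",
    "entity": "span",
    "entities": "layer",
}

# Simplified, namespace-free document in normal form (an assumption for brevity).
NORMAL = """
<FoLiA version="2.0">
  <metadata>
    <annotations>
      <pos-annotation set="https://example.org/pos-set"/>
    </annotations>
  </metadata>
  <text>
    <s><w><t>house</t><pos class="N"/></w></s>
  </text>
</FoLiA>
"""


def make_explicit(root):
    """Add, in place, the attributes that the explicit form spells out."""
    root.set("form", "explicit")
    # Default sets, keyed by annotation type, taken from the declarations.
    defaults = {
        decl.tag[: -len("-annotation")]: decl.get("set")
        for decl in root.findall("./metadata/annotations/*")
        if decl.tag.endswith("-annotation") and "set" in decl.attrib
    }
    for element in root.iter():
        if element.tag == "t" and "class" not in element.attrib:
            element.set("class", "current")
        if element.tag in TYPEGROUPS:
            element.set("typegroup", TYPEGROUPS[element.tag])
        if element.tag in defaults and "set" not in element.attrib:
            element.set("set", defaults[element.tag])
    return root


root = make_explicit(ET.fromstring(NORMAL))
print(ET.tostring(root, encoding="unicode"))
```

After the conversion a simple parser can read the set, class and typegroup straight from each element without consulting the declarations.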

One other possibility I'm still on the fence about is explicitly linking from word tokens to their span annotations, using for instance a <spanref id="entity.1"> in the scope of every <w> that is part of that entity. This would be the reverse of the <wref>.
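
A small sketch of what generating such reverse links could look like, assuming the `<spanref>` element would take an `id` attribute as in the example above. The fragment is simplified and namespace-free, and `add_spanrefs` is a made-up helper, not an existing library function.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Simplified, namespace-free sentence with one entity span (an assumption).
DOC = """
<s>
  <w xml:id="w.1"><t>John</t></w>
  <w xml:id="w.2"><t>Smith</t></w>
  <w xml:id="w.3"><t>sleeps</t></w>
  <entities>
    <entity xml:id="entity.1" class="per">
      <wref id="w.1"/>
      <wref id="w.2"/>
    </entity>
  </entities>
</s>
"""


def add_spanrefs(root):
    """For every <wref> in an entity, add a <spanref> to the referenced word."""
    words = {w.get(XML_ID): w for w in root.iter("w")}
    # Materialise the list first so the tree is not modified while iterating.
    for span in list(root.iter("entity")):
        for wref in span.findall("wref"):
            word = words.get(wref.get("id"))
            if word is not None:
                ET.SubElement(word, "spanref", {"id": span.get(XML_ID)})
    return root


root = add_spanrefs(ET.fromstring(DOC))
refs = {
    w.get(XML_ID): [r.get("id") for r in w.findall("spanref")]
    for w in root.iter("w")
}
print(refs)
```

With this, a parser standing on a `<w>` can find its entities directly instead of scanning every span layer for matching `<wref>` elements.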

The aim of this explicit form is to help parsers, especially those not implementing the full FoLiA logic. The default serialisation remains the 'normal' one. Parsers that cannot deal with a document in normal form should themselves invoke foliavalidator/folialint to convert it to explicit form before parsing it. I'd rather not see this burden shifted to end-users.

kosloot commented 4 years ago

As such, this may be an excellent idea. I assume that documents with form="explicit" or typegroup=".." attributes are no longer validated correctly by folialint?

I suggest adding an issue about this to libfolia.

proycon commented 4 years ago

I already implemented it in libfolia (the commit references this issue) ;) Still pending release though.

kosloot commented 4 years ago

Ah, I overlooked that. So you took a shortcut by just ignoring the new attributes?

A more solid solution would of course be to add an explicit mode to libfolia/folialint and have those in the output too, when desired. But not a real showstopper.

proycon commented 4 years ago

Yes, the new attributes by definition don't convey any information that needs to be parsed by the libraries, so I can just ignore them. The most important thing is that documents in explicit mode validate properly.

Adding the possibility for explicit serialisation in libfolia is an option yes, but not really a priority or requirement.

proycon / folia #84