proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.
http://proycon.github.io/folia/
GNU General Public License v3.0

relation between lang-annotation, token-annotation and metadata language #62

Open kosloot opened 5 years ago

kosloot commented 5 years ago

Ok, this is a bit fuzzy, and maybe more a ucto or frog thing, but:

In the annotations of FoLiA documents you can have:

<lang-annotation set="http://raw.github.com/proycon/folia/master/setdefinitions/iso639_3.foliaset"/>

and

<token-annotation annotator="ucto" annotatortype="auto" datetime="2017-06-27T17:05:09" set="tokconfig-nld"/>

but also meta-data:

<meta id="language">nld</meta>

And, in normal use, there may be 'lang' nodes connected to structure elements:

<lang class="nld" set="http://raw.github.com/proycon/folia/master/setdefinitions/iso639_3.foliaset"/>

All these are more or less related. Ucto (and indirectly Frog) assumes that the class attribute refers to a language for which a tokconfig-* file exists (like tokconfig-nld), and that this file is also named in a token-annotation declaration. The class itself should also be present in the set named by lang-annotation.

The meta field is assumed to be the default language of the document.

These are all 'weak' relations: some of them are enforced, others only sometimes. Real checking is only partially done.

For instance, in the Huygens data we see nodes like:

<lang class="UNKNOWN" set="http://raw.github.com/proycon/folia/master/setdefinitions/iso639_3.foliaset"/>

I wonder if 'UNKNOWN' is present in the iso639_3 set, but it is certainly NOT available as a ucto configuration file. At the moment ucto/frog will ignore this class and use the default class (probably 'nld') to tokenise the text at hand. This is questionable.

So... Is it possible/desirable to have a tighter coupling? I mean: enforcing that the mentioned languages are available in both ISO and ucto? And using the `<meta>` language as the default.
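A rough sketch of what such a tighter check could look like, in Python with only the standard library. The function name `check_languages`, the trimmed-down sample document (not a valid FoLiA document), and the two code sets are assumptions for illustration. Neither ucto nor the FoLiA library is claimed to work this way, and whether UNKNOWN is really absent from the iso639_3 set is not verified here.

```python
import xml.etree.ElementTree as ET

# The FoLiA XML namespace, in ElementTree's {uri}tag notation.
FOLIA_NS = "{http://ilk.uvt.nl/folia}"


def check_languages(xml_text, iso_codes, ucto_languages):
    """Return {lang class: [where it is missing]} for every <lang> node
    whose class is absent from the ISO set, the ucto configs, or both."""
    root = ET.fromstring(xml_text)
    problems = {}
    for lang in root.iter(FOLIA_NS + "lang"):
        cls = lang.get("class")
        missing = []
        if cls not in iso_codes:
            missing.append("iso")
        if cls not in ucto_languages:
            missing.append("ucto")
        if missing:
            problems[cls] = missing
    return problems


# Stripped-down sample: only the parts relevant to the check.
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia">
  <text>
    <p><lang class="nld"/></p>
    <p><lang class="UNKNOWN"/></p>
  </text>
</FoLiA>"""

# Hypothetical excerpts; a real check would load the set definition
# and list the installed tokconfig-* files instead.
ISO_CODES = {"nld", "eng", "deu"}
UCTO_LANGUAGES = {"nld", "eng"}

print(check_languages(SAMPLE, ISO_CODES, UCTO_LANGUAGES))
```

With a check like this, a class such as UNKNOWN would be reported up front instead of being silently replaced by the default.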

As a sidenote: Older FoLiA versions refer to token-annotations like:

<token-annotation annotator="ucto" annotatortype="auto" datetime="2017-06-27T17:05:09" set="tokconfig-nl"/>

(so 'nl', not 'nld'). This leads to various problems, like adding the 3-letter version too, which has a side effect: there is NO default token-annotation anymore. But this is not directly solvable. Just a nuisance.
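For illustration only, one way a reader of such documents could treat the old and new names as the same set is to normalise the 2-letter suffix before comparing. The helper `normalise_set_name` and the small lookup table are assumptions, not something ucto is known to do, and this does not solve the missing-default problem in the document itself.

```python
# Small, hand-picked excerpt of ISO 639-1 to ISO 639-3 mappings.
ISO_639_1_TO_3 = {"nl": "nld", "en": "eng", "de": "deu", "fr": "fra"}

PREFIX = "tokconfig-"


def normalise_set_name(set_name):
    """Rewrite e.g. 'tokconfig-nl' to 'tokconfig-nld'.

    Names without the tokconfig- prefix, or with a suffix that is not
    in the lookup table, are returned unchanged."""
    if not set_name.startswith(PREFIX):
        return set_name
    code = set_name[len(PREFIX):]
    return PREFIX + ISO_639_1_TO_3.get(code, code)


print(normalise_set_name("tokconfig-nl"))
print(normalise_set_name("tokconfig-nld"))
```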

proycon commented 5 years ago

The fact that ucto encodes the language in the set name is indeed not something to rely on, as it's just a convention of ucto itself. I suppose it can only be used by ucto/frog to determine whether it has already tokenised something itself.

The best identification is the <lang-annotation> against a predefined set (like iso-639-3), as it's the most fine-grained, and ucto could easily map this to our own configuration filename convention. One could fall back to the language field in the metadata, but the disadvantage there is that FoLiA's native metadata scheme is limited and doesn't strictly constrain the values (so there's no guarantee this is iso-639-3). We're deliberately limited there because other metadata schemes (CMDI, Dublin Core) solve this problem already and can be used with FoLiA just as well.
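As a sketch of that lookup order (the `<lang>` class first, the metadata language as fallback), something like the following could work. The function `resolve_tokconfig`, the simplified sample document, and the set of available configurations are assumptions for illustration; this is not how ucto is implemented. The second return value records where the language came from, so a caller can tell when the fallback was used.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "{http://ilk.uvt.nl/folia}"


def resolve_tokconfig(element, root, available):
    """Return (config name, source) for a structure element.

    source is 'lang' when the element's own <lang> class was usable,
    'meta' when the document-level metadata language was used, and
    the result is (None, None) when neither maps to a known config."""
    candidates = []
    lang = element.find(FOLIA_NS + "lang")
    if lang is not None:
        candidates.append((lang.get("class"), "lang"))
    for meta in root.iter(FOLIA_NS + "meta"):
        if meta.get("id") == "language":
            candidates.append(((meta.text or "").strip(), "meta"))
    for code, source in candidates:
        # Map the language code onto ucto's filename convention.
        if code and "tokconfig-" + code in available:
            return "tokconfig-" + code, source
    return None, None


# Simplified sample, not a valid FoLiA document.
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia">
  <metadata>
    <meta id="language">nld</meta>
  </metadata>
  <text>
    <p><lang class="eng"/></p>
    <p><lang class="UNKNOWN"/></p>
    <p/>
  </text>
</FoLiA>"""

AVAILABLE = {"tokconfig-nld", "tokconfig-eng"}  # hypothetical installed configs

root = ET.fromstring(SAMPLE)
results = [resolve_tokconfig(p, root, AVAILABLE) for p in root.iter(FOLIA_NS + "p")]
print(results)
```

Since the native metadata value is not constrained, the fallback only succeeds here when it happens to match an available configuration name.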