proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.
http://proycon.github.io/folia/
GNU General Public License v3.0

May a processor be assigned to a <text> element? #96

Closed kosloot closed 3 years ago

kosloot commented 3 years ago

In FoLiA-abby (from foliautils) I thought it would be wise to also assign a processor to the top-level <text> node. But the resulting FoLiA is rejected by foliavalidator, while folialint doesn't complain.

> foliavalidator text.xml
VALIDATION ERROR on full parse by library (stage 2/3), in text.xml
ValueError: Unable to set processor on Text. AnnotationType is None!

Isn't this allowed? Is it even wise to do, and if so: how?

example:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="FA-auch" generator="libfolia-v2.9" version="2.5.0">
  <metadata type="native">
    <annotations>
      <paragraph-annotation set="FoLiA-abby-set">
        <annotator processor="FoLiA-abby.1"/>
      </paragraph-annotation>
      <division-annotation set="FoLiA-abby-set">
        <annotator processor="FoLiA-abby.1"/>
      </division-annotation>
      <text-annotation set="FoLiA-abby-set">
        <annotator processor="FoLiA-abby.1"/>
      </text-annotation>
    </annotations>
    <provenance>
      <processor xml:id="FoLiA-abby.1" begindatetime="2021-05-01T09:13:07" command="FoLiA-abby -O out" folia_version="2.5.0" host="kokos" name="FoLiA-abby" user="sloot" version="0.17">
        <processor xml:id="FoLiA-abby.1.generator" folia_version="2.5.0" name="libfolia" type="generator" version="2.9"/>
      </processor>
    </provenance>
    <meta id="abby_file">auch.xml</meta>
  </metadata>
  <text xml:id="FA-auch.text" processor="FoLiA-abby.1">
    <div xml:id="FA-auch.text.div.1">
      <p xml:id="FA-auch.text.div.1.p.1">
        <t class="OCR">Some text</t>
        <feat class="Justified" subset="par_align"/>
      </p>
    </div>
  </text>
</FoLiA>
kosloot commented 3 years ago

To add some context: in FoLiApy, setting a processor is explicitly forbidden when the ANNOTATIONTYPE is None:

line 1087 of main.py reads:

        if self.ANNOTATIONTYPE is None:
            raise ValueError("Unable to set processor on " + self.__class__.__name__ + ". AnnotationType is None!")

Such a test is not implemented in libfolia. But that could easily be added.

So the problem boils down to: what is the reason for this restriction on <text>? I somewhat understand that <text> doesn't have an ANNOTATIONTYPE, but why forbid it from having a processor assigned?
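
A minimal standalone sketch of the guard quoted above. The class and method names (`Element`, `Paragraph`, `Text`, `set_processor`) are hypothetical stand-ins for illustration and are not FoLiApy's actual API; only the check and its error message are taken from the quoted lines.

```python
class Element:
    # Hypothetical base class: elements without a declaration keep None here.
    ANNOTATIONTYPE = None

    def __init__(self):
        self.processor = None

    def set_processor(self, processor_id):
        # Same guard as in the lines quoted from main.py.
        if self.ANNOTATIONTYPE is None:
            raise ValueError("Unable to set processor on " + self.__class__.__name__ + ". AnnotationType is None!")
        self.processor = processor_id


class Paragraph(Element):
    # Stand-in for an element that has its own declaration (paragraph-annotation).
    ANNOTATIONTYPE = "paragraph"


class Text(Element):
    # Stand-in for the root body element: no annotation type of its own.
    pass


paragraph = Paragraph()
paragraph.set_processor("FoLiA-abby.1")  # accepted

try:
    Text().set_processor("FoLiA-abby.1")
    error_message = None
except ValueError as error:
    error_message = str(error)

print(error_message)
```

With these stand-ins, the paragraph accepts the processor and the body element raises the same message that foliavalidator reported.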

kosloot commented 3 years ago

To explore this further: my intention is to register which tool is the original creator of the FoLiA document, including <text> and deeper nodes. For the deeper nodes like <div> and <p> this is no problem, as a processor can be assigned, but not for <text>.

But this made me think a bit further, and I would suggest adding some mechanism to FoLiA to register the original creator of the document (a lot of XML tools have such a label somewhere). This could be a simple mention in the metadata referring to the processor that is the original creator. Or, more fancy, a reserved tag <creator> in the provenance, which is simply a processor, but with a 'primus inter pares' role.

Still, it would be necessary to allow for adding this 'processor' or 'creator' to ALL tags. @proycon does this make sense?

proycon commented 3 years ago

> I somewhat understand that <text> doesn't have an ANNOTATIONTYPE, but why forbid it to have a processor assigned.

Yes, it's the root body element (either <text> or <speech>) which does not have a separate annotation type of its own. Since it has no specific declaration there's no way to attach a default processor. An explicit one as you suggest might be possible, but it may be a bit redundant, let's get into the next point:

> But this made me think a bit further, and I would suggest adding some mechanism to FoLiA to register the original creator of the document (a lot of XML tools have such a label somewhere). This could be a simple mention in the metadata referring to the processor that is the original creator. Or, more fancy, a reserved tag <creator> in the provenance, which is simply a processor, but with a 'primus inter pares' role.

Well, technically the 'creator' of a document is by definition the first processor in the provenance chain (assuming it's a complete provenance chain). The order of processors is significant, so simply grabbing the first <processor> should already give you what you want, I think. The assumption is that the first processor is always the one that generated the FoLiA document (including the initial body text/speech tag). I don't think we need a special tag for that. Similarly, to find the latest processor you grab the last one in the provenance chain.
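
As a rough illustration of grabbing the first and last processor, here is a sketch using only Python's standard library rather than a FoLiA library. The sample document and the second processor (`tokenizer.1`, `hypothetical-tokenizer`) are made up for the example, and the sketch assumes the provenance chain is complete.

```python
import xml.etree.ElementTree as ET

NS = {"f": "http://ilk.uvt.nl/folia"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Made-up sample; the second processor is hypothetical.
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example" version="2.5.0">
  <metadata type="native">
    <provenance>
      <processor xml:id="FoLiA-abby.1" name="FoLiA-abby">
        <processor xml:id="FoLiA-abby.1.generator" name="libfolia" type="generator"/>
      </processor>
      <processor xml:id="tokenizer.1" name="hypothetical-tokenizer"/>
    </provenance>
  </metadata>
  <text xml:id="example.text"/>
</FoLiA>"""


def top_level_processors(root):
    """Return the direct <processor> children of <provenance>, in document order."""
    provenance = root.find("f:metadata/f:provenance", NS)
    if provenance is None:
        return []
    # findall with a plain tag name matches direct children only,
    # so nested sub-processors are not included.
    return provenance.findall("f:processor", NS)


root = ET.fromstring(SAMPLE)
processors = top_level_processors(root)

creator_id = processors[0].get(XML_ID) if processors else None
latest_id = processors[-1].get(XML_ID) if processors else None

print("creator:", creator_id)
print("latest:", latest_id)
```

Only direct children of <provenance> are considered here, on the assumption that nested processors such as the generator are sub-processors of their parent rather than separate steps in the chain.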

There's of course also always room for arbitrary metadata fields like <meta id="creator"></meta>, but that's probably not what you're looking for here.
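
To make the earlier point about declarations concrete, here is a hedged sketch of how a default processor could be looked up per element. The tag-to-declaration mapping is a hypothetical, partial table written for this example, and the rule "a declaration with exactly one annotator supplies the default" is an assumption of the sketch, not a statement of what any FoLiA library implements.

```python
import xml.etree.ElementTree as ET

NS = {"f": "http://ilk.uvt.nl/folia"}

# Hypothetical, partial mapping from element tag to its declaration tag.
# The root body elements have no declaration of their own.
ELEMENT_TO_DECLARATION = {
    "p": "paragraph-annotation",
    "div": "division-annotation",
    "t": "text-annotation",
    "text": None,
    "speech": None,
}

SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="FA-auch" version="2.5.0">
  <metadata type="native">
    <annotations>
      <paragraph-annotation set="FoLiA-abby-set">
        <annotator processor="FoLiA-abby.1"/>
      </paragraph-annotation>
      <division-annotation set="FoLiA-abby-set">
        <annotator processor="FoLiA-abby.1"/>
      </division-annotation>
    </annotations>
  </metadata>
  <text xml:id="FA-auch.text"/>
</FoLiA>"""


def default_processor(root, tag):
    """Return the default processor id for an element tag, or None if there is none."""
    declaration = ELEMENT_TO_DECLARATION.get(tag)
    if declaration is None:
        # No declaration to consult, so nothing to derive a default from.
        return None
    annotators = root.findall(f"f:metadata/f:annotations/f:{declaration}/f:annotator", NS)
    # Assumption of this sketch: only a single declared annotator acts as the default.
    if len(annotators) == 1:
        return annotators[0].get("processor")
    return None


root = ET.fromstring(SAMPLE)
print(default_processor(root, "p"))
print(default_processor(root, "text"))
```

For <p> the lookup finds the declared annotator, while for <text> there is nothing to look up, which is the situation described above.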

kosloot commented 3 years ago

I agree that it may be a bit redundant, but still, when more processors are involved, it looks a bit strange that <text> and <speech> don't name a processor. (Are there other tags with AnnotationType None?)

In general, all processors are involved in some form of annotation (tokenization, POS tagging, etc.). Generation is a bit different, and indeed is the first step by definition.

> (assuming it's a complete provenance chain).

This is an important caveat, as in practice this is not the case for a lot of FoLiA 'in the wild'. But we cannot change that; we can only be more scrupulous in the future.

> The assumption is that the first processor is always the one that generated the FoLiA document (including the initial body text/speech tag).

Ok, that makes sense, and I didn't realize that. But I think this is implemented correctly in all Provenance, so we are good there.

Nevertheless I think it would be nice to clarify things, and have a <meta id="creator"></meta> in the metadata, or even a somewhat more obligatory construct in the provenance to make explicit that we know who/what constructed the original file.

I will add the restriction on AnnotationType to libfolia too, to get folialint and libfolia in line. And also add a <meta id="creator"></meta> in the future. (I assume this can be done automagically when adding the first processor to a NEW document.)

kosloot commented 3 years ago

> And also add a <meta id="creator"></meta> in the future. (I assume this can be done automagically when adding the first processor to a NEW document.)

For now, I have ditched this plan. It is more trouble than it is worth (for instance, it would require parsing all <meta> nodes before the provenance).

Closing this issue now.