proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.
http://proycon.github.io/folia/
GNU General Public License v3.0

Problem with XML printing of annotator and annotatortype in Comment #73

Closed: ceramisch closed this issue 5 years ago

ceramisch commented 5 years ago

In PARSEME, we have a script developed by @silvioricardoc that converts cupt format (PARSEME's CoNLLU-plus version) to folia format. However, the script is not correctly converting annotator and annotatortype in comments. The problem is in the following method:

    def to_folia(self, parent: folia.AbstractElement):
        r"""Construct a `folia.Comment` object representing this comment.
        The object will be appended to `parent`.
        """
        child = parent.append(folia.Comment, value=self.text, **self._folia_kwargs())
        for sub in self.nested:
            sub.to_folia(child)

where self._folia_kwargs() returns a dict of key=value pairs, including annotator and annotatortype:

    def generic_properties(self):
        r"""Return properties that are common to all metadata."""
        return [('annotator', self.annotator),
                ('annotatortype', self.annotatortype),
                ('datetime', self.datetime),
                ('confidence', self.confidence)]

    def _folia_kwargs(self) -> dict:
        r"""Return kwargs for FoLiA instantiation of subclasses."""
        return dict((k, v) for (k, v) in self.generic_properties() if v is not None)
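
To make the filtering step concrete, here is a minimal standalone sketch of how the kwargs end up being built. The `Metadata` dataclass is a hypothetical stand-in for the real PARSEME class (which is not shown in full here), and the sample values are made up; only the two methods mirror the snippet above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metadata:
    """Hypothetical stand-in for the PARSEME metadata class (an assumption)."""
    annotator: Optional[str] = None
    annotatortype: Optional[str] = None
    datetime: Optional[str] = None
    confidence: Optional[float] = None

    def generic_properties(self):
        # Same (name, value) pairs as in the method quoted above.
        return [('annotator', self.annotator),
                ('annotatortype', self.annotatortype),
                ('datetime', self.datetime),
                ('confidence', self.confidence)]

    def _folia_kwargs(self) -> dict:
        # Drop unset properties so only real values are passed on as kwargs.
        return {k: v for (k, v) in self.generic_properties() if v is not None}


# Made-up sample values, for illustration only.
meta = Metadata(annotator="alice", annotatortype="manual")
kwargs = meta._folia_kwargs()
print(kwargs)
```

So annotator and annotatortype are present in the dict whenever they are set, which is why we suspect the loss happens later, at serialization time.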

The values seem to be correctly appended to the folia.Comment object but when we print the corresponding XML, the information is missing.

The problem can be reproduced by downloading the PARSEME tools:

pip3 install --upgrade folia
git clone https://gitlab.com/parseme/utilities.git
cd utilities/st-organizers
./to_folia.py --lang PT --input ../test/data/withmetadata.conllup > tmp.folia
./to_conllup.py --lang PT --input tmp.folia > tmp.conllup
diff ../test/data/withmetadata.conllup tmp.conllup

Information about annotator and annotatortype disappeared in the first conversion from conllup to folia. This didn't happen before...
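
A quick way to confirm that the loss happens in the first conversion is to inspect the intermediate FoLiA file directly. The sketch below is one possible check, using only the standard library. It assumes the attributes would be serialized directly on the comment elements; documents that use the newer provenance mechanism may carry this information differently, so treat a hit as a hint rather than proof. The helper name `comments_missing_annotator` and the `SAMPLE` fragment are hypothetical, and the fragment is hand-written and simplified (not a valid FoLiA document).

```python
import xml.etree.ElementTree as ET

# FoLiA XML namespace.
FOLIA_NS = "http://ilk.uvt.nl/folia"


def comments_missing_annotator(xml_text: str) -> list:
    """Return the text of every comment lacking annotator or annotatortype.

    Hypothetical helper: assumes both attributes sit directly on <comment>.
    """
    root = ET.fromstring(xml_text)
    missing = []
    for comment in root.iter(f"{{{FOLIA_NS}}}comment"):
        if "annotator" not in comment.attrib or "annotatortype" not in comment.attrib:
            missing.append((comment.text or "").strip())
    return missing


# Hand-written, simplified fragment for illustration (not a valid FoLiA file).
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia">
  <text>
    <comment annotator="alice" annotatortype="manual">kept</comment>
    <comment>lost</comment>
  </text>
</FoLiA>"""

missing = comments_missing_annotator(SAMPLE)
print(missing)
```

Running the same function on the contents of tmp.folia would list the comments whose annotator information did not survive the conversion.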

Could it be a bug in the folia library?

proycon commented 5 years ago

This looks like a backward-incompatibility problem in the new library. There is a new, more elaborate mechanism for handling annotators and provenance in general. I'll look into it, because the library should retain good backward compatibility.

proycon commented 5 years ago

I just released FoLiApy v2.1.3, which should fix this problem (please reopen if something still goes wrong).

You may already be aware of this, but just in case: make sure you use the latest version of FLAT if you plan on annotating FoLiA v2 documents; older versions won't be able to handle the newer documents.

In your script, you might also want to make use of the new provenance framework, by registering your script as a processor on Document instantiation, like this for example:

doc = folia.Document(id=....., processor=folia.Processor.create("parseme-utilities", version="0.1"))

Any added annotations will then automatically be associated with the processor. It's not mandatory but might be nice to have (other FoLiA tools including FLAT will also add themselves to the provenance chain).

ceramisch commented 5 years ago

It works now, thanks @proycon