
A number of command-line tools for working with FoLiA (Format for Linguistic Annotation). Includes validators, converters, visualisers, and more.
License: GNU General Public License v3.0

.. image:: https://github.com/proycon/foliatools/actions/workflows/foliatools.yml/badge.svg?branch=master
   :target: https://github.com/proycon/foliatools/actions/

.. image:: http://applejack.science.ru.nl/lamabadge.php/foliatools
   :target: http://applejack.science.ru.nl/languagemachines/

.. image:: https://www.repostatus.org/badges/latest/active.svg
   :alt: Project Status: Active – The project has reached a stable, usable state and is being actively developed.
   :target: https://www.repostatus.org/#active

.. image:: https://img.shields.io/pypi/v/folia-tools
   :alt: Latest release in the Python Package Index
   :target: https://pypi.org/project/folia-tools/

FoLiA Tools
===========

A number of command-line tools are readily available for working with FoLiA, to various ends.

All of these tools are written in Python, and thus require a Python 3 installation to run. More tools are added as time progresses.

Installation
------------

The FoLiA tools are published to the Python Package Index and can be installed effortlessly using pip. From the command line, type::

   $ pip install folia-tools

You may need to use pip3 to ensure you have the Python 3 version. Add sudo to install it globally on your system, but we strongly recommend you use virtualenv to make a self-contained Python environment.
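For example, a self-contained setup could look like this (a sketch using Python's built-in venv module; the environment name foliatools-env is just an illustration)::

   $ python3 -m venv foliatools-env
   $ source foliatools-env/bin/activate
   (foliatools-env) $ pip install folia-tools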

The FoLiA tools are also included in our `LaMachine distribution <https://proycon.github.io/lamachine>`_.

Installation Troubleshooting
----------------------------

If pip is not yet available, install it as follows:

On Debian/Ubuntu-based systems::

   $ sudo apt-get install python3-pip

On RedHat-based systems::

   $ yum install python3-pip

On Arch Linux systems::

   $ pacman -Syu python-pip

Usage
-----

To obtain help regarding the usage of any of the available FoLiA tools, pass the -h option on the command line to the tool you intend to use. This will provide a summary of the available options and usage examples. Most of the tools can run both on a single FoLiA document and on a whole directory of documents, optionally recursing into subdirectories. The tools generally take one or more file names or directory names as parameters.
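For instance (illustrative invocations; the options accepted by each tool are listed by its -h output)::

   $ foliavalidator -h
   $ folia2txt document.folia.xml
   $ foliavalidator /path/to/directory/with/documents/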

More about FoLiA?
-----------------

Please consult the FoLiA website at https://proycon.github.io/folia for more!

Specific Tools
--------------

This section contains some extra important information for a few of the included tools.

Validating FoLiA documents using foliavalidator
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The FoLiA validator is an essential tool for anybody working with FoLiA. It is very important that FoLiA documents are properly validated before they are published; this ensures that tools know what to expect when they get a FoLiA document as input for processing and are not confronted with any nasty surprises that are far too common in the field. The degree of formal validation offered by FoLiA is something that sets it apart from many alternative annotation formats. The key tool to perform validation is foliavalidator (or its alternative C++ implementation folialint, part of `FoLiA-utils <https://github.com/LanguageMachines/foliautils/>`_).

Validation can proceed on two levels:

  1. shallow validation - Validates the full FoLiA document: checks whether all elements are valid FoLiA elements and properly used, and whether the document structure is valid. Also checks whether all the proper annotation declarations are present and, when text is specified on multiple levels (text redundancy), whether the text is free of inconsistencies. Note that shallow validation already does way more than validation against the RelaxNG schema does.
  2. deep validation - Does all of the above, but in addition also checks the actual tagsets used: whether all declarations refer to valid set definitions, whether all used classes (aka tags/labels) are valid according to the declared set definitions, and whether the combination of certain classes is valid according to the set definition.

Note that validation against merely the RelaxNG schema could be called naive validation and is NOT considered sufficient FoLiA validation for most intents and purposes.

Shallow validation is invoked as::

   $ foliavalidator document.folia.xml

Deep validation is invoked as::

   $ foliavalidator --deep document.folia.xml

In addition to validating, the foliavalidator tool is capable of automatically fixing certain validation problems when explicitly asked to do so, such as automatically declaring missing annotations.
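For example, to let the validator automatically declare missing annotations, an invocation could look like this (the option name is an assumption for illustration; consult foliavalidator -h for the authoritative list of options)::

   $ foliavalidator --autodeclare document.folia.xml   # option name is an assumption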

Another feature of the validator is that it can act as a converter to convert FoLiA documents to `explicit form <https://folia.readthedocs.io/en/latest/form.html>`_ (using the --explicit parameter). Explicit form is a more verbose form of XML serialisation that is easier to parse for certain tools, as it makes explicit certain details that are left implicit in normal form.
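For example (a sketch, assuming the converted document is written to standard output; output handling may differ per version)::

   $ foliavalidator --explicit document.folia.xml > document.explicit.folia.xml   # redirection is an assumption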

TEI to FoLiA conversion
^^^^^^^^^^^^^^^^^^^^^^^^^^

The TEI P5 guidelines (`Text Encoding Initiative <https://tei-c.org/>`_) specify a widely used encoding method for machine-readable texts. TEI is primarily a format for capturing text structure and markup in great detail, but there are some facilities for linguistic annotation too. The sheer flexibility and complexity of TEI leads to many different TEI dialects, and consequently implementing support for all of TEI in a tool is an almost impossible task. FoLiA is more constrained than TEI with regard to structural and markup annotation, but places more focus on linguistic annotation.

The tei2folia tool performs conversion from a (sizable) subset of TEI to FoLiA, but provides no guarantee that all TEI P5 documents can be processed. Some notable things that are supported:

Specifically not supported (yet), non-exhaustive list:
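A basic invocation of tei2folia looks like this (illustrative; see tei2folia -h for the available options)::

   $ tei2folia document.tei.xml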

FoLiA to STAM
^^^^^^^^^^^^^^^^^^^^^^^^^^

`STAM <https://annotation.github.io/stam>`__ is a stand-off model for text annotation. It does not prescribe any vocabulary at all but allows one to reuse existing vocabularies. The folia2stam tool converts FoLiA documents to STAM, preserving the vocabulary that FoLiA predefines regarding annotation types, common attributes, etc.
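A basic invocation looks like this (illustrative; see folia2stam -h for the available output options)::

   $ folia2stam document.folia.xml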

Supported:

Not supported yet:

Vocabulary conversion:

Both FoLiA and STAM have the notion of a set or annotation dataset. In FoLiA the scope of such a set is to define the vocabulary used for a particular annotation type (e.g. a tagset). FoLiA itself already defines what annotation types exist. In STAM an annotation dataset is a broader notion and all vocabulary, even the notion of a word or sentence, comes from a set, as nothing is predefined at all aside from the STAM model's primitives.

We map most of the vocabulary of FoLiA itself to a STAM dataset with ID https://w3id.org/folia/v2/. All of FoLiA's annotation types, element types, and common attributes are defined in this set.

Each FoLiA set definition maps to a STAM dataset with the same set ID (URI). Within that set, a key named class corresponds to FoLiA's class attribute. Any FoLiA subsets (for features) also translate to key identifiers.

The declarations inside a FoLiA document will be explicitly expressed in STAM as well; each STAM dataset will have an annotation that points to it (with a DataSetSelector). This annotation has data with key declaration (set https://w3id.org/folia/v2/) that marks it as a declaration for a specific type; the value is something like pos-annotation and corresponds one-to-one to the declaration element used in FoLiA XML. Additionally, this annotation also has data with key annotationtype (same set as above), where the value corresponds to the annotation type (lowercased, e.g. pos).

The FoLiA to STAM conversion is RDF-ready. That is, all identifiers are valid IRIs and all FoLiA vocabulary (https://w3id.org/folia/v2/) is backed by a `formal ontology <https://github.com/proycon/folia/blob/master/schemas/folia.ttl>`_ using RDF and SKOS.

FoLiA set definitions, if defined, are already in SKOS (or in the legacy format).

Being RDF-ready means that the STAM model produced by folia2stam can in turn easily be exported to W3C Web Annotations. Tooling for that conversion will be provided in `STAM tools <https://github.com/annotation/stam-tools>`_.

FoLiA to Salt
^^^^^^^^^^^^^^^^^^^^^^^^^^

`Salt <https://corpus-tools.org/salt/>`_ is a graph-based annotation model that is designed to act as an intermediate format in the conversion between various annotation formats. It is used by the conversion tool `Pepper <https://corpus-tools.org/pepper/>`_. Our FoLiA to Salt converter, however, is a standalone tool that is part of these FoLiA tools, rather than integrated into Pepper. You can use folia2salt to convert FoLiA XML to Salt XML and subsequently use Pepper to do conversions to other formats such as TCF, PAULA, TigerXML, GrAF, Annis, etc. (there is no guarantee, though, that everything can be preserved accurately in each conversion).
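An illustrative invocation (see folia2salt -h for the actual options)::

   $ folia2salt document.folia.xml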

The current state of this conversion is summarised below; it is, however, not likely that this particular tool will be developed any further:

Our Salt conversion tries to preserve as much of the FoLiA as possible; we extensively use Salt's capacity for specifying namespaces to hold and group the annotation type and set of an annotation. SLabel elements with the same namespace should often be considered together.