proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.
http://proycon.github.io/folia/
GNU General Public License v3.0
computational-linguistics corpus file-format folia language library linguistic-annotation-framework linguistics nlp python xml

FoLiA: Format for Linguistic Annotation

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

Documentation | Examples | Python Library | Python Library Documentation | C++ Library | Rust Library | FoLiA-Tools | FoLiA Utilities | FLAT: Web-based Annotation environment

by Maarten van Gompel, CLST/Radboud University Nijmegen & KNAW Humanities Cluster

https://proycon.github.io/folia

FoLiA is an XML-based annotation format, suitable for the representation of linguistically annotated language resources. FoLiA's intended use is as a format for storing and/or exchanging language resources, including corpora. Our aim is to introduce a single rich format that can accommodate a wide variety of linguistic annotation types through a single generalised paradigm. We do not commit to any label set, language or linguistic theory. This is always left to the developer of the language resource, and provides maximum flexibility.

XML is an inherently hierarchic format. FoLiA does justice to this by maximally utilising a hierarchic, inline setup. We inherit from the D-Coi format, which is described as loosely based on a minimal subset of TEI. Because FoLiA introduces a new and much broader paradigm, it is not backwards-compatible with D-Coi, i.e. validators for D-Coi will not accept FoLiA XML. It is, however, easy to convert FoLiA to less complex or less verbose formats such as D-Coi or plain text. Converters are provided.

The main characteristics of FoLiA are:

The FoLiA format mixes inline and stand-off annotation. Inline annotation is used for annotations pertaining to single tokens, whilst stand-off annotation in separate annotation layers is adopted for annotation types that span multiple tokens. This gives FoLiA the flexibility and extensibility needed to deal with various kinds of annotation.
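To make the distinction between inline and stand-off annotation concrete, here is a minimal sketch. It uses Python's standard `xml.etree.ElementTree` rather than the FoLiA Python library, and it parses a simplified fragment, not a complete, valid FoLiA document: the metadata and annotation declarations are omitted. The IDs, class labels (`N`, `V`, `per`), and the helper names `inline_pos` and `span_entities` are assumptions made up for this illustration. Part-of-speech tags sit inline inside each word, while the entity lives in a separate layer and points back at the words it spans.

```python
import xml.etree.ElementTree as ET

# Namespace used by FoLiA elements; xml:id lives in the standard XML namespace.
NS = {"f": "http://ilk.uvt.nl/folia"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Simplified, illustrative fragment (NOT a complete, valid FoLiA document:
# metadata and annotation declarations are left out; IDs and classes are made up).
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example">
  <text xml:id="example.text">
    <s xml:id="example.s.1">
      <w xml:id="example.s.1.w.1"><t>John</t><pos class="N"/></w>
      <w xml:id="example.s.1.w.2"><t>Smith</t><pos class="N"/></w>
      <w xml:id="example.s.1.w.3"><t>sleeps</t><pos class="V"/></w>
      <entities>
        <entity xml:id="example.s.1.entity.1" class="per">
          <wref id="example.s.1.w.1"/>
          <wref id="example.s.1.w.2"/>
        </entity>
      </entities>
    </s>
  </text>
</FoLiA>"""


def inline_pos(root):
    """Collect (token text, PoS class) pairs from inline annotation on each word."""
    pairs = []
    for word in root.iterfind(".//f:w", NS):
        text = word.find("f:t", NS).text
        pos_class = word.find("f:pos", NS).get("class")
        pairs.append((text, pos_class))
    return pairs


def span_entities(root):
    """Resolve stand-off entity annotations to the tokens they reference."""
    # Index every word by its xml:id so that wref pointers can be followed.
    words = {
        word.get(XML_ID): word.find("f:t", NS).text
        for word in root.iterfind(".//f:w", NS)
    }
    spans = []
    for entity in root.iterfind(".//f:entities/f:entity", NS):
        tokens = [words[ref.get("id")] for ref in entity.iterfind("f:wref", NS)]
        spans.append((entity.get("class"), tokens))
    return spans


root = ET.fromstring(SAMPLE)
print(inline_pos(root))
print(span_entities(root))
```

The entity layer never copies the token text. It only stores references, which is what lets a single annotation cover several tokens without disturbing the inline annotations on each word.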

Notable features are:

Paradigm Schema

Resources

A more extensive list of FoLiA-capable software is maintained on the FoLiA website.

Publications

See the FoLiA website for more publications and full text links.