An extensive Python library for working with FoLiA (Format for Linguistic Annotation) documents, a rich XML-based format for linguistic annotation used in Natural Language Processing (NLP). This library was formerly part of PyNLPl.
<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="doc" generator="libfolia-v1.11" version="2.2">
<metadata type="native">
<annotations>
<correction-annotation />
<text-annotation />
<sentence-annotation />
<string-annotation />
</annotations>
</metadata>
<text xml:id="bug">
<s xml:id="s.1">
<t>Dit is een test</t>
<t class="ocr">D!t 1S tezt</t>
<str xml:id="str.1">
<t offset="0">Dit</t>
<t offset="0" class="ocr">D!t</t>
</str>
<str xml:id="str.2">
<t offset="4">is</t>
<t offset="4" class="ocr">1S</t>
</str>
<str xml:id="str.4">
<t offset="11">test</t>
<t offset="7" class="ocr">tezt</t>
</str>
<str xml:id="str.3"> <!-- I'm deliberately messing with the ordering here to emphasise that it has no meaning with strings-->
<t offset="7">een</t>
</str>
<!-- and below an extra string example to emphasise that strings are not tokens: this overlaps with str.1 and str.2 -->
<str xml:id="str.bonus">
<t offset="3">t is</t>
<t offset="3" class="ocr">t 1S</t>
</str>
</s>
</text>
</FoLiA>
folialint tests/textproblem_3.xml
tests/textproblem_3.xml failed: Unresolvable text: Text for str(ID=str.bonus, textclass='current'), has incorrect offset 3
original msg=Unresolvable text: Reference (ID s.1,class='current') found, but no text match at offset=3 Expected 't is' but got ' is '
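This rejection is the desired behaviour: in 'Dit is een test' the substring 't is' starts at offset 2, so the offset declared for str.bonus should be 2 rather than 3.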
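The mismatch reported by folialint can be reproduced with plain string slicing. The sketch below is only an illustration of the offset arithmetic, not the check that folialint or foliavalidator actually implements; the names `REFERENCE`, `DECLARED`, `text_at` and `check` are hypothetical, and the data is copied by hand from the default-class `<t>` elements of the example document.

```python
# Illustrative sketch only: NOT the folialint/foliavalidator implementation.
# All names here are hypothetical; the data is copied from the example document.

REFERENCE = "Dit is een test"  # text of sentence s.1 (default text class)

# (string ID, string text, declared offset)
DECLARED = [
    ("str.1", "Dit", 0),
    ("str.2", "is", 4),
    ("str.4", "test", 11),
    ("str.3", "een", 7),
    ("str.bonus", "t is", 3),  # the suspicious offset
]


def text_at(reference, offset, length):
    """Return the slice of `reference` that a string of `length` would cover at `offset`."""
    return reference[offset:offset + length]


def check(reference, declared):
    """Map each string ID to (offset_is_correct, text_actually_found_at_offset)."""
    report = {}
    for string_id, text, offset in declared:
        found = text_at(reference, offset, len(text))
        report[string_id] = (found == text, found)
    return report


report = check(REFERENCE, DECLARED)
for string_id, (ok, found) in report.items():
    status = "ok" if ok else "MISMATCH"
    print(f"{string_id}: {status} (text at declared offset: {found!r})")

# Where the text of str.bonus really starts in the reference text
print("actual offset of 't is':", REFERENCE.find("t is"))
```

Under these assumptions only str.bonus is flagged: the text found at its declared offset is ' is ', which matches the "Expected 't is' but got ' is '" part of the folialint message.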
An example given in https://github.com/proycon/folia/issues/75 appears to contain an offset error that goes undetected by foliavalidator. The full example is the FoLiA document shown above.
foliavalidator accepts it without complaint.
folialint rejects it, with the error shown above.