proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.
http://proycon.github.io/folia/
GNU General Public License v3.0

Folia text validation on corrections #75

Open kosloot opened 4 years ago

kosloot commented 4 years ago

Given this FoLiA document, which contains a deletion correction:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="doc" generator="libfolia-v1.11" version="2.2">
  <metadata type="native">
    <annotations>
      <correction-annotation />
      <text-annotation />
      <sentence-annotation />
      <token-annotation />
    </annotations>
  </metadata>
  <text xml:id="bug">
    <s xml:id="s.1">
      <t>Dit is een test</t>
      <t class="out">Dit is test</t>
      <w xml:id="w.1">
        <t>Dit</t>
        <t class="out">Dit</t>
      </w>
      <w xml:id="w.2">
        <t>is</t>
        <t class="out">is</t>
      </w>
      <correction>
        <original>
          <w xml:id="w.3">
            <t>een</t>
          </w>
        </original>
      </correction>
      <w xml:id="w.4">
        <t>test</t>
        <t class="out">test</t>
      </w>
    </s>
  </text>
</FoLiA>

foliavalidator gives this result:

VALIDATION ERROR on full parse by library (stage 2/3), in tests/textbug-del2.xml
InconsistentText: Text for <Sentence at 140323334247928 id=s.1 set=None class=None>, is inconsistent: EXPECTED (after normalization) *****>
Dit is test
****> BUT FOUND (after normalization) ****>
Dit is een test
******* DEVIATION POINT: Dit is <*HERE*>een test

folialint also doesn't like this:

tests/textbug-del2.xml failed: inconsistent text: node s(s.1) has a mismatch for the text in set:current
the element text ='Dit is een test'
 the deeper text ='Dit is eentest'

So folialint chokes on the 'current' text class and foliavalidator on the 'none' class (probably 'current' too?), although the text seems to belong to 'out'.

Anyway, lots of trouble on both sides.

proycon commented 4 years ago

This is a complicated issue that we need to address properly, so I'll type it out in full, also to get a clear picture of it myself. I think we're close to one of FoLiA's boundaries here: in principle FoLiA does not support multiple tokenisations of the same text. A text has one tokenisation, and all further linguistic annotations that are based on tokens use the same ones. This is a deliberate limitation/simplification, as things get complicated and messy if there are multiple, often conflicting, tokenisations. (Other formats may do this by letting each linguistic annotation explicitly refer to character offsets of the original text, essentially letting each annotation layer define its own 'tokens', if I can put it like that.)

Having said that, there is the <correction> element in FoLiA that pushes this boundary, as you are allowed to correct tokens (<w>) themselves. It indeed allows you to alter the tokenisation (and the underlying text itself). It essentially says:

(Technical comment: in the foliapy library, this is what the correctionhandling parameter on the text() method does. It specifies which path to follow, allowing you to always reconstruct the original text if it's unambiguous.)
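To make the "which path to follow" idea concrete, here is a minimal sketch that uses only Python's standard xml.etree.ElementTree, not foliapy. The names `words`, `reconstruct` and `follow` are made up for illustration and are not part of any FoLiA library; text classes are ignored to keep it short.

```python
import xml.etree.ElementTree as ET

NS = "{http://ilk.uvt.nl/folia}"

# A small sentence with one deletion: w.3 only exists in the original.
SENTENCE = """
<s xmlns="http://ilk.uvt.nl/folia" xml:id="s.1">
  <w xml:id="w.1"><t>Dit</t></w>
  <w xml:id="w.2"><t>is</t></w>
  <correction>
    <original><w xml:id="w.3"><t>een</t></w></original>
    <new/>
  </correction>
  <w xml:id="w.4"><t>test</t></w>
</s>
"""


def words(element, follow):
    """Yield word texts in document order, entering only the chosen
    branch ("new" or "original") of every correction encountered."""
    for child in element:
        if child.tag == NS + "w":
            t = child.find(NS + "t")
            if t is not None and t.text:
                yield t.text
        elif child.tag == NS + "correction":
            branch = child.find(NS + follow)
            if branch is not None:
                yield from words(branch, follow)


def reconstruct(sentence_xml, follow="new"):
    """Rebuild a space-delimited text by following one correction path."""
    root = ET.fromstring(sentence_xml)
    return " ".join(words(root, follow))


print(reconstruct(SENTENCE, "new"))
print(reconstruct(SENTENCE, "original"))
```

Following the `new` branch skips the deleted word, while following `original` restores it. Nested corrections are handled by the recursion, but nothing here detects the ambiguous cases mentioned above.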

Second, we have text redundancy in FoLiA, i.e. it is possible to express text on multiple levels (e.g. sentence level and word level). If there is text on multiple levels, our text consistency rule applies, which is checked by our libraries (as we notice in this issue): it enforces that text at a higher level must always be consistent with the text at a deeper level. Text on a higher level than the token level is by definition untokenised.

A third feature of FoLiA is that we support multiple text layers. Instead of just a single text reading of a certain structural element, there may be multiple. Consider a text layer right after an OCR system, and one after normalisation. Multiple text layers are each identified by their own class (if the class is omitted, which it almost always is if only one text layer exists, the default class "current" is assigned). The fact that it's called current alludes to it being the most current reading (as opposed to something that has been corrected), so it still carries some special meaning (as one of the rare exceptions in FoLiA, since we never predefine any other classes).
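As a small illustration of the default class, the following sketch reads the text layers of a single word with Python's standard library. The helper `text_layers` is a hypothetical name, not a function from foliapy or libfolia.

```python
import xml.etree.ElementTree as ET

NS = "{http://ilk.uvt.nl/folia}"
DEFAULT_CLASS = "current"  # assigned when a <t> has no class attribute

WORD = """
<w xmlns="http://ilk.uvt.nl/folia" xml:id="w.1">
  <t>Dit</t>
  <t class="ocr">D!t</t>
</w>
"""


def text_layers(element):
    """Map each text class found directly on `element` to its text."""
    layers = {}
    for t in element.findall(NS + "t"):
        layers[t.get("class", DEFAULT_CLASS)] = t.text or ""
    return layers


layers = text_layers(ET.fromstring(WORD))
print(layers)
```

The unclassed `<t>` ends up under "current", next to the explicit "ocr" layer.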

In this issue, we see all three of these features combined, and a potential conflict lurks in the woods: we can express multiple text layers at the higher level (sentence), but we cannot express multiple tokenisations for each of those text layers, only one.

I understand your goal is to relate the two text levels, or the tokens therein, to each other.

When it comes to corrections, the "current" text class still triggers some special behaviour. When you reverse your example by using the "current" class instead of "out", and something like "in" for the original pre-corrected text, things already look better:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="doc" generator="libfolia-v1.11" version="2.2">
  <metadata type="native">
    <annotations>
      <correction-annotation />
      <text-annotation />
      <sentence-annotation />
      <token-annotation />
    </annotations>
  </metadata>
  <text xml:id="bug">
    <s xml:id="s.1">
      <t class="in">Dit is een test</t>
      <t>Dit is test</t>
      <w xml:id="w.1">
        <t class="in">Dit</t>
        <t>Dit</t>
      </w>
      <w xml:id="w.2">
        <t class="in">is</t>
        <t>is</t>
      </w>
      <correction>
        <original>
          <w xml:id="w.3">
            <t class="in">een</t>
          </w>
        </original>
        <new/> <!-- you didn't have this, not required as it was assumed but I'd rather make it explicit -->
      </correction>
      <w xml:id="w.4">
        <t class="in">test</t>
        <t>test</t>
      </w>
    </s>
  </text>
</FoLiA>

So, like you intended, in this situation we have a token w.3 that only exists in the original input (it doesn't reference the current text layer) and is deleted in the current text.

$ foliavalidator /nettmp/issue75rev.folia.xml
Validated successfully: /nettmp/issue75rev.folia.xml

foliavalidator accepts it, but not out of wisdom: it also accepts it if I change the text of w.3 for text class "in" to "geen", creating inconsistent text. The reason is that I simply don't do proper text validation when the situation gets overly complex (see https://github.com/proycon/foliapy/blob/master/folia/main.py#L1285). In this case things are still unambiguous, though; only when corrections get nested does it become truly impossible to recover the original text (as there is no longer a single one).

$ folialint /nettmp/issue75rev.folia.xml
/nettmp/issue75rev.folia.xml failed: inconsistent text: node s(s.1) has a mismatch for the text in set:in
the element text ='Dit is een test'
 the deeper text ='Dit is eentest'

folialint seems to stumble on a text delimiter issue: the space after w.2 seems to be forgotten because there's a correction in between. Technically that is a different bug from the one we are discussing, and it looks as if the consistency check would pass too if that were fixed.

But, I would say this example is indeed valid as none of our rules are broken. We have one tokenisation (without w.3), and the multiple text layers are consistent on all levels.
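To spell out what "consistent on all levels" means here, the sketch below checks each text layer of the reversed example separately, using only the standard library. It is a deliberate simplification and not how foliavalidator or folialint work: it visits every `<w>` in document order (including those inside `<original>`), skips words that carry no text for the layer in question, and joins the rest with single spaces. The names `layer_text` and `check_layers` are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "{http://ilk.uvt.nl/folia}"

SENTENCE = """
<s xmlns="http://ilk.uvt.nl/folia" xml:id="s.1">
  <t class="in">Dit is een test</t>
  <t>Dit is test</t>
  <w xml:id="w.1"><t class="in">Dit</t><t>Dit</t></w>
  <w xml:id="w.2"><t class="in">is</t><t>is</t></w>
  <correction>
    <original><w xml:id="w.3"><t class="in">een</t></w></original>
    <new/>
  </correction>
  <w xml:id="w.4"><t class="in">test</t><t>test</t></w>
</s>
"""


def layer_text(element, cls):
    """Return the text of `element` for text class `cls`, or None."""
    for t in element.findall(NS + "t"):
        if t.get("class", "current") == cls:
            return t.text
    return None


def check_layers(sentence):
    """For every text class on the sentence, report whether the sentence
    text equals the space-joined word texts of that same class."""
    classes = {t.get("class", "current") for t in sentence.findall(NS + "t")}
    report = {}
    for cls in classes:
        parts = [layer_text(w, cls) for w in sentence.iter(NS + "w")]
        deeper = " ".join(p for p in parts if p is not None)
        report[cls] = layer_text(sentence, cls) == deeper
    return report


report = check_layers(ET.fromstring(SENTENCE))
print(report)
```

Both layers come out consistent. Changing the "in" text of w.3 to "geen" makes the "in" layer fail in this sketch, which is the case foliavalidator currently lets through.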

I also worked out a more sensible example that might be closer to a real use case and should be valid. It essentially does the same: an OCR text with a deletion (a coffee stain, if you will) and normalised output:

<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="doc" generator="libfolia-v1.11" version="2.2">
  <metadata type="native">
    <annotations>
      <correction-annotation />
      <text-annotation />
      <sentence-annotation />
      <token-annotation />
    </annotations>
  </metadata>
 <text xml:id="issue75a">
    <s xml:id="s.1">
      <t>Dit is een test</t>
      <t class="ocr">D!t 1S een @#~ tezt</t>
      <w xml:id="w.1">
        <t>Dit</t>
        <t class="ocr">D!t</t>
      </w>
      <w xml:id="w.2">
        <t>is</t>
        <t class="ocr">1S</t>
      </w>
      <w xml:id="w.3">
        <t>een</t>
        <t class="ocr">een</t>
      </w>
      <correction>
        <original>
          <w xml:id="w.coffeestain">
            <t class="ocr">@#~</t>
          </w>
        </original>
        <new/>
      </correction>
      <w xml:id="w.4">
        <t>test</t>
        <t class="ocr">tezt</t>
      </w>
    </s>
  </text>
 </FoLiA>

So, is the problem solved? Not sure yet. Let's try an insertion as well:

<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="doc" generator="libfolia-v1.11" version="2.2">
  <metadata type="native">
    <annotations>
      <correction-annotation />
      <text-annotation />
      <sentence-annotation />
      <token-annotation />
    </annotations>
  </metadata>
 <text xml:id="issue75a">
    <s xml:id="s.1">
      <t>Dit is een test</t>
      <t class="ocr">D!t 1S tezt</t>
      <w xml:id="w.1">
        <t>Dit</t>
        <t class="ocr">D!t</t>
      </w>
      <w xml:id="w.2">
        <t>is</t>
        <t class="ocr">1S</t>
      </w>
      <correction>
        <original/>
        <new>
          <w xml:id="w.3">
            <t>een</t>
          </w>
        </new>
      </correction>
      <w xml:id="w.4">
        <t>test</t>
        <t class="ocr">tezt</t>
      </w>
    </s>
  </text>
 </FoLiA>

Both validators accept it, and I would say it's valid too. So that probably solves this issue, and your solution is indeed feasible with a small adaptation. Still, I want to add one last part to the discussion, as it popped up in recent discussions and is relevant for the whole picture:

Like <correction>, there is another element that addresses FoLiA's limitation of allowing only a single tokenisation, which is the string (<str>) element. The string element allows annotation on arbitrary parts of a text. Unlike tokens, strings are not a structural element but a higher-order annotation, so they may overlap and may describe any arbitrary substring of the original text, referencing it explicitly through character offsets. Since string annotations are not technically tokens (their order in the XML is also irrelevant, unlike that of tokens), they offer a way to describe arbitrary substrings and provide the necessary flexibility in situations where this is needed. (One of the reasons for their inception was actually TICCL.) I think, though, that in general, if you can use words/tokens instead of strings, that's always better, unless you really can't commit to a single tokenisation.
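The offset mechanism can be illustrated without any XML at all. The sketch below assumes zero-based character offsets into the text of the parent element for the same text class, and checks a few strings against a sentence "Dit is een test" with an OCR reading "D!t 1S tezt". The data layout and the helper `offset_matches` are made up for this illustration.

```python
# Sentence text per text class.
SENTENCE_TEXT = {
    "current": "Dit is een test",
    "ocr": "D!t 1S tezt",
}

# (text class, offset, string text); order is irrelevant and overlap is fine.
STRINGS = [
    ("current", 0, "Dit"), ("ocr", 0, "D!t"),
    ("current", 4, "is"), ("ocr", 4, "1S"),
    ("current", 11, "test"), ("ocr", 7, "tezt"),
    ("current", 7, "een"),   # no OCR counterpart
    ("current", 2, "t is"),  # overlaps with "Dit" and "is"
]


def offset_matches(layer_text, offset, substring):
    """True if `substring` occurs in `layer_text` exactly at `offset`."""
    return layer_text.startswith(substring, offset)


mismatches = [
    (cls, offset, text)
    for cls, offset, text in STRINGS
    if not offset_matches(SENTENCE_TEXT[cls], offset, text)
]
print(mismatches)
```

Because each string is validated against the untokenised text on its own, nothing prevents two strings from covering the same characters, which is exactly what tokens cannot do.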

If you really want to avoid the complexity of <correction>, I can still see valid use cases for using strings to describe untokenised substrings and encoding the relationship between text layers, as TICCL originally did/does. But it's semantically different: the end result will remain an untokenised document, and the 'corrections' are implicit rather than explicit:

    <s xml:id="s.1">
      <t>Dit is een test</t>
      <t class="ocr">D!t 1S tezt</t>
      <str xml:id="str.1">
        <t offset="0">Dit</t>
        <t offset="0" class="ocr">D!t</t>
      </str>
      <str xml:id="str.2">
        <t offset="4">is</t>
        <t offset="4" class="ocr">1S</t>
      </str>
      <str xml:id="str.4">
        <t offset="11">test</t>
        <t offset="7" class="ocr">tezt</t>
      </str>
      <str xml:id="str.3"> <!-- I'm deliberately messing with the ordering here to emphasise that it has no meaning with strings-->
        <t offset="7">een</t>
      </str>
      <!-- and below an extra string example to emphasise that strings are not tokens: this overlaps with str.1 and str.2 -->
      <str xml:id="str.bonus">
        <t offset="2">t is</t>
        <t offset="2" class="ocr">t 1S</t>
      </str>
    </s>

kosloot commented 4 years ago

Thanks for the long story. There is a lot to say about this subject. I think this is not the most convenient platform to discuss it all, but a few remarks:

proycon commented 4 years ago

The fact that "current" is a predefined class, unlike any other in FoLiA, has always been a bit inelegant indeed. Some mechanism to let the user determine the name of the most current class and/or the default class might indeed be a nice enhancement. The best place for that would be the declarations block, I think; it can probably be done with an XML attribute.

Still, even without that, you should be able to get by in the current situation by simply renaming the classes.

the usage of <str> within TICCL is in hindsight probably NOT a good plan.

Agreed, I think TICCL performs a role similar (but more complex) to a tokeniser and as such should produce tokens.

proycon commented 4 years ago

I'm doing some initial thinking on this, and I'd say adding "defaulttextclass" and "currenttextclass" attributes to the <text-annotation> declaration might accommodate the suggested enhancement:

<text-annotation set="..." defaulttextclass="current" currenttextclass="current" />

I'm deliberately splitting the default and current notions for extra flexibility. The above values would also be the defaults if the attributes are not set explicitly.
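A rough sketch of how a reader could resolve these two values, with "current" as the fallback for both. This only illustrates the proposal: the attributes are not part of FoLiA as discussed here, and `text_class_settings` is a hypothetical helper, not an existing library function.

```python
import xml.etree.ElementTree as ET

NS = "{http://ilk.uvt.nl/folia}"
FALLBACK = "current"

WITH_ATTRS = """
<annotations xmlns="http://ilk.uvt.nl/folia">
  <text-annotation defaulttextclass="ocr" currenttextclass="normalised" />
</annotations>
"""

WITHOUT_ATTRS = """
<annotations xmlns="http://ilk.uvt.nl/folia">
  <text-annotation />
</annotations>
"""


def text_class_settings(annotations_xml):
    """Read the proposed attributes from the <text-annotation> declaration,
    falling back to "current" when they (or the declaration) are absent."""
    root = ET.fromstring(annotations_xml)
    decl = root.find(NS + "text-annotation")
    attrs = decl.attrib if decl is not None else {}
    return {
        "default": attrs.get("defaulttextclass", FALLBACK),
        "current": attrs.get("currenttextclass", FALLBACK),
    }


print(text_class_settings(WITH_ATTRS))
print(text_class_settings(WITHOUT_ATTRS))
```

With the fallback in place, a document that never sets the attributes behaves as it does today.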

proycon commented 4 years ago

Or perhaps defaultclass and currentclass as it gets unnecessarily long otherwise.

kosloot commented 4 years ago

This sounds feasible. Having "current" as the default means that older documents are still valid (or weren't valid in the first place).

kosloot commented 4 years ago

And I assume that the default for defaultclass is currentclass (or vice versa).

proycon commented 4 years ago

yes

kosloot commented 1 year ago

Ok, so it seems good to implement the defaultclass="current" idea soon. I'm not convinced that we need a currentclass too. As a side note: having a way (a meta tag?) to indicate which set of text classes is present in the document might also be handy, as it would let a tool skip a document directly when a desired text class is not present, without parsing the whole document.
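For comparison, this is roughly what a tool has to do today to find out which text classes occur: stream through the entire document. It is a sketch using the standard library's iterparse, with a hypothetical helper name `text_classes`; the suggested meta tag would make this full scan unnecessary.

```python
import io
import xml.etree.ElementTree as ET

NS = "{http://ilk.uvt.nl/folia}"

DOC = (
    b'<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="doc" version="2.2">'
    b'<text xml:id="t"><s xml:id="s.1">'
    b'<t>Dit is een test</t><t class="ocr">D!t 1S tezt</t>'
    b'</s></text></FoLiA>'
)


def text_classes(source):
    """Collect all text classes in a document by streaming over it."""
    classes = set()
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == NS + "t":
            classes.add(elem.get("class", "current"))
        elem.clear()  # drop content that is no longer needed
    return classes


found = text_classes(io.BytesIO(DOC))
print(sorted(found))
```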