proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.
http://proycon.github.io/folia/
GNU General Public License v3.0

extracting text from corrections. What are the semantics? #98

Closed kosloot closed 3 years ago

kosloot commented 3 years ago

consider this (rather silly) example: corr_1.txt

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns="http://ilk.uvt.nl/folia" xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="test" version="2.5" generator="diy">
  <metadata type="native">
    <annotations>
      <text-annotation/>
      <sentence-annotation/>
      <paragraph-annotation/>
      <correction-annotation/>
      <token-annotation annotator="ucto" annotatortype="auto" datetime="2017-09-25T10:29:52" set="tokconfig-nld"/>
      <style-annotation />
    </annotations>
  </metadata>
  <text xml:id="example.text">
    <p xml:id="example.p.1">
      <s xml:id="example.p.1.s.1">
        <t>het creëren van</t>
        <t class="old">het creeren van</t>
        <t class="older">het CREEREN van</t>
        <w xml:id="example.p.1.s.1.w.1" class="WORD">
          <t offset="0">het</t>
          <t class="old" offset="0">het</t>
          <t class="older" offset="0">het</t>
        </w>
        <w xml:id="example.p.1.s.1.w.2" class="WORD">
          <correction xml:id="cor.1">
           <new>
              <t offset="4">creëren</t>
           </new>
           <original auth="no">
             <correction xml:id="corr.2">
               <new>
                 <t class="old" offset="4">creeren</t>
               </new>
               <original auth="no">
                 <t class="older" offset="4">CREEREN</t>
               </original>
             </correction>
           </original>
          </correction>
        </w>
        <w xml:id="example.p.1.s.1.w.4" class="WORD">
          <t offset="12">van</t>
          <t class="old" offset="12">van</t>
          <t class="older" offset="12">van</t>
        </w>
      </s>
    </p>
  </text>
</FoLiA>

So this is a corrected correction, where all <t> members got a class assigned, the most recent one being "current". I agree this is odd, but it is valid FoLiA according to both folialint and foliavalidator.

Now let's see what FoLiA-2text and folia2txt make of this:

First with class=current:

FoLiA-2text corr_1.xml
Processed :corr_1.xml into corr_1.xml.txt still 0 files to go.
more corr_1.xml.txt
het creëren van

folia2txt corr_1.xml
Converting corr_1.xml
het creëren van

GOOD! Now with class=old:

FoLiA-2text -c old corr_1.xml
Processed :corr_1.xml into corr_1.xml.txt still 0 files to go.
more corr_1.xml.txt
het creeren van

folia2txt -c old corr_1.xml
Converting corr_1.xml
het van

HMM, BAD??

And with class=older:

FoLiA-2text -c older corr_1.xml
Processed :corr_1.xml into corr_1.xml.txt still 0 files to go.
more corr_1.xml.txt
het CREEREN van

folia2txt -c older corr_1.xml
Converting corr_1.xml
het van

Also BAD?

What would be the desired outcome? The FoLiA-2text outcome has the least surprisal. But is it correct?

proycon commented 3 years ago

The text class parameter is not sufficient information to be able to extract the proper correction, we really need correction handling for that as discussed in LanguageMachines/foliautils#60. folia2txt doesn't implement a direct interface to this yet, the current way to handle it would be a two-step process, first modifying the document with foliacorrect --corrected or foliacorrect --original (which effectively resolves/removes all correction elements), and then doing folia2txt.

But even with correction handling there is an extra challenge in this example because of the nested correction (which I was aware of already but never implemented a solution for). The correction handling allows for retrieving either the deepest original solution (CREEREN/older) or the most current one (creëren/current); it does not currently accommodate finding an intermediate original (creeren/old).
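The limitation can be illustrated with a minimal sketch that uses only Python's standard library rather than the FoLiA library itself. The function name `resolve` and the handling strings "current" and "original" are hypothetical, and text classes are deliberately ignored: the sketch simply takes the same branch of every correction it meets.

```python
import xml.etree.ElementTree as ET

NS = "{http://ilk.uvt.nl/folia}"

# The nested correction from the example document, reduced to one word.
WORD = """
<w xmlns="http://ilk.uvt.nl/folia">
  <correction>
    <new><t offset="4">creëren</t></new>
    <original auth="no">
      <correction>
        <new><t class="old" offset="4">creeren</t></new>
        <original auth="no"><t class="older" offset="4">CREEREN</t></original>
      </correction>
    </original>
  </correction>
</w>
"""

def resolve(element, handling):
    """Return the text reached by taking the same branch of every correction.

    handling is "current" (follow <new>/<current>) or "original"
    (follow <original>). Returns None when no text is found.
    """
    for child in element:
        if child.tag == NS + "t":
            return child.text
        if child.tag == NS + "correction":
            names = ("new", "current") if handling == "current" else ("original",)
            for name in names:
                branch = child.find(NS + name)
                if branch is not None:
                    return resolve(branch, handling)
            return None
    return None

word = ET.fromstring(WORD)
print(resolve(word, "current"))   # creëren
print(resolve(word, "original"))  # CREEREN
```

Neither setting can reach the intermediate "creeren", because that text sits in a <new> branch that is only reachable through an <original> branch.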

kosloot commented 3 years ago

But even with correction handling there is an extra challenge in this example

I was fully aware of that :)

I even wonder if this is totally resolvable with 'CorrectionHandling'. In the libfolia implementation, I use 'CorrectionHandling' at most as a hint for searching. The primary search condition is the 'class'. I search in <New> and <Current> first, and in <Original> as a last resort. This works (in this example at least), but is a bit flaky I think.
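A class-first search of this kind might look roughly like the sketch below. It is not the libfolia code: it is a standard-library Python illustration, and `find_text` is a hypothetical name. A <t> without a class attribute is treated as class "current".

```python
import xml.etree.ElementTree as ET

NS = "{http://ilk.uvt.nl/folia}"

# The nested correction from the example document, reduced to one word.
WORD = """
<w xmlns="http://ilk.uvt.nl/folia">
  <correction>
    <new><t offset="4">creëren</t></new>
    <original auth="no">
      <correction>
        <new><t class="old" offset="4">creeren</t></new>
        <original auth="no"><t class="older" offset="4">CREEREN</t></original>
      </correction>
    </original>
  </correction>
</w>
"""

def find_text(element, textclass="current"):
    """Return the first <t> of the requested class, or None.

    Direct <t> children are checked first. Corrections are then searched
    in the order <new>, <current>, <original>, so <original> is only a
    last resort.
    """
    for t in element.findall(NS + "t"):
        if t.get("class", "current") == textclass:
            return t.text
    for correction in element.findall(NS + "correction"):
        for name in ("new", "current", "original"):
            for branch in correction.findall(NS + name):
                found = find_text(branch, textclass)
                if found is not None:
                    return found
    return None

word = ET.fromstring(WORD)
for cls in ("current", "old", "older"):
    print(cls, find_text(word, cls))
```

On this example the sketch returns creëren, creeren and CREEREN for the three classes, matching the FoLiA-2text output shown earlier. It depends entirely on every <t> carrying a distinguishing class, which is why it can be called flaky.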

CorrectionHandling.ORIGINAL should probably search in <Original> nodes first, but what if nothing is found? Search in the other nodes? Or return an empty result? Really tricky.

martinreynaert commented 3 years ago

Returning an empty result would never be a good idea, I think.

I would not actually want to run TICCL twice over the same FoLiA, for a new correction round I would restart from the original FoLiA and give it another version number.

But I can well envisage automatically corrected text being further manually (re)corrected and then this problem would probably be at play.

proycon commented 3 years ago

CorrectionHandling.CURRENT should only search in current/new nodes and CorrectionHandling.ORIGINAL only in original nodes. There is a CorrectionHandling.EITHER in case you don't care which and just want something back, but the usefulness of that option is debatable, as what happens is rather undefined behaviour.

Situations in which you get 'nothing' back are meaningful in the sense that they usually imply a deletion.
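To make the deletion case concrete, here is a small standard-library Python sketch. It assumes a deletion is encoded as a correction with an empty <new/> and the removed word inside <original>; the sentence content and the function name `words` are made up for illustration.

```python
import xml.etree.ElementTree as ET

NS = "{http://ilk.uvt.nl/folia}"

# Hypothetical sentence in which a duplicated word "het" has been deleted.
SENTENCE = """
<s xmlns="http://ilk.uvt.nl/folia">
  <w><t>het</t></w>
  <correction>
    <new/>
    <original auth="no"><w><t>het</t></w></original>
  </correction>
  <w><t>huis</t></w>
</s>
"""

def words(element, handling):
    """Collect word texts, following only <new>/<current> or only <original>."""
    names = ("new", "current") if handling == "current" else ("original",)
    result = []
    for child in element:
        if child.tag == NS + "w":
            result.append(child.findtext(NS + "t"))
        elif child.tag == NS + "correction":
            for name in names:
                for branch in child.findall(NS + name):
                    result.extend(words(branch, handling))
    return result

sentence = ET.fromstring(SENTENCE)
print(words(sentence, "current"))   # ['het', 'huis']
print(words(sentence, "original"))  # ['het', 'het', 'huis']
```

With "current" handling the correction contributes nothing, and that empty result is exactly the information that the word was deleted. Falling back to <original> here would silently undo the deletion.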

martinreynaert commented 3 years ago

Ok, I did not think of deletions. You're quite right there.

kosloot commented 3 years ago

CorrectionHandling.CURRENT should only search in current/new nodes and CorrectionHandling.ORIGINAL only in original nodes.

But you will end up in trouble in the above example anyway. I don't see an escape yet

proycon commented 3 years ago

A possible solution would be implementing a maximum depth parameter for correction handling, but I don't have any use cases for this so I wouldn't bother implementing this for now.
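Purely as an illustration of what such a parameter could mean (nothing like this is implemented in the tools discussed here), the sketch below follows <original> for at most `max_depth` nested corrections and <new>/<current> after that. The names `resolve` and `max_depth` are hypothetical, and only Python's standard library is used.

```python
import xml.etree.ElementTree as ET

NS = "{http://ilk.uvt.nl/folia}"

# The nested correction from the example document, reduced to one word.
WORD = """
<w xmlns="http://ilk.uvt.nl/folia">
  <correction>
    <new><t offset="4">creëren</t></new>
    <original auth="no">
      <correction>
        <new><t class="old" offset="4">creeren</t></new>
        <original auth="no"><t class="older" offset="4">CREEREN</t></original>
      </correction>
    </original>
  </correction>
</w>
"""

def resolve(element, max_depth):
    """Undo at most max_depth levels of correction and return the text.

    While max_depth > 0 the <original> branch is followed and the budget
    is decremented; once it reaches 0 the <new> (or <current>) branch is
    followed. Returns None if the chosen branch is missing or has no text.
    """
    for child in element:
        if child.tag == NS + "t":
            return child.text
        if child.tag == NS + "correction":
            if max_depth > 0:
                branch = child.find(NS + "original")
                remaining = max_depth - 1
            else:
                branch = child.find(NS + "new")
                if branch is None:
                    branch = child.find(NS + "current")
                remaining = 0
            if branch is None:
                return None
            return resolve(branch, remaining)
    return None

word = ET.fromstring(WORD)
for depth in (0, 1, 2):
    print(depth, resolve(word, depth))
```

On the example, depth 0 gives creëren, depth 1 gives the intermediate creeren, and depth 2 gives CREEREN, so the intermediate original becomes reachable without relying on text classes.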

kosloot commented 3 years ago

Well: ucto is able to split words (like separating !., and ?). This can be applied to the outcome of FoLiA-correct. So there you go. A 2-level correction.

Also: doesn't FLAT allow for correcting corrections?

martinreynaert commented 3 years ago

As we originally envisaged it, UCTO would indeed be the next logical step after TICCL-correcting a book, say, in order to move away from 'str' to e.g. 's'.

And as soon as I have TICCLed Nicoline's 17th century newspapers, I indeed want to ask her to set some of her volunteers at work manually correcting the corrections. And for that, I do have FLAT in mind.

proycon commented 3 years ago

Also: doesn't FLAT allow for correcting corrections?

Yep, it does. It can also display the nesting. And it can visualize either the current/new text or the original one.

pirolen commented 3 years ago

(Off: @proycon when do you think FLAT will be capable of ingesting FoLiA 2.5?)

proycon commented 3 years ago

The latest development version should be able to handle FoLiA v2.5 already, right? But a release is indeed due (there's still some other work stuck in the pipeline awaiting completion though).

proycon commented 3 years ago

I'm closing this one as it was more a question thread which has been settled, and the derived issues LanguageMachines/foliautils#60 and proycon/foliatools#40 are implemented now.