proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.
http://proycon.github.io/folia/
GNU General Public License v3.0

Problem with text offset and Linebreak #52

Closed kosloot closed 4 years ago

kosloot commented 6 years ago

example:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="folia2html.xsl"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="WR-P-E-J-0000000001" generator="libfolia-v1.14" version="1.5">
  <text xml:id="WR-P-E-J-0000000001.text">
    <div>
      <head xml:id="sandbox.3">
        <t>De <br/><br/><br/><br/>FoLiA developers zijn:</t>
        <str xml:id="sandbox.3.str">
          <t offset="7">FoLiA</t>
        </str>
      </head>
    </div>
  </text>
</FoLiA>

C++'s libfolia accepts this: it counts every <br/> as 1 character, so the offset of "FoLiA" is 7 (3 characters for "De " plus 4 linebreaks).

Python's folia.py rejects this: it ignores all <br/> elements and requires an offset of 3.

I think libfolia is right here, but this is very tricky indeed.
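
To make the two readings concrete, here is a minimal sketch in plain Python. It does not use libfolia or folia.py; the `flatten` helper is hypothetical and only illustrates the two counting strategies described above, using the standard `xml.etree.ElementTree` parser on the `<t>` element from the example (without the FoLiA namespace, for brevity).

```python
import xml.etree.ElementTree as ET

# The <t> element from the example, without the FoLiA namespace.
SNIPPET = "<t>De <br/><br/><br/><br/>FoLiA developers zijn:</t>"


def flatten(t_element, br_as):
    """Concatenate the text of a <t> element (hypothetical helper).

    Every <br/> child is replaced by the string `br_as`:
    "\n" counts it as one character, "" ignores it.
    """
    parts = [t_element.text or ""]
    for child in t_element:
        if child.tag == "br":
            parts.append(br_as)
        # Text that follows the child element lives in its tail.
        parts.append(child.tail or "")
    return "".join(parts)


t = ET.fromstring(SNIPPET)

# Reading 1: each <br/> is one character (the behaviour reported for libfolia).
offset_counting_br = flatten(t, br_as="\n").index("FoLiA")

# Reading 2: <br/> is ignored (the behaviour reported for folia.py).
offset_ignoring_br = flatten(t, br_as="").index("FoLiA")

print(offset_counting_br, offset_ignoring_br)
```

Under these assumptions the first reading yields 7 and the second yields 3, which matches the disagreement reported above.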

proycon commented 6 years ago

I'd like to attach some related issues to this one, as I've seen them in practice:

<t>De
FoLiA developers zijn:</t>

So there is a newline in the XML, but not an explicit linebreak, meaning there is no newline as far as FoLiA is concerned. But it is still whitespace. So I think this is:

And not (I've seen this happen):

And what about?

<t>De\s\s\s\s\s\s\s
FoLiA developers zijn:</t>

I'm not entirely sure how we handle that currently; I'd say it's still offset 3.

I do agree there is a good argument to consider the offset to be 4 in your above case of an explicit linebreak.
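
One way to read "it's still offset 3" is that any run of whitespace inside the text, including a bare newline, collapses to a single space before offsets are computed. The sketch below illustrates that reading only. The `normalise` helper is hypothetical and is not taken from folia.py or libfolia, and the `\s` placeholders above are assumed to stand for ordinary spaces.

```python
import re


def normalise(text):
    """Collapse every run of whitespace into one space (hypothetical helper)."""
    return re.sub(r"\s+", " ", text)


# A bare newline between "De" and "FoLiA".
single_newline = "De\nFoLiA developers zijn:"

# Seven spaces followed by a newline, as in the second case.
padded = "De" + " " * 7 + "\nFoLiA developers zijn:"

offset_single = normalise(single_newline).index("FoLiA")
offset_padded = normalise(padded).index("FoLiA")

print(offset_single, offset_padded)
```

With this normalisation both cases give offset 3, since "De" is always followed by exactly one space.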

kosloot commented 6 years ago

Well, libfolia's folialint happily accepts this:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="folia2html.xsl"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="WR-P-E-J-0000000001" generator="libfolia-v1.14" version="1.5">
  <metadata type="native">
    <annotations/>
  </metadata>
  <text xml:id="WR-P-E-J-0000000001.text">
    <div>
      <head xml:id="sandbox.3">
        <t>De
FoLiA developers zijn:</t>
        <str xml:id="sandbox.3.str">
          <t offset="3">FoLiA</t>
        </str>
      </head>
    </div>
  </text>
</FoLiA>

Also with offset 3.

Considering:

And what about?

<t>De\s\s\s\s\s\s\s
FoLiA developers zijn:</t>

Regarding this (where there are 6 spaces after 'De'):

<?xml-stylesheet type="text/xsl" href="folia2html.xsl"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="WR-P-E-J-0000000001" generator="libfolia-v1.14" version="1.5">
  <metadata type="native">
    <annotations/>
  </metadata>
  <text xml:id="WR-P-E-J-0000000001.text">
    <div>
      <head xml:id="sandbox.3">
        <t>De      
FoLiA developers zijn:</t>
        <str xml:id="sandbox.3.str">
          <t offset="9">FoLiA</t>
        </str>
      </head>
    </div>
  </text>
</FoLiA>

folialint is really happy ...
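
An offset of 9 is what you get when every whitespace character is counted as it appears in the file: 2 for "De", 6 spaces, and 1 newline. The sketch below checks a given offset against both the raw text and a whitespace-collapsed version. The `offset_matches` helper is hypothetical, and this is only a guess at why the file validates, not a description of folialint's actual implementation.

```python
import re


def offset_matches(text, substring, offset):
    """Return True if `substring` starts at `offset` in `text` (hypothetical helper)."""
    return text[offset:offset + len(substring)] == substring


# "De", six spaces, a newline, then the rest of the text.
raw = "De" + " " * 6 + "\n" + "FoLiA developers zijn:"

# The same text with each whitespace run collapsed to one space.
collapsed = re.sub(r"\s+", " ", raw)

raw_accepts_9 = offset_matches(raw, "FoLiA", 9)
collapsed_accepts_9 = offset_matches(collapsed, "FoLiA", 9)
collapsed_accepts_3 = offset_matches(collapsed, "FoLiA", 3)

print(raw_accepts_9, collapsed_accepts_9, collapsed_accepts_3)
```

Note that the earlier single-newline example cannot tell the two readings apart: with exactly one whitespace character between "De" and "FoLiA", raw counting and collapsing both give offset 3.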

kosloot commented 4 years ago

So this seems to have been solved a long time ago.