proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library as well as the full documentation, validation schemas, and set definitions.
http://proycon.github.io/folia/
GNU General Public License v3.0

Is this valid FoLiA? #104

Open kosloot opened 2 years ago

kosloot commented 2 years ago

Is this FoLiA valid? Both folialint and foliavalidator reject it (on different grounds).

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns="http://ilk.uvt.nl/folia" xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="WR-P-E-J-0000000001" version="2.5" generator="libfolia-v0.4">
  <metadata>
    <annotations>
      <division-annotation annotator="ko" set="div"/>
      <sentence-annotation annotator="ko" set="sent"/>
      <text-annotation annotator="ko" set="aset"/>
      <text-annotation annotator="iemand" set="bset"/>
    </annotations>
  </metadata>
  <text xml:id="WR-P-E-J-0000000001.text">
    <div xml:id="WR-P-E-J-0000000001.div0.1" class="test">
      <t set="aset" class="a">Dit is test. zin 2.</t>
      <t set="bset" class="a">Dit is test. zin 3.</t>
      <s xml:id="WR-P-E-J-0000000001.head.1.s.1">
        <t set="aset" class="a">Dit is test.</t>
        <t set="bset" class="a">Dit is test.</t>
      </s>
      <s xml:id="WR-P-E-J-0000000001.head.1.s.2">
        <t set="aset" class="a">zin 2.</t>
        <t set="bset" class="a">zin 3.</t>
      </s>
    </div>
  </text>
</FoLiA>

folialint says:

failed: inconsistent text: conflicting text (class=a) from node: t() with value
'Dit is test. zin 3.'
 with parent: div(WR-P-E-J-0000000001.div0.1) which already has text in that class and value: 
'Dit is test. zin 2.'

foliavalidator says:

VALIDATION ERROR on full parse by library (stage 2/3), in folia-bug.xml
ParseError: FoLiA exception in handling of <s> @ line 15 (in parent <div> @ parent line 12) : [NameError] name 'cls' is not defined

When I replace the class for 'bset' with "b" in all three cases, there is no problem.

kosloot commented 2 years ago

The main problem is that libfolia (and presumably FoLiApy as well) has no provision for handling text with the same text class from different sets. As far as I know, the text-checking code never takes sets into account.

I have no objection to requiring all text in a document to stem from a single set, but I assume a better solution would be to amend the code and take the set names seriously.
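To make the difference concrete, here is a minimal sketch in Python using only the standard library. It is not taken from libfolia or FoLiApy; the helper name find_conflicts and its use_set flag are hypothetical, and the fallback text class "current" is assumed to be FoLiA's default. It checks the direct <t> children of the <div> from the example above, once keyed on the class alone and once keyed on the (set, class) pair.

```python
import xml.etree.ElementTree as ET

NS = "{http://ilk.uvt.nl/folia}"

# The two conflicting <t> elements from the <div> in the example document.
SAMPLE = """<div xmlns="http://ilk.uvt.nl/folia" class="test">
  <t set="aset" class="a">Dit is test. zin 2.</t>
  <t set="bset" class="a">Dit is test. zin 3.</t>
</div>"""


def find_conflicts(parent, use_set):
    """Return (key, first_text, other_text) for direct <t> children that
    share a key but carry different text.

    Hypothetical helper: with use_set=False the key is the class only,
    with use_set=True the key is the (set, class) pair.
    """
    seen = {}
    conflicts = []
    for t in parent.findall(NS + "t"):
        cls = t.get("class", "current")  # assumed default text class
        key = (t.get("set"), cls) if use_set else cls
        text = t.text or ""
        if key in seen and seen[key] != text:
            conflicts.append((key, seen[key], text))
        else:
            seen.setdefault(key, text)
    return conflicts


div = ET.fromstring(SAMPLE)
class_only = find_conflicts(div, use_set=False)  # one conflict on class "a"
set_aware = find_conflicts(div, use_set=True)    # no conflicts
print(class_only)
print(set_aware)
```

Keyed on the class alone, the second <t> collides with the first, which matches what folialint reports. Keyed on the pair, the two texts are treated as unrelated.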

kosloot commented 2 years ago

So we have some serious questions here:

  1. May there be more than one <text-annotation> in a document? I suppose YES (with different set names, of course).
  2. With more than one <text-annotation>, may the same class names occur in different sets? I suppose YES too.

But this has major ramifications for the current code, which really has NO support for more than one text-annotation. For example, functions like hastext() only look at the class name, and ALL checks for text consistency will fail when two sets are in play. This is true for libfolia, at least, but I have NO indication that FoLiApy behaves any better.
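As an illustration of what a set-aware lookup could look like, here is a rough Python sketch. The functions texts and has_text and their set_name parameter are hypothetical; this is not the signature of hastext() in libfolia or of anything in FoLiApy, and "current" is assumed to be the default text class.

```python
import xml.etree.ElementTree as ET

NS = "{http://ilk.uvt.nl/folia}"

# The second sentence from the example document.
SENTENCE = """<s xmlns="http://ilk.uvt.nl/folia">
  <t set="aset" class="a">zin 2.</t>
  <t set="bset" class="a">zin 3.</t>
</s>"""


def texts(element, cls="current", set_name=None):
    """Return the text of direct <t> children with the given class.

    Hypothetical helper: when set_name is given, only <t> elements from
    that set are considered; when it is None, the set is ignored.
    """
    return [
        t.text or ""
        for t in element.findall(NS + "t")
        if t.get("class", "current") == cls
        and (set_name is None or t.get("set") == set_name)
    ]


def has_text(element, cls="current", set_name=None):
    """True if the element has at least one matching direct <t> child."""
    return bool(texts(element, cls, set_name))


s = ET.fromstring(SENTENCE)
ambiguous = texts(s, "a")          # class only: two candidate texts
aset_only = texts(s, "a", "aset")  # class plus set: exactly one text
print(ambiguous)
print(aset_only)
print(has_text(s, "a", "cset"))
```

With the class alone, the lookup cannot tell which of the two texts is meant. Passing the set name as well narrows it to a single text.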

kosloot commented 2 years ago

For now, code has been added to libfolia to prevent these problems: only one text_annotation may be declared. Relaxing this limitation remains a wish.
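The restriction can be sketched as a check on the declarations block. This is an illustrative Python version and not the code that was added to libfolia; the function names and the error message are made up for the example.

```python
import xml.etree.ElementTree as ET

NS = "{http://ilk.uvt.nl/folia}"

# The <annotations> block from the example document.
DECLARATIONS = """<annotations xmlns="http://ilk.uvt.nl/folia">
  <division-annotation annotator="ko" set="div"/>
  <sentence-annotation annotator="ko" set="sent"/>
  <text-annotation annotator="ko" set="aset"/>
  <text-annotation annotator="iemand" set="bset"/>
</annotations>"""


def text_annotation_sets(annotations):
    """Return the set names of all <text-annotation> declarations."""
    return [d.get("set") for d in annotations.findall(NS + "text-annotation")]


def check_single_text_annotation(annotations):
    """Raise ValueError when more than one <text-annotation> is declared.

    Hypothetical check; the wording of the message is illustrative only.
    """
    sets = text_annotation_sets(annotations)
    if len(sets) > 1:
        raise ValueError(
            "expected at most one text-annotation declaration, "
            f"found {len(sets)}: {', '.join(sets)}"
        )
    return sets


annotations = ET.fromstring(DECLARATIONS)
try:
    check_single_text_annotation(annotations)
    message = None
except ValueError as err:
    message = str(err)
print(message)
```

Under this check the example document is rejected at declaration time, before any text consistency check runs.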