proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.
GNU General Public License v3.0

Is this valid FoLiA? #104

Open kosloot opened 2 years ago

kosloot commented 2 years ago

Is this valid FoLiA? Both folialint and foliavalidator reject it, each on different grounds:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns="http://ilk.uvt.nl/folia" xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="WR-P-E-J-0000000001" version="2.5" generator="libfolia-v0.4">
  <metadata type="native">
    <annotations>
      <division-annotation annotator="ko" set="div"/>
      <sentence-annotation annotator="ko" set="sent"/>
      <text-annotation annotator="ko" set="aset"/>
      <text-annotation annotator="iemand" set="bset"/>
    </annotations>
  </metadata>
  <text xml:id="WR-P-E-J-0000000001.text">
    <div xml:id="WR-P-E-J-0000000001.div0.1" class="test">
      <t set="aset" class="a">Dit is test. zin 2.</t>
      <t set="bset" class="a">Dit is test. zin 3.</t>
      <s xml:id="WR-P-E-J-0000000001.head.1.s.1">
        <t set="aset" class="a">Dit is test.</t>
        <t set="bset" class="a">Dit is test.</t>
      </s>
      <s xml:id="WR-P-E-J-0000000001.head.1.s.2">
        <t set="aset" class="a">zin 2.</t>
        <t set="bset" class="a">zin 3.</t>
      </s>
    </div>
  </text>
</FoLiA>

folialint says:

failed: inconsistent text: conflicting text (class=a) from node: t() with value
'Dit is test. zin 3.'
 with parent: div(WR-P-E-J-0000000001.div0.1) which already has text in that class and value: 
'Dit is test. zin 2.'


foliavalidator says:

VALIDATION ERROR on full parse by library (stage 2/3), in folia-bug.xml
ParseError: FoLiA exception in handling of <s> @ line 15 (in parent <div> @ parent line 12) : [NameError] name 'cls' is not defined

When I change the class of the 'bset' text to "b" in all three cases, the problem disappears.

kosloot commented 2 years ago

The main problem is that libfolia (and presumably FoLiApy as well) has no provision for handling text with the same text class from different sets. As far as I know, the text-checking code NEVER takes sets into account.

I have no objection to requiring all text in a document to stem from a single set, but I assume that a better solution would be to amend the code and take the set names seriously.
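To illustrate the difference, here is a minimal sketch using only Python's standard library. It is not libfolia or FoLiApy code: `find_conflicts` is a hypothetical helper, and the fragment is a namespace-free simplification of the `<div>` from the report. It compares a consistency check keyed on the class alone with one keyed on the (set, class) pair.

```python
import xml.etree.ElementTree as ET

# Namespace-free simplification of the <div> from the report (assumption:
# the real document uses the FoLiA namespace and nested sentences).
FRAGMENT = """
<div>
  <t set="aset" class="a">Dit is test. zin 2.</t>
  <t set="bset" class="a">Dit is test. zin 3.</t>
</div>
""".strip()


def find_conflicts(parent, use_set):
    """Return the keys under which direct <t> children carry differing text.

    Hypothetical helper for illustration; not part of libfolia or FoLiApy.
    """
    seen = {}
    conflicts = []
    for t in parent.findall("t"):
        cls = t.get("class", "current")  # "current" is FoLiA's default text class
        key = (t.get("set"), cls) if use_set else cls
        text = t.text or ""
        if key in seen and seen[key] != text:
            conflicts.append(key)
        seen.setdefault(key, text)
    return conflicts


div = ET.fromstring(FRAGMENT)
class_only = find_conflicts(div, use_set=False)
set_aware = find_conflicts(div, use_set=True)
print(class_only)  # ['a']
print(set_aware)   # []
```

Keyed on the class alone, the two `<t>` elements collide, which mirrors the "conflicting text (class=a)" message above; keyed on the set as well, they are simply two independent texts.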

kosloot commented 2 years ago

So we have some serious questions here:

  1. May there be more than one <text-annotation> in a document? I suppose YES (with different set names, of course).
  2. Given more than one <text-annotation>, may the same class names occur in different sets? I suppose YES too.

But this has great ramifications for the current code, which really has NO support for more than one text-annotation. For example, functions like hastext() look only at the class name, and ALL checks for text consistency will fail when two sets are in play. This is true for libfolia, at least, but I have NO indication that FoLiApy behaves any better.
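As a rough sketch of what a set-aware lookup could look like, the snippet below uses plain `xml.etree.ElementTree` on a namespace-free fragment. Both `has_text` and `classes_shared_across_sets` are hypothetical names chosen for illustration; they are not the hastext() of libfolia or any FoLiApy API.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Namespace-free simplification of the second sentence from the report.
SENTENCE = (
    '<s>'
    '<t set="aset" class="a">zin 2.</t>'
    '<t set="bset" class="a">zin 3.</t>'
    '</s>'
)


def has_text(element, cls, textset=None):
    """Hypothetical lookup: with textset=None it looks at the class only,
    otherwise it also requires the <t> to belong to the given set."""
    for t in element.findall("t"):
        if t.get("class", "current") != cls:
            continue
        if textset is None or t.get("set") == textset:
            return True
    return False


def classes_shared_across_sets(root):
    """Map each text class used in more than one set to those sets."""
    sets_by_class = defaultdict(set)
    for t in root.iter("t"):
        sets_by_class[t.get("class", "current")].add(t.get("set", ""))
    return {c: sorted(s) for c, s in sets_by_class.items() if len(s) > 1}


s = ET.fromstring(SENTENCE)
print(has_text(s, "a"))               # class-only lookup
print(has_text(s, "a", "bset"))       # set-aware lookup, set present
print(has_text(s, "a", "cset"))       # set-aware lookup, set absent
print(classes_shared_across_sets(s))  # classes a class-only lookup cannot tell apart
```

The second helper lists exactly the classes for which a class-only lookup is ambiguous, which is the situation described in question 2.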

kosloot commented 2 years ago

For now, code has been added to libfolia to prevent these problems: only one <text-annotation> may be declared. Relaxing this limitation remains a wish.
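A guard of this kind could be sketched as follows. This is an illustration in Python with the standard library, not the code that was added to libfolia; `declared_text_sets` and `require_single_text_set` are hypothetical names, and the fragment omits the FoLiA namespace for brevity.

```python
import xml.etree.ElementTree as ET

# Namespace-free simplification of the declarations from the report.
ANNOTATIONS = """
<annotations>
  <division-annotation annotator="ko" set="div"/>
  <sentence-annotation annotator="ko" set="sent"/>
  <text-annotation annotator="ko" set="aset"/>
  <text-annotation annotator="iemand" set="bset"/>
</annotations>
""".strip()


def declared_text_sets(annotations):
    """Return the set names of all <text-annotation> declarations, in order."""
    return [d.get("set") for d in annotations.findall("text-annotation")]


def require_single_text_set(annotations):
    """Raise ValueError when more than one <text-annotation> is declared."""
    sets = declared_text_sets(annotations)
    if len(sets) > 1:
        raise ValueError(
            "only one text-annotation may be declared, found: " + ", ".join(sets)
        )
    return sets


annotations = ET.fromstring(ANNOTATIONS)
try:
    require_single_text_set(annotations)
except ValueError as err:
    print("rejected:", err)
```

Rejecting the document at declaration time keeps the class-only text checks sound, at the cost of disallowing documents such as the one in this report.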