proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.
http://proycon.github.io/folia/
GNU General Public License v3.0

Is this folia valid? And how to handle it... #69

Open kosloot opened 5 years ago

kosloot commented 5 years ago

Given this FoLiA example:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="apart" generator="libfolia-v0.11" version="1.5">
  <metadata type="native">
    <annotations>
    </annotations>
  </metadata>
  <text xml:id="text">
    <div xml:id="text.div.1">
      <head xml:id="head">
        <t>Kop</t>
      </head>
      <p xml:id="text.div.1.p.1">
        <t>OK?</t>
      </p>
      <s xml:id="text.div.1.s.1">
        <t>Inleiding</t>
        <w xml:id="text.div.1.s.1.w.1">
          <t>Inleiding</t>
        </w>
      </s>
    </div>
  </text>
</FoLiA>

it is accepted by the validator and also by folialint. There may be an issue here: the <div> contains both a Sentence and a Paragraph. This is valid FoLiA, but it may go against the 'gut feeling' that in this case the Sentence should be embedded in a Paragraph. Should this feeling be formalized (and how)? If not, this has ramifications for a lot of FoLiA-based software, like Ucto, Frog, TICCL etc., that assumes either sentences or paragraphs with sentences.

proycon commented 5 years ago

Yes, it is indeed valid. I agree it would be nicer if it were homogeneous, but I don't think we can or should enforce that in FoLiA itself. If tools like ucto/frog impose such extra constraints, then that's fair enough, I'd say.

kosloot commented 5 years ago

Hmmm. Imposing such implicit semantics might hit us hard in the future. But making homogeneity a strict prerequisite may be hard to implement (can a DTD express this at all?). Maybe it would be good to document this as 'good behavior' for FoLiA users, or to have the validator point it out. I will consider adding warnings (or even fatal errors) to the tools under my control.
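
For illustration, a tool-side warning of this kind could be sketched roughly as follows. This is only a minimal sketch using Python's standard `xml.etree.ElementTree`, not how the validator, folialint, or any existing FoLiA tool works; the function name `find_mixed_containers` and the shortened sample document are made up for this example, and it only looks at direct `<p>` and `<s>` children of each element.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"
# ElementTree exposes the xml:id attribute under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def find_mixed_containers(xml_text):
    """Return the xml:id (or tag, if no id) of every element that has
    both <p> and <s> elements as direct children.

    Hypothetical helper for illustration; not part of any FoLiA library.
    """
    root = ET.fromstring(xml_text)
    p_tag = f"{{{FOLIA_NS}}}p"
    s_tag = f"{{{FOLIA_NS}}}s"
    mixed = []
    for elem in root.iter():
        # Collect the tags of the direct children only.
        child_tags = {child.tag for child in elem}
        if p_tag in child_tags and s_tag in child_tags:
            mixed.append(elem.get(XML_ID, elem.tag))
    return mixed


# Shortened version of the example from this issue (assumed for the sketch).
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="apart" version="1.5">
  <text xml:id="text">
    <div xml:id="text.div.1">
      <p xml:id="text.div.1.p.1"><t>OK?</t></p>
      <s xml:id="text.div.1.s.1"><t>Inleiding</t></s>
    </div>
  </text>
</FoLiA>"""

for ident in find_mixed_containers(SAMPLE):
    print(f"warning: {ident} mixes <p> and <s> children")
```

Whether such a finding should be a warning or a fatal error is left to the calling tool; the sketch only reports which elements are not homogeneous.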

kosloot commented 5 years ago

See also issue #42.