proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.
http://proycon.github.io/folia/
GNU General Public License v3.0

How to represent quotations in FoLiA #70

Closed kosloot closed 5 years ago

kosloot commented 5 years ago

Are quotation marks part of the quote? I wonder which representation would be correct for this text:

a "quote" here

Should the (simplified) FoLiA be:

<w xml:id="w1" class="WORD">
  <t>a</t>
</w>
<w xml:id="w2" class="PUNCTUATION" space="no">
  <t>"</t>
</w>
<quote xml:id="foliatest.p.1.s.2.quote.1">
  <w xml:id="w3" class="WORD" space="no">
    <t>quote</t>
  </w>
</quote>
<w xml:id="w4" class="PUNCTUATION">
  <t>"</t>
</w>
<w xml:id="w5" class="WORD">
  <t>here</t>
</w>

Or would this be better:

<w xml:id="w1" class="WORD">
  <t>a</t>
</w>
<quote xml:id="foliatest.p.1.s.2.quote.1">
  <w xml:id="w2" class="PUNCTUATION" space="no">
    <t>"</t>
  </w>
  <w xml:id="w3" class="WORD" space="no">
    <t>quote</t>
  </w>
  <w xml:id="w4" class="PUNCTUATION">
    <t>"</t>
  </w>
</quote>
<w xml:id="w5" class="WORD">
  <t>here</t>
</w>

The first version is produced by Ucto at the moment.

proycon commented 5 years ago

Good question; we don't enforce either way. The current documentation and example place the quotation-mark tokens outside the quote element.

kosloot commented 5 years ago

I would tend to prefer the latter version. That would, for instance, make removing or ignoring quotations easier.
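
To illustrate why the second layout makes ignoring quotations easier, here is a minimal Python sketch using only the standard library. It works on the simplified, namespace-free XML from this issue wrapped in a hypothetical `<s>` element, not on real FoLiA documents, and `tokens_without_quotes` is a name made up for this example rather than part of any FoLiA library.

```python
import xml.etree.ElementTree as ET

# Simplified input (assumption: no FoLiA namespace, flat sentence).
# Layout 1: quotation marks are siblings of the quote element.
MARKS_OUTSIDE = """
<s>
  <w xml:id="w1" class="WORD"><t>a</t></w>
  <w xml:id="w2" class="PUNCTUATION" space="no"><t>"</t></w>
  <quote xml:id="q1">
    <w xml:id="w3" class="WORD" space="no"><t>quote</t></w>
  </quote>
  <w xml:id="w4" class="PUNCTUATION"><t>"</t></w>
  <w xml:id="w5" class="WORD"><t>here</t></w>
</s>
"""

# Layout 2: quotation marks are children of the quote element.
MARKS_INSIDE = """
<s>
  <w xml:id="w1" class="WORD"><t>a</t></w>
  <quote xml:id="q1">
    <w xml:id="w2" class="PUNCTUATION" space="no"><t>"</t></w>
    <w xml:id="w3" class="WORD" space="no"><t>quote</t></w>
    <w xml:id="w4" class="PUNCTUATION"><t>"</t></w>
  </quote>
  <w xml:id="w5" class="WORD"><t>here</t></w>
</s>
"""


def tokens_without_quotes(sentence):
    """Return the token texts of a sentence, skipping whole quote elements."""
    texts = []
    for child in sentence:
        if child.tag == "quote":
            continue  # drop the quotation and everything inside it
        if child.tag == "w":
            texts.append(child.findtext("t"))
    return texts


print(tokens_without_quotes(ET.fromstring(MARKS_OUTSIDE)))
print(tokens_without_quotes(ET.fromstring(MARKS_INSIDE)))
```

With the marks outside the quote element, skipping the element leaves two orphaned quotation-mark tokens behind; with the marks inside, the quotation disappears as a unit.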

antalvdb commented 5 years ago

It would make sense to include the quotation marks inside the quote (and to adapt the documentation with the example).

proycon commented 5 years ago

OK, since we're about to release the new FoLiA, this is probably good timing for such a change in convention. It has to remain a convention, though; enforcing it would break backward compatibility.

kosloot commented 5 years ago

So it is probably a good idea to have a section in the manual with guidelines about 'conventions' and 'fair use' of FoLiA, as it is not feasible to check for all kinds of abuse. All FoLiA tools should adapt to these conventions and accept such 'conforming' documents. They may choose not to fully handle documents that are 'valid FoLiA' but do not conform to these rules, although they shouldn't crash or misbehave by mangling the FoLiA. Issuing warnings, ignoring parts, or simply exiting is acceptable.
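
As a rough sketch of the "warn, don't crash" behaviour described above, a tool could check the quote convention and report deviations without modifying the document. This again assumes the simplified, namespace-free XML used in this issue; `check_quote_convention` and the `QUOTE_MARKS` set are hypothetical and not part of any existing FoLiA tool.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Assumption: a small, non-exhaustive set of quotation-mark characters.
QUOTE_MARKS = {'"', "'", "\u201c", "\u201d", "\u2018", "\u2019", "\u00ab", "\u00bb"}


def check_quote_convention(root):
    """Return warnings for quote elements that do not begin and end
    with a quotation-mark token. The document itself is left untouched."""
    warnings = []
    for quote in root.iter("quote"):
        texts = [w.findtext("t") for w in quote.iter("w")]
        conforming = (
            len(texts) >= 2
            and texts[0] in QUOTE_MARKS
            and texts[-1] in QUOTE_MARKS
        )
        if not conforming:
            warnings.append(
                f"{quote.get(XML_ID)}: quotation marks are not inside the quote element"
            )
    return warnings


# A valid but non-conforming fragment: the marks sit outside the quote.
NON_CONFORMING = """
<s>
  <w xml:id="w2" class="PUNCTUATION" space="no"><t>"</t></w>
  <quote xml:id="q1">
    <w xml:id="w3" class="WORD" space="no"><t>quote</t></w>
  </quote>
  <w xml:id="w4" class="PUNCTUATION"><t>"</t></w>
</s>
"""

for message in check_quote_convention(ET.fromstring(NON_CONFORMING)):
    print("warning:", message)
```

Returning the warnings as a list leaves the policy to the caller, which can print them, skip the affected quotes, or exit.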

kosloot commented 5 years ago

I think this is settled.