proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library as well as the full documentation, validation schemas, and set definitions.
http://proycon.github.io/folia/
GNU General Public License v3.0

some questions regarding the new <t-hspace> tag #95

Open kosloot opened 3 years ago

kosloot commented 3 years ago

A <t-hspace> tag was recently introduced, but when I started using it, some questions arose:

  1. It is possible to add text to a <t-hspace>, like this: <t-hspace>extra text</t-hspace>. This is accepted by foliavalidator and folialint, but the text doesn't show up in text() output, which is probably OK. In libfolia it DOES show up, which I assume is a bug. But shouldn't we disallow this construct altogether, to avoid strange effects and misunderstandings?
  2. There are NO predefined class values for <t-hspace>. I understand the rationale, but that puts a big burden on all tools that would like to make use of it: they all have to write their own text() extraction functions, and would be helped a lot by a predefined set that the libraries support, like "tab", "space", "wide-space", or similar. I realize that defining such a set might be a challenge, but still. The text() function is very complex and replicating it is cumbersome (as the handling of the 'tag' feature already showed us). Another possibility might be a way of providing a translation table for those class values:
       tab ==> '\t'
       space ==> ' _'
       wide-space ==> ' __'
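
To make the translation-table idea concrete, here is a minimal C++ sketch. Everything in it is hypothetical: `HSpaceTable`, `default_hspace_table()` and `hspace_text()` are not part of libfolia, the class names are only the ones floated above, and the replacement strings (plain whitespace) and the single-space fallback for unknown classes are assumptions, not an agreed-upon set.

```cpp
#include <map>
#include <string>

// Hypothetical translation table: <t-hspace> class value -> whitespace to
// emit in text output. Not a libfolia type; no such set is defined yet.
using HSpaceTable = std::map<std::string, std::string>;

// Illustrative defaults only. Both the class names and the replacement
// strings are assumptions; a tool could pass in its own table instead.
inline HSpaceTable default_hspace_table() {
    return {
        {"tab", "\t"},
        {"space", " "},
        {"wide-space", "  "},
    };
}

// Look up the whitespace for a class value. Unknown classes fall back to a
// single space (an assumption; rejecting them would be equally defensible).
inline std::string hspace_text(const HSpaceTable& table, const std::string& cls) {
    const auto it = table.find(cls);
    return it != table.end() ? it->second : std::string(" ");
}
```

With a table like this, a text extraction routine would only need one lookup per <t-hspace>, and a tool that uses its own class values could supply its own table rather than reimplementing text().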
proycon commented 3 years ago
  1. Good point, this is indeed not intentional and should be disallowed.
  2. We could define a set, implement some support for it in the libraries, and recommend its usage. It's then simply up to users whether they decide to use that set or not (i.e. it'll be an opt-in choice).
kosloot commented 3 years ago

> Good point, this is indeed not intentional and should be disallowed.

Maybe the same holds for a few of the other text markup tags too?

> We could define a set, implement some support for it in the libraries, and recommend its usage. It's then simply up to users whether they decide to use that set or not (i.e. it'll be an opt-in choice).

That would be great. That leaves us with the challenge of creating a reasonable set.

kosloot commented 3 years ago

We can simply forbid text in a TextMarkupHSpace by adding 1 line in folia_properties.cxx:

    //------ TextMarkupHSpace -------
    TextMarkupHSpace::PROPS = AbstractTextMarkup::PROPS;
    TextMarkupHSpace::PROPS.ACCEPTED_DATA.erase( XmlText_t );  // <=== 1 extra line
    TextMarkupHSpace::PROPS.ELEMENT_ID = TextMarkupHSpace_t;

But maybe this is not generic enough?

Otherwise, XmlText_t could be removed from AbstractTextMarkup::PROPS and explicitly added for the subclasses it applies to?

proycon commented 3 years ago

Generally we have the TEXTCONTAINER property for this. ACCEPTED_DATA only carries FoLiA elements in my implementations.

kosloot commented 3 years ago

Ah, right. That is a better solution, and it works:

folialint tests/bug59.xml
tests/bug59.xml failed: XML error: found extra text 'test' inside element <t-hspace>, NOT allowed there.

The input contained:

    <div xml:id="example.div.4" class="section" n="4">
      <t>Space,<t-hspace>test</t-hspace>the<t-hspace/>final<t-hspace/><t-hspace/>frontier</t>
    </div>
kosloot commented 3 years ago

OK, but there is still room for rather suspicious constructions like:

      <t>Space,<t-hspace><t-str>test</t-str><t-hbr>what</t-hbr></t-hspace>the<t-hspace/>final<t-hspace/><t-hspace/>frontier</t>

This passes folialint and foliavalidator, and both folia2txt and FoLiA-2text ignore everything inside the <t-hspace>, but it is still confusing and should be rejected, imho.

proycon commented 3 years ago

Agreed
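
The stricter rule discussed above (a <t-hspace> may contain neither text nor child elements) could be sketched roughly as follows. This is only an illustration on a simplified, hypothetical tree: `Node` and `check_hspace_empty()` are made-up names and do not reflect how libfolia represents or validates elements.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified model of a parsed element. NOT libfolia's data
// structure; a text node is modelled as a Node with an empty tag.
struct Node {
    std::string tag;             // XML tag name, e.g. "t-hspace"; empty for text
    std::string text;            // character data (text nodes only)
    std::vector<Node> children;  // child elements and text nodes, in order
};

// Walk the tree and record an error for every <t-hspace> that has any
// content at all, whether plain text or nested markup such as <t-str>.
inline void check_hspace_empty(const Node& node, std::vector<std::string>& errors) {
    if (node.tag == "t-hspace" && !node.children.empty()) {
        errors.push_back("<t-hspace> must be empty, found " +
                         std::to_string(node.children.size()) + " child node(s)");
    }
    for (const Node& child : node.children) {
        check_hspace_empty(child, errors);
    }
}
```

Under this rule both the plain-text case from tests/bug59.xml and the nested <t-str>/<t-hbr> case would be reported, while an empty <t-hspace/> passes.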