text inside <t-hbr> nodes is allowed, but problematic

kosloot commented 1 year ago

Given this example:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="hyp" generator="libfolia-v2.13" version="2.5.1">
  <metadata type="native">
    <annotations>
      <hyphenation-annotation set="tst"/>
      <style-annotation set="tst"/>
      <paragraph-annotation set="tst"/>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
    </annotations>
  </metadata>
  <text xml:id="hyp.text">
    <p xml:id="hyp.p.1">
      <t class="FoLiA-txt">Appel<t-hbr>peren</t-hbr>taart</t>
    </p>
  </text>
</FoLiA>

folia2txt produces: Appeltaart Which is maybe consistent with the docs , although it is not explicitly forbidden (should isn't may not) FoLiA-2text gives: Appelperentaart, so does interpret the embedded text.

My problem with this is, dat NOT including the text violates the principle of least surprise. By just looking in the text of the <p> you might expect the peren to show up.

The Best solution imho is to explicitly forbid a text content inside a <t-hbr> an give an error when it is attempted.

proycon commented 1 year ago

I think the idea was that could contain the actual hyphenation symbol that was used in the text, reflecting that it is an actual symbol appearing in the text. You'd need it again if you'd want to serialize the text exactly as it was. If we'd forbid it entirely as you suggest (which makes things simpler indeed), we'd need to assume there is one generic hyphenation symbol (e.g. a hyphen) that we can use in case the user wants the 'original' text exactly.

By default, foliapy indeed resolves the <t-hbr>, I don't think there's currently an implementation that reconstructs the data including hyphens, but at least it's a theoretical possibility.

I'm also a bit hesistant to simplify this now after the fact as it'd break backward compatibility (I don't whether there a are a lot of documents out there using this, probably not, but can't be sure).

kosloot commented 1 year ago

I think the idea was that could contain the actual hyphenation symbol that was used in the text, reflecting that it is an actual symbol appearing in the text.

Which was exactly what I tried for FoLiA-abby, leading to this discussion. As we already worked out, the hyphenation symbol can better be handled in the class feature. Like: <t-hbr class="¬"/>

By default, foliapy indeed resolves the <t-hbr>,

Where `resolves' implies, IGNORING completely. FoliaPY doesn't implement a feature to include it. And libfolia lacks a way of expressing to ignore I suggest amending libfolia to ignore the text value (and maybe give a warning?) and leave a mechanism to include the content as a wish for both foliaPY and libfolia. For whenever the need comes up.

proycon commented 1 year ago

Looks good yes!

kosloot commented 1 year ago

In hindsight, I think it is much better to NOT ABUSE the class feature for storing the actual hyphen. I strongly suggest to put the hyphen as a value in the <t-hbr/>, as I previously did. e.g. <t-hbr>¬</t-hbr> We then have to make an extra text-retrieving option to include that, otherwise ignored, text. We need such an option anyway if we want to extract the value from the class. Such an option is very easy to implement. (at least in libfolia).

That an evil user could use that value to store more than just a hyphen is not really a problem imho

As a bonus, we could allow text inside a <t-hspace>, for instance to store all kind of Unicode space characters, or sequences of more than one space. Those will be filtered out of the "normalized" text anyway.

proycon commented 1 year ago

I agree with your assessment. What is in the text should be text, so putting them in the xml text was the good decision, as often these hyphens appear verbatim in the text. Even though by default we may want to ignore this text.

That an evil user could use that value to store more than just a hyphen is not really a problem imho

Agreed

As a bonus, we could allow text inside a , for instance to store all kind of Unicode space characters, or sequences of more than one space. Those will be filtered out of the "normalized" text anyway.

That makes sense yes. It does raise some difficulties due to the way XML deals with whitespace if the element were to be split over multiple lines.

kosloot commented 1 year ago

Ok, I will stick to that solution then. ``

It does raise some difficulties due to the way XML deals with whitespace if the element were to be split over multiple lines. ``

Well, I was referring to a <t-hspace> , which is documented te be for horizontal spacing. Not <whitespace> which should be used for vertical spacing. I agree that a malicious user could smuggle linefeed or pagefeed characters into the <t> inside <t-hspace>. So be it. In "normal use" (including standard text() extraction, we will discard them. For the specialized text() extraction I made, which tries to restore as much as possible, they will be retrieved "as is", which should be fine.

I will implement this and create some simple examples for testing

kosloot commented 1 year ago

I created a simple example to demonstrate what libfolia is capable of now:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="markup" generator="libfolia-v2.14" version="2.5.1">
  <metadata type="native">
    <annotations>
      <sentence-annotation set="FoLiA-txt-set"/>
      <paragraph-annotation set="FoLiA-txt-set"/>
      <linebreak-annotation set="FoLiA-txt-set"/>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
      <hyphenation-annotation set="FoLiA-txt-set"/>
      <hspace-annotation set="FoLiA-txt-set"/>
    </annotations>
  </metadata>
  <text xml:id="markup.text">
    <p xml:id="markup.p.1">
      <s xml:id="markup.p.1.s.1">
        <t>Dit is een test. Met enkele afgebroken zin<t-hbr>-</t-hbr><br space="no"/>nen. Dit is de eerste. Met ook een soft af<t-hbr>¬</t-hbr><br space="no"/>breking.<br/></t>
      </s>
      <s xml:id="markup.p.1.s.2">
        <t>En dit heeft een lange'<t-hspace>       </t-hspace>'space val.<br/></t>
      </s>
      <s xml:id="markup.p.1.s.3">
        <t>En dit heeft een enkele'<t-hspace/>'space val.<br/></t>
      </s>
      <s xml:id="markup.p.1.s.4">
        <t>En dit heeft een speciale'<t-hspace>⍽</t-hspace>'space val.<br/></t>
      </s>
    </p>
  </text>
</FoLiA>

Both folialint and foliavalidator accept this file. OK Both FoLiA-2text and folia2txt parse this file and produce:

Dit is een test. Met enkele afgebroken zin
nen. Dit is de eerste. Met ook een soft af
breking.
En dit heeft een lange' 'space val.
En dit heeft een enkele' 'space val.
En dit heeft een speciale' 'space val.

NOTE @proycon : folia2text slides in some leading spaces, after the first sentence. A minor bug? Besides that: OK

BUT FoLiA-2text can now also produce, (using --restore-formatting):

Dit is een test. Met enkele afgebroken zin-
nen. Dit is de eerste. Met ook een soft af¬
breking.
En dit heeft een lange'       'space val.
En dit heeft een enkele' 'space val.
En dit heeft een speciale'⍽'space val.

Which is very cool, I think.

Second NOTE: This example is handcrafted, at the moment there is no DIRECT way to create a FoLiA file with "special" <t-hspace> nodes. The only tool that I have that does this, is FoLiA-abby.

kosloot commented 1 year ago

@pirolen Please note: We reverted back to the solution before January 17th. So Abby files produced with the GIT version are not usable (although still valid) The implementation should now be OK (but needs more testing) Also FoLiA-2text seems to work OK with the new --restore-formatting option (as shown above)

sorry for the inconvenience. Hope you can test the Git version.

proycon / foliapy

text inside <t-hbr> nodes is allowed, but problematic #25