Problems with leading/trailing whitespace in text content

proycon commented 3 years ago

We had an extensive earlier discussion on this in #34, but an issue popped up.

foliatextcontent produces FoLiA like the following:

<p xml:id="FH-OllevierGeets-1.tif.text.par_1_21">
       <t>
         <t-str id="FH-OllevierGeets-1.tif.text.par_1_21.word_1_233">INTRODUCTION</t-str>
       </t>
       <t class="OCR">
         <t-str id="FH-OllevierGeets-1.tif.text.par_1_21.word_1_233">INTRODUCTION</t-str>
       </t>
       <str xml:id="FH-OllevierGeets-1.tif.text.par_1_21.word_1_233">
         <t offset="0">INTRODUCTION</t>
         <t offset="0" class="OCR">INTRODUCTION</t>
         <relation format="text/hocr+xml" xlink:href="OllevierGeets-1.hocr" xlink:type="simple">
           <xref id="word_1_233" type="str"/>
         </relation>
       </str>
     </p>

folialint stumbles on this with a text consistency problem:

ticcl_output/OllevierGeets.ticcl.folia.xml failed: Unresolvable text: Text for str(ID=FH-OllevierGeets-1.tif.text.par_1_21.word_1_233, textclass='current'), has incorrect offset 0
        original msg=Unresolvable text: Reference (ID FH-OllevierGeets-1.tif.text.par_1_21,class='current') found, but no text match at offset=0 Expected 'INTRODUCTION' but got '
        INT'

Because of the newline and indentation, the offset is considered wrong, as the text is assumed to be "\n\s\s\s\s\s\s\s\sINTRODUCTION".

foliavalidator stumbles over something identical but later on (different order of evaluation perhaps?):

TEXT VALIDATION ERROR: Text for String, ID FH-OllevierGeets-4.tif.text.par_1_36.word_1_708, textclass OCR, has incorrect offset 0 or invalid reference
VALIDATION ERROR on full parse by library (stage 2/3), in OllevierGeets.ticcl.folia.xml
UnresolvableTextContent: Reference (ID FH-OllevierGeets-4.tif.text.par_1_36, class=OCR) found but no text match at specified offset (0)! Expected 'DISCUSSION', got '
        D'

As addressed in #34, the offsets do not involve any kind of space normalization by default. Consider a text like:

<s>
    <t>This is
         a sentence</t>
</s>

This really means "This is\n\s\s\s\s\s\s\s\s\sa sentence" and not "This is a sentence".
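To make this concrete, here is a minimal sketch using only Python's standard xml.etree.ElementTree (so not foliapy or libfolia). It shows that the XML parser hands back the newline and the indentation as part of the text content. The fragment is the sentence from above, with real spaces where the notation above writes \s.

```python
import xml.etree.ElementTree as ET

# The sentence fragment from above: a line break followed by 9 spaces of
# indentation sits inside the <t> element.
fragment = "<s>\n    <t>This is\n         a sentence</t>\n</s>"

text = ET.fromstring(fragment).find("t").text

# No whitespace normalization happens at the XML level, so the newline and
# the indentation are simply part of the string.
print(repr(text))
print(text.index("a sentence"))  # offset of the second line's content
```

Any offset that points into this text therefore has to count the newline and the indentation as well.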

However, I think we should be able to strip leading and trailing whitespace from the text as a whole, so the fragment below should be semantically identical to the first fragment. That it turned into the fragment above is probably because of standard XML prettification algorithms.

<p xml:id="FH-OllevierGeets-1.tif.text.par_1_21">
       <t><t-str id="FH-OllevierGeets-1.tif.text.par_1_21.word_1_233">INTRODUCTION</t-str></t>
       <t class="OCR"><t-str id="FH-OllevierGeets-1.tif.text.par_1_21.word_1_233">INTRODUCTION</t-str></t>
       <str xml:id="FH-OllevierGeets-1.tif.text.par_1_21.word_1_233">
         <t offset="0">INTRODUCTION</t>
         <t offset="0" class="OCR">INTRODUCTION</t>
         <relation format="text/hocr+xml" xlink:href="OllevierGeets-1.hocr" xlink:type="simple">
           <xref id="word_1_233" type="str"/>
         </relation>
       </str>
     </p>

Just like we don't allow empty texts, I think we can probably strip leading and trailing spaces (=emptiness) when doing text validation and offset computation (this does not affect any intermediate spaces, also not in multiline content!).
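A rough sketch of what this proposal amounts to, in plain Python. The function name matches_at_offset and its strip parameter are made up for illustration; this is not the actual check in foliapy or libfolia.

```python
def matches_at_offset(parent_text, sub_text, offset, strip=True):
    """Check whether sub_text occurs in parent_text at the given offset.

    With strip=True, leading/trailing whitespace of the parent text as a
    whole is ignored first; whitespace in the middle is left untouched.
    """
    if strip:
        parent_text = parent_text.strip()
    return parent_text.startswith(sub_text, offset)


# Text of the pretty-printed paragraph: newline plus indentation around it.
pretty_printed = "\n        INTRODUCTION\n      "

print(matches_at_offset(pretty_printed, "INTRODUCTION", 0, strip=False))
print(matches_at_offset(pretty_printed, "INTRODUCTION", 0))

# Intermediate whitespace in multiline content survives the stripping.
multiline = "\n    This is\n         a sentence\n"
print(repr(multiline.strip()))
```

Without stripping, the check at offset 0 fails in the same way folialint and foliavalidator report above; with stripping it succeeds, while the whitespace inside the multiline text is kept as-is.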

martinreynaert commented 3 years ago

I agree. I think it is far better to normalize these. Thanks!

pirolen commented 3 years ago

Dear Maarten,

I was just about to point out a whitespace issue when using ucto, though I'm not sure if it is fully related. There are whitespace insertions and deletions. Where shall I report this?

Thanks & cheers, Piroska

On Dec 8, 2020, at 2:15 PM, Maarten van Gompel notifications@github.com wrote:


proycon commented 3 years ago

@pirolen If you think it's a tokenisation issue then it's best to put it in https://github.com/LanguageMachines/ucto/issues . If you're referring to insertion/deletion corrections in FLAT then best to put it in https://github.com/proycon/flat/issues

kosloot commented 3 years ago

I tried to reproduce this problem, but folialint failed to fail

Are you sure this isn't already fixed on Nov 17:

commit 64218577550c6f3763dbbc75f668252fd4f3f03d
Author: Ko van der Sloot K.vanderSloot@let.ru.nl
Date: Tue Nov 17 15:38:41 2020 +0100

Fixed problem with text-conststency errors for within

Or maybe it is very related?

UPDATE: Sorry. :{ I was able to get an error using your example: issue88.2.4.1.folia.xml

proycon commented 3 years ago

I think I tackled this now in libfolia as well, I'll continue by testing it in the PICCL context where the issue emerged.

proycon commented 3 years ago

I'm afraid our problems with whitespace are not over yet. I take the example @kosloot gave in LanguageMachines/foliautils#56.

This output has been formatted this way by libxml2 itself, but this formatting is not compatible with the FoLiA assumptions we held until now:

       <t>
        <t-str xml:id="text.p.1.t-str.1">
          <t-style>deel<t-hbr/></t-style>
        </t-str>
        <t-str xml:id="text.p.1.t-str.2">
          <t-style>woord</t-style>
        </t-str>
        <t-str>extra</t-str>
      </t>

With the current rules we applied, the text representation that both foliapy and libfolia give is:

deel
        woord
        extra

Also if we simplify the example to:

       <t>
          <t-style>deel<t-hbr/></t-style>
          <t-style>woord</t-style>
          <t-str>extra</t-str>
      </t>

We get that same result.

The extra bonus is that as soon as we add a space prior to the word extra, libxml2 serializes the whole <t> block on a single line!! This is far more in line with what we intend for FoLiA (except for the fact that the leading space would be stripped).

I don't think the text representations are good as they are, with all the indentation, and I think what we're getting now is at odds with how XML sees things. I think what we want in this case is one of two options:

1. We want the text "deelwoordextra" (without any intermediate spaces), so stripping ALL the initial and trailing spaces outside the markup elements.
2. The alternative interpretation is to go for the text "deel woord extra", with a single space between all the parts. This would be in line with what HTML does:

<span>
    <span>deel</span>
    <span>woord</span>
    <span>extra</span>
</span>

(see https://download.anaproy.nl/deelwoordextra.html)

If we go for option 1, this does beg the question how we would represent a space if we do want it, say for example between woord and extra. I think the solution to that would be:

       <t>
          <t-style>deel<t-hbr/></t-style>
          <t-style>woord</t-style> <t-str>extra</t-str>
      </t>

If we go for option 2, then it begs the question how we would represent the non-spaced scenario. The solution would be:

       <t>
          <t-style>deel<t-hbr/></t-style>
          <t-style>woord</t-style><t-str>extra</t-str>
      </t>

I think we're currently closer to option 1 in our interpretations, but I need to investigate whether option 2 isn't the more natural XML interpretation (after all, it's what HTML does too). Whatever we choose, we have to take into account the fact that we didn't impose this strictness before, and therefore be lenient so as not to break older files, as addressed in issue #92.
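To compare the two interpretations side by side, here is a small sketch with Python's standard library. The functions option1 and option2 are hypothetical helpers written for this comment, not foliapy code, and the <t-hbr/> is left out of the fragment because its text representation is undecided.

```python
import re
import xml.etree.ElementTree as ET

# Simplified, pretty-printed fragment (without the <t-hbr/>).
FRAGMENT = """<t>
   <t-style>deel</t-style>
   <t-style>woord</t-style>
   <t-str>extra</t-str>
</t>"""


def option1(elem):
    """Option 1: drop whitespace runs that contain a newline (indentation)."""
    raw = "".join(elem.itertext())
    return re.sub(r"[ \t]*\n\s*", "", raw)


def option2(elem):
    """Option 2: collapse every whitespace run to one space, as HTML does."""
    raw = "".join(elem.itertext())
    return re.sub(r"\s+", " ", raw).strip()


root = ET.fromstring(FRAGMENT)
print(repr(option1(root)))
print(repr(option2(root)))
```

Under this sketch, option 1 keeps a space only when it sits on the same line between two elements, and option 2 avoids a space only when two elements are directly adjacent, which matches the two workarounds shown above.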

Of course, the one-line solution avoids all these problems in all cases and is the simplest, but it's apparently not what libxml2 prefers to output (pretty formatting), nor something we can expect users to adhere to:

       <t><t-style>deel<t-hbr/></t-style><t-style>woord</t-style> <t-str>extra</t-str></t>

It would be good if we had a way to normalize our FoLiA's to force this one-line representation (as an extra tool), because it would be a valuable preprocessing step that can solve issues like proycon/foliatools#29 and make things easier for parsers that can't deal with all these complexities.

kosloot commented 3 years ago

Hmm, it truly is complex. I ponder about the <t-hbr/> in your example. Shouldn't that yield

deelwoord extra

deelwoord
extra

or

deel-
woord
extra

or such? Anyway not just a space after 'deel' I assume, but some representation of the <t-hbr/>.

proycon commented 3 years ago

Ah yes, possibly, I didn't consider any representation of t-hbr . I don't think we currently represent it even, do we? Let's save that for another issue :)

kosloot commented 3 years ago

Well, it was the source for https://github.com/LanguageMachines/foliautils/issues/56 One of the heads of this dragon

pirolen commented 3 years ago

After tokenization with ucto, the t-hbr is gone/turned into a token boundary. In my ideal workflow, the soft break would stay recoverable (and propagatable to FLAT and folia2html), if possible at all.

proycon commented 3 years ago

It would help if we could discuss the t-hbr stuff in #56 or a separate ucto issue yes.

kosloot commented 3 years ago

I will respond to @pirolen in https://github.com/LanguageMachines/foliautils/issues/56

proycon commented 3 years ago

Another real-life example (from Nederlab DBNL) to consider in this context, hints also at solution 2:

<t>I <t-style class="i">Buiten- en binnenlandse hoogleraren, lectoren en
                             oud-docenten in de neerlandistiek, sprekers, bestuurs- en stafleden van
                             de IVN</t-style>.</t>

kosloot commented 3 years ago

Maybe we could allow the space="no" attribute on <t-*> elements? (Maybe on ALL AbstractTextMarkup?)

proycon commented 3 years ago

That would work, but I'm very reluctant to add further complexity to an already complex matter.

kosloot commented 3 years ago

ok, but we would use existing constructions, and complexity is already proven :)

proycon commented 3 years ago

It's also relevant to look at how TEI solves this issue by the way, we can use a lot from there and aim to do something similar: https://wiki.tei-c.org/index.php/XML_Whitespace

proycon commented 3 years ago

I added a test file that I think describes our desired situation, here we pick option 2. All of the following paragraphs are then functionally identical when it comes to text serialisation, the text being deel woord extra for all.

     <p xml:id="test.p.1">
       <t>
        <t-str>
          <t-style>deel</t-style>
        </t-str>
        <t-str>
          <t-style>woord</t-style>
        </t-str>
        <t-str>extra</t-str>
      </t>
    </p>

     <p xml:id="test.p.2">
       <t>
          <t-style>deel</t-style>
          <t-style>woord</t-style>
          <t-str>extra</t-str>
      </t>
    </p>

     <p xml:id="test.p.3">
       <t>
           deel
           woord
           extra
      </t>
    </p>

     <p xml:id="test.p.4">
       <t>
          <t-style>deel</t-style> <t-style>woord</t-style> <t-str>extra</t-str>
      </t>
    </p>

     <p xml:id="test.p.5">
       <t>
          <t-style>  deel  </t-style> <t-style>   woord </t-style> <t-str>      extra</t-str>
      </t>
      <comment>all leading/trailing spacing is removed (functionally identical to p.4)</comment>
    </p>

Unlike HTML, I think we do want to retain the ability to handle multiple spaces (to represent an untokenised original accurately without needing to resort to CDATA). But this is up for debate. Consider this example which is different from the five before:

     <p xml:id="test.p.6">
       <t>
          <t-style>deel</t-style>  <t-style>woord</t-style>  <t-str>extra</t-str>
      </t>
      <comment>There are DOUBLE spaces between the three words, which should be preserved</comment>
    </p>
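For readers who want to experiment with these rules, here is a rough sketch in plain Python. It is an assumption-laden illustration and not the foliapy implementation: the function serialise is made up, it only handles the cases in the test paragraphs above, and it ignores things like <t-hbr/> and backward compatibility.

```python
import re
import xml.etree.ElementTree as ET


def serialise(elem):
    """Hypothetical text serialisation of a <t> or text markup element."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(serialise(child))  # child text, already stripped
        parts.append(child.tail or "")  # text/whitespace after the child
    raw = "".join(parts)
    # A whitespace run that contains a newline (pretty-printing) counts as
    # one space; runs of plain spaces on one line are left alone.
    collapsed = re.sub(r"[ \t]*\n\s*", " ", raw)
    # Leading/trailing whitespace of each element is dropped.
    return collapsed.strip()


P2 = """<t>
   <t-style>deel</t-style>
   <t-style>woord</t-style>
   <t-str>extra</t-str>
</t>"""
P5 = ("<t>\n  <t-style>  deel  </t-style> <t-style>   woord </t-style>"
      " <t-str>      extra</t-str>\n</t>")
P6 = ("<t>\n  <t-style>deel</t-style>  <t-style>woord</t-style>"
      "  <t-str>extra</t-str>\n</t>")

for fragment in (P2, P5, P6):
    print(repr(serialise(ET.fromstring(fragment))))
```

In this sketch the p.2 and p.5 fragments give the same text with single spaces, while the double spaces of p.6 survive because they are not part of a line break.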

I'm already mostly there implementation-wise in foliapy. It does stir things up, so I suggest we bump FoLiA to 2.5 for this. The text serialisation is different compared to the pre-v2.4.1 situation (for which we have some backward compatibility checks, as introduced in #92), but also compared to the situation we created between v2.4.1 and v2.5.0 (I'm hoping we don't need special implementations there).

proycon commented 3 years ago

This is the new folia2txt output for the issue88b example, and I think it's correct now:

deel woord extra

deel woord extra

deel woord extra

deel woord extra

deel woord extra

deel woord extra

deel  woord  extra

  deel      woord        extra

deelwoord extra

deel
woord
extra

I Buiten- en binnenlandse hoogleraren, lectoren en oud-docenten in de neerlandistiek, sprekers, bestuurs- en stafleden van de IVN.

I Buiten- en binnenlandse hoogleraren, lectoren en oud-docenten in de neerlandistiek, sprekers, bestuurs- en stafleden van de IVN .

Es entspricht einerseits nicht den Erwartungen derjenigen, welche in betreff der Lage der Landarbeiter nur solche

proycon commented 3 years ago

A remaining issue, raised by @kosloot, is whether we should actively normalize the more exotic Unicode spaces (see https://en.wikipedia.org/wiki/Whitespace_character#Unicode) to a normal space. This is probably a good idea, but we may need to introduce an explicit <t-hspace> element in case people want to explicitly specify things like space width.
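As an illustration of what such a normalization could look like, here is a sketch using Python's standard unicodedata module. Treating exactly the Unicode category Zs (space separators) as "exotic spaces" is my assumption here, not a decision made in this thread, and normalise_spaces is a made-up name.

```python
import unicodedata


def normalise_spaces(text):
    """Replace every Unicode space separator (category Zs) with U+0020."""
    return "".join(
        " " if unicodedata.category(ch) == "Zs" else ch for ch in text
    )


# NO-BREAK SPACE (U+00A0) and EM SPACE (U+2003) are both in category Zs.
sample = "deel\u00a0woord\u2003extra"
print(normalise_spaces(sample))

# ZERO WIDTH SPACE (U+200B) is a format character (Cf), so it is kept.
print(repr(normalise_spaces("item1\u200b2")))
```

Note that the width information is lost in the process, which is exactly the case where an explicit element would be needed.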

pirolen commented 3 years ago

Thanks! Some more test examples from me would include superscript styling, where the superscripted characters would ideally be adjacent without whitespace to their context on the left and sometimes right, in examples 2 and 3:

1.

<t-str xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_005.text.div1.p2.t-str.5">
            <t-style><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>im wirtschaftlichen Interessenkampf gegen die Agrarpartei verwert<t-hbr/></t-style>
          </t-str>
          <t-str xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_005.text.div1.p2.t-str.6">
            <t-style><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>baren Schauergemälde bieten</t-style>
            <t-style><feat subset="font_typeface" class="superscript"/><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>6</t-style>
            <t-style><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>, oder welche die Agrarverhältnisse</t-style>
          </t-str>

2.

        <t class="OCR">
          <t-str xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_005.text.div1.p3.t-str.1">
            <t-style><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>Um nicht gewisse Bemerkungen über die Arbeitsverfassung im</t-style>
          </t-str>
          <t-str xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_005.text.div1.p3.t-str.2">
            <t-style><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>ganzen</t-style>
            <t-style><feat subset="font_typeface" class="superscript"/><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>1</t-style>
            <t-style><feat subset="font_family" class="Times New Roman"/><feat subset="font_size" class="15."/><feat subset="font_style" class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}"/>) bei jedem einzelnen Bezirk wiederholen zu müssen, habe</t-style>
          </t-str>
        </t>

3.

        <t class="OCR">
          <t-str xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_005.text.div1.p4.t-str.1">
            <t-style><feat class="superscript" subset="font_typeface"/><feat class="Times New Roman" subset="font_family"/><feat class="12." subset="font_size"/><feat class="{6B4F7D42-EA4B-4F65-B62C-458C902232DA}" subset="font_style"/>1</t-style>
            <t-style><feat class="Times New Roman" subset="font_family"/><feat class="12." subset="font_size"/><feat class="{6B4F7D42-EA4B-4F65-B62C-458C902232DA}" subset="font_style"/>) Grundlage bleibt nach wie vor in dieser Beziehung die Schrift von v.d. Goltz,</t-style>
          </t-str>
        </t>

proycon commented 3 years ago

@pirolen To accomplish that in the new situation, there can not be a newline between the two elements (so they must be on the same line). I think this is generated by FoLiA-abby right? We'll have to make sure it produces proper FoLiA in such cases.

pirolen commented 3 years ago

Yes, the examples come from FoLiA-abby.

kosloot commented 3 years ago

We have to look into this as soon as all text issues have been resolved. At the moment it is a moving target.

kosloot commented 3 years ago

Still I think we are getting into trouble anyway. To illustrate the dilemma a simplified example:

Original text: item1²

Possible FoLiA text (as @pirolen would like to see it, I suppose):

<t>
  <t-str>item1<t-style><feat class="superscript" subset="font_typeface"/>2</t-style></t-str>
</t>

When using ucto, a string will be extracted like this: item12. IMHO this is quite useless. For further processing, we need a way to "know" that the 2 isn't part of the word item1. Any ideas HOW to accomplish this? Inserting a space (or newline or such) in the FoLiA is a bit harsh, but still I would prefer item1 2 over item12.

In an ideal world, extracting text from the FoLiA would re-introduce the superscript, item1², but that would depend on the set used, and in general these sets can be user-defined, and are open, so any translation is possible.

I'm stuck here

proycon commented 3 years ago

I see the problem yes. Technically, following all the rules, the text serialisation item12 is correct. Inserting a space would be too harsh indeed. But I agree that from a tokenisation perspective you would indeed prefer to have item1 and 2 as different tokens. This would then indeed be a problem for the tokeniser (ucto) to tackle, but it is hard to get right and would make all kinds of assumptions we can't really make, so whatever we do would have to be an opt-in parameter I think.

In an ideal world, extracting text from the FoLiA would re-introduce the superscript, item1², but that would depend on the set used, and in general these sets can be user-defined, and are open, so any translation is possible.

Indeed, and in general styles don't transfer to plain text. You'd need a markup language for that (like Markdown). Properly interpreting styles in custom sets can only be done by the user. We certainly don't want text serialisation in FoLiA libraries to even attempt that.

pirolen commented 3 years ago

Superscript and subscript are the t-style classes that would imply a token boundary, the others don't (e.g. italic, bold). Maybe these two could be treated somewhat differently from the rest, so that they always encode a non-breaking boundary (which is not a whitespace boundary)? I guess t-hbr does not apply here, but perhaps something like https://en.wikipedia.org/wiki/Zero-width_joiner ?

kosloot commented 3 years ago

@pirolen: Maarten and I were thinking in the same direction. Another candidate would be the zero-width space. It's up to ucto and such to interpret that as a token separator.

@proycon To make this more generic: could we extend the t-style with an attribute like separator="true", which would make text extraction insert that joiner or zero-width space?

BUT: There is also another issue, text like: ²footnote text. Here the joiner/separator has to come AFTER the ². So maybe the only feasible way to do it is surrounding the text with a special symbol. It is really tricky.

pirolen commented 3 years ago

Would be nice if adding the special symbol around the t-style text element would solve it.

The whole phenomenon reminds me a bit of the choice of tags in sequence labeling, where one can use the prefixes I-O, B-I-O, etc. in combination with the applicable tag (like for a named entity), or simply use the name of the tag as the label. Each of the choices implicitly encodes a specific logic for the tools that ingest the labeled data (and for the humans who interpret them).

kosloot commented 3 years ago

More pondering on this: one problem with 'hidden' characters is their size. Do they count towards offsets and string length? For instance, assuming the separated attribute is implemented:

<t>
  <t-str>item1<t-style><feat class="superscript" subset="font_typeface" separated="yes"/>2</t-style></t-str><t-str>something</t-str>
</t>

(The original text was: item1² something)

What should we do here? I assume there is no need to insert 'hidden' characters in the FoLiA itself, but rather to implement the str() extraction function so that it does 'the right thing'. But for the fragment above, should str() render: item1 2 something OR item1*2* something, where * is a ZERO-WIDTH character, as we were suggesting?

This might raise a lot of problems later on. What is the offset of '2' in this string? 5 or 6? And 9 or 7 for 'something'? Same problems with the string length.
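The offset question can be made concrete with a few lines of plain Python. The three candidate renderings below are just the ones discussed here, with U+200B (ZERO WIDTH SPACE) standing in for the '*'; nothing in this snippet is libfolia or ucto behaviour.

```python
ZW = "\u200b"  # ZERO WIDTH SPACE, standing in for the '*' above

# Three candidate renderings of the same FoLiA fragment.
renderings = {
    "plain": "item12 something",
    "spaces": "item1 2 something",
    "zero-width": f"item1{ZW}2{ZW} something",
}

# For each rendering: offset of '2', offset of 'something', total length.
for name, text in renderings.items():
    print(name, text.index("2"), text.index("something"), len(text))
```

All three numbers shift depending on the rendering, even though the zero-width variant looks identical to the plain one on screen.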

Maybe the clearest solution is to implement the 'separated' attribute, with the semantics of: when extracting text, insert a space before AND after the styled token (while avoiding multiple spaces).

In this way we do not break any old behavior, and don't introduce fuzzy and surprising characters.

pirolen commented 3 years ago

Gut feeling: to render the separator as space would be confusing for humans (e.g. evaluators of OCR extraction), because there is visually no space before/after the sub-/superscripted text (so rather also no hidden character to add to the offset and string length count).

Would it be feasible to regard/treat sub-/superscripted text as a specific type of punctuation? :-o Semantically it seems related to it (=it aids and directs the reading of the text). But just like the soft break, its behavior could be configurable?

proycon commented 3 years ago

Maarten and I were thinking in the same direction. Another candidate would be the Zero-Width-space. It's up to ucto and such to interpret that as a token separator.

Just to prevent confusion: I definitely don't think there should be zero-width spaces in the FoLiA itself. At most the text extraction function could output one where a token boundary must occur and no space happens, but that would have to be an opt-in feature. And as you said, I foresee issues with the offsets then. So I see where you are going with the separated attribute.

Fundamentally, the issue we're discussing now is a tokeniser issue rather than a FoLiA representation problem (so I see it as distinct from the original issue in this thread). The question is how the tokeniser decides what to tokenize and what not:

What you're essentially suggesting with the separated attribute is to encode extra information in the FoLiA that gives the tokeniser extra information.
An alternative would be to provide the information directly to the tokeniser as a parameter, something like: treat all t-style's with class superscript as separate tokens. (an FQL query might work here but libfolia doesn't implement that and that'd be too much work)

Text content on higher levels is by definition untokenised (so I'm a bit skeptical about adding tokenisation details in there), text content on the word/token level is by definition tokenised. The issue is of course getting from A to B here (which is the task of the tokeniser).

I'm following the line of the extra attribute Ko suggested. But I'm trying to think in a generic way if we expand FoLiA for this: we're essentially encoding some extra 'cue' in the FoLiA to help another tool do its job, and such a cue is needed because the information is not present in the FoLiA yet, or is too complexly encoded. This might be useful for other use cases than the one we are considering now.

What if we introduce a generic tag attribute that allows people to tag any FoLiA element, the value being a space-delimited list of some user-defined vocabulary (because it is tool-specific)? We could then use a value like token or separate for the tokenisation cues:

<t>
  <t-str>item1<t-style tag="token"><feat class="superscript" subset="font_typeface"/>2</t-style></t-str><t-str>something</t-str>
</t>

It's essentially what Ko suggested but stretched to be more generic, it gives some processor-specific flexibility. You can envision tool A setting particular tags, and tool B acting on them.
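To sketch how a "tool B" might act on such a cue: the snippet below is purely hypothetical. The tag attribute is only a proposal at this point, the functions text_with_cues and cue_tokenise are invented for this comment, and ucto does not work this way. I also put a space between the two <t-str> elements so the word boundary there is unambiguous.

```python
import re
import xml.etree.ElementTree as ET

SEP = "\x1f"  # private marker character, never written back to the FoLiA

FRAGMENT = (
    '<t><t-str>item1<t-style tag="token">'
    '<feat class="superscript" subset="font_typeface"/>2</t-style></t-str>'
    " <t-str>something</t-str></t>"
)


def text_with_cues(elem):
    """Collect text, wrapping elements tagged 'token' in marker characters."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(text_with_cues(child))
        parts.append(child.tail or "")
    inner = "".join(parts)
    # The tag value is a space-delimited list, so split before checking.
    if "token" in elem.get("tag", "").split():
        return SEP + inner + SEP
    return inner


def cue_tokenise(xml):
    """Split on whitespace and on the boundaries of tagged elements."""
    marked = text_with_cues(ET.fromstring(xml))
    return [tok for tok in re.split(r"[\s\x1f]+", marked) if tok]


print(cue_tokenise(FRAGMENT))
```

The markers only exist inside the tokeniser in this sketch, so the text serialisation and the offsets in the FoLiA stay what they were.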

Note: I opened a new issue for this proposal, see below

proycon / folia

Problems with leading/trailing whitespace in text content #88