proycon / foliatools

A number of command-line tools for working with FoLiA (Format for Linguistic Annotation). Includes validators, converters, visualisers, and more.
GNU General Public License v3.0

How to handle empty cells in folia2txt and FoLiA-2text #41

Closed kosloot closed 5 months ago

kosloot commented 3 years ago

Both folia2txt and FoLiA-2text do not handle empty <cell> nodes in a table correctly, imho. I consider this to be a bug.

see this file: cell_problem.xml.txt

It has 2 rows with 3 cells each, but the upper-left cell is empty. Both folia2txt and FoLiA-2text output:

Kop 2 | Kop 3
Rij 2 | Veld 2 | Veld 3

Which is wrong. The correct result should be:

 | Kop 2 | Kop 3
Rij 2 | Veld 2 | Veld 3

(leaving proper layout to a later moment :P )

Entering an empty text in the upper-left cell is impossible, as empty strings are forbidden in FoLiA. What should we do? Adapt the programs to handle empty cells? As a last resort we could add some marker in the cell that it is empty, and use that.
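
To make the expected output concrete, here is a minimal Python sketch, independent of either tool. The function name `render_row` and the use of `None` to mark a textless cell are assumptions for illustration only; they are not part of foliatools or libfolia.

```python
from typing import Optional, Sequence


def render_row(cells: Sequence[Optional[str]], delimiter: str = " | ") -> str:
    """Join cell texts, keeping a slot for every cell, including textless ones."""
    # None marks a cell without text; it still occupies a column,
    # so it contributes an empty string instead of being skipped.
    return delimiter.join(text if text is not None else "" for text in cells)


print(render_row([None, "Kop 2", "Kop 3"]))
print(render_row(["Rij 2", "Veld 2", "Veld 3"]))
```

The point is only that the delimiter is emitted once per cell boundary, whether or not the cell has text, so the columns of the two rows stay aligned.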

kosloot commented 3 years ago

Sidenote: empty <row/> elements are also totally ignored. This is a lesser problem, but still....

proycon commented 3 years ago

Agreed, this sounds like a bug

kosloot commented 3 years ago

I think this is an example of a more fundamental question: Do Structure elements carry some textual information, even when they don't contain any text?

For a <cell> the answer seems YES. But are there other examples? Empty Paragraphs? Empty Sentences? I don't know.

pirolen commented 3 years ago

Maybe also: empty Definition of an Entry?

kosloot commented 3 years ago

Maybe also: empty Definition of an Entry?

I'm not sure what you are suggesting here. These are not FoLiA constructions.

pirolen commented 3 years ago

I meant this, does it make sense? https://foliapy.readthedocs.io/en/latest/_autosummary/folia.main.Definition.html

pirolen commented 3 years ago

Maybe also: https://foliapy.readthedocs.io/en/latest/_autosummary/folia.main.ListItem.html

kosloot commented 3 years ago

I stand corrected. Entry and Definition ARE indeed FoLiA elements. Not that I have seen a lot of examples with those. Indeed these seem possible candidates for special treatment when (un)intentionally left empty. @proycon, we should give this some thought.

proycon commented 3 years ago

I think this is an example of a more fundamental question: Do Structure elements carry some textual information, even when they don't contain any text?

For a <cell> the answer seems YES. But are there other examples? Empty Paragraphs? Empty Sentences? I don't know.

This indeed seems the fundamental question which we hadn't considered earlier. The only three I can think of that fit this are <cell> and perhaps <row> and maybe even <item>. We would need an additional mechanism to accommodate this in the libraries.

kosloot commented 3 years ago

We would need an additional mechanism to accommodate this in the libraries.

First thought: it has no <t> children, isn't that enough of a clue? But of course those could be embedded in <p> or such. And maybe there can be (non authoritative) text in other children too? Maybe it is more clear to introduce an "isempty" attribute? or "textless" or...

And would such an attribute be on the <t>, like <t isempty="1" />, or on its (direct?) parent: <cell isempty="1"/> or <cell><p isempty="1"/></cell>?

Making it a property on <t> is maybe the easiest way.
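
The "no <t> children" clue can be checked without any new attribute, as long as the search goes through nested elements such as <p>. Below is a rough Python sketch using only the standard library's `xml.etree.ElementTree`, not the FoLiA libraries. The helper name `has_text` is hypothetical, and the sketch ignores the question of non-authoritative text in other children.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"


def has_text(element: ET.Element, textclass: str = "current") -> bool:
    """Return True if any <t> descendant, at any depth, has the given class."""
    for t in element.iter(f"{{{FOLIA_NS}}}t"):
        # A <t> without a class attribute belongs to the 'current' class.
        if t.get("class", "current") == textclass:
            return True
    return False


snippet = f"""
<row xmlns="{FOLIA_NS}">
  <cell/>
  <cell><p><t>Kop 2</t></p></cell>
  <cell><p><t class="eng">Header 3</t></p></cell>
</row>
""".strip()

cells = list(ET.fromstring(snippet))
print([has_text(c) for c in cells])
print([has_text(c, "eng") for c in cells])
```

Note that emptiness depends on the text class: the third cell is textless for 'current' but not for 'eng', which a single isempty attribute on the cell could not express.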

kosloot commented 2 years ago

I constructed a more elaborate example to illustrate some problems:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="tabel" generator="libfolia-v2.10" version="2.5.0">
  <metadata type="native">
    <annotations>
      <paragraph-annotation set="FoLiA-abby-set"/>
      <division-annotation set="FoLiA-abby-set"/>
      <string-annotation set="FoLiA-abby-set"/>
      <table-annotation/>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
    </annotations>
  </metadata>
  <text xml:id="tabel.text">
    <div xml:id="tabel.text.div.1">
      <table xml:id="tabel.">
        <row xml:id="tabel.row.1">
          <cell xml:id="tabel.row.1.cell.1">
            <p xml:id="tabel.row.1.cell.1.p.1">
              <t>rij 1 veld 1</t>
            </p>
          </cell>
          <cell xml:id="tabel.row.1.cell.2">
            <p xml:id="tabel.row.1.cell.2.p.1">
              <t>rij 1 veld 2</t>
              <t class="eng">row 1 field 2</t>
            </p>
          </cell>
          <cell xml:id="tabel.row.1.cell.3">
            <p xml:id="tabel.row.1.cell.3.p.1">
              <t>rij 1 veld 3</t>
              <t class="eng">row 1 field 3</t>
            </p>
          </cell>
        </row>
        <row xml:id="tabel.row.5">
          <cell xml:id="tabel.row.5.cell.1">
            <p xml:id="tabel.row.5.cell.1.p.1">
              <t>rij 2 veld 1</t>
              <t class="eng">row 2 field 1</t>
            </p>
          </cell>
          <cell xml:id="tabel.row.5.cell.2">
            <p xml:id="tabel.row.5.cell.2.p.1">
              <t>rij 2 veld 2 a</t>
              <t class="eng">row 2 field 2 a</t>
            </p>
            <p xml:id="tabel.row.5.cell.2.p.2">
              <t>rij 2 veld 2 b</t>
              <t class="eng">row 2 field 2 b</t>
            </p>
          </cell>
          <cell xml:id="tabel.row.5.cell.3">
            <p xml:id="tabel.row.5.cell.3.p.1">
              <t>rij 2 veld 3</t>
              <t class="eng">row 2 field 3</t>
            </p>
          </cell>
        </row>
      </table>
    </div>
  </text>
</FoLiA>

This is a table with 2 rows of 3 cells. Every cell has text in the 'current' textclass, and all but the first also in the 'eng' textclass.

When using folia2txt or FoLiA-2text to extract the text, you will get:

using textclass 'current':

rij 1 veld 1 | rij 1 veld 2 | rij 1 veld 3
rij 2 veld 1 | rij 2 veld 2 a

rij 2 veld 2 b | rij 2 veld 3

using textclass 'eng':

row 1 field 2 | row 1 field 3
row 2 field 1 | row 2 field 2 a

row 2 field 2 b | row 2 field 3

There are 2 major issues here:

  1. The already mentioned incorrect handling of the empty text for the 'eng' case
  2. the dubious handling of the <p> nodes as paragraphs in the context of the document. IMHO they should be handled as a local element inside the cell. But this might be very tricky...

I did some hacking in libfolia for the Cell->text() function, and then I can produce this:

using textclass 'current':

rij 1 veld 1 | rij 1 veld 2 | rij 1 veld 3
rij 2 veld 1 | rij 2 veld 2 a rij 2 veld 2 b | rij 2 veld 3

using textclass 'eng':

  | row 1 field 2 | row 1 field 3
row 2 field 1 | row 2 field 2 a row 2 field 2 b | row 2 field 3

Maybe not ideal, but imho more readable and usable. But I am unsure about the implications, for instance when a <t> was already assigned to the <text> node. That would now change. But wouldn't that be a problem already, as the delimiter '|' would have to be present in that text? That is quite artificial, and I have yet to see examples of tables where <t> nodes are filled at a higher level.
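
For readers who want to experiment with this rendering outside libfolia, the behaviour can be approximated in Python with the standard library alone. This is a sketch under assumptions, not the libfolia change itself: the helper names `cell_text` and `table_text` are made up, only <t> elements directly under a <p> in a cell are considered, and the sample is a trimmed version of the table above.

```python
import xml.etree.ElementTree as ET

NS = {"f": "http://ilk.uvt.nl/folia"}

SAMPLE = """
<table xmlns="http://ilk.uvt.nl/folia">
  <row>
    <cell><p><t>rij 1 veld 1</t></p></cell>
    <cell><p><t>rij 1 veld 2</t><t class="eng">row 1 field 2</t></p></cell>
  </row>
  <row>
    <cell><p><t>rij 2 veld 1</t><t class="eng">row 2 field 1</t></p></cell>
    <cell>
      <p><t>rij 2 veld 2 a</t><t class="eng">row 2 field 2 a</t></p>
      <p><t>rij 2 veld 2 b</t><t class="eng">row 2 field 2 b</t></p>
    </cell>
  </row>
</table>
""".strip()


def cell_text(cell, textclass):
    """Text of one cell: paragraphs joined by a space, a single space if textless."""
    parts = []
    for p in cell.findall("f:p", NS):
        for t in p.findall("f:t", NS):
            # A <t> without a class attribute belongs to the 'current' class.
            if t.get("class", "current") == textclass and t.text:
                parts.append(t.text.strip())
    return " ".join(parts) if parts else " "


def table_text(table, textclass="current"):
    """One output line per row, cells separated by ' | '."""
    lines = []
    for row in table.findall("f:row", NS):
        cells = row.findall("f:cell", NS)
        lines.append(" | ".join(cell_text(c, textclass) for c in cells))
    return "\n".join(lines)


table = ET.fromstring(SAMPLE)
print(table_text(table, "current"))
print(table_text(table, "eng"))
```

Joining the paragraphs of a cell with a space rather than a blank line keeps each row on one line, and the single-space placeholder keeps the first column present in the 'eng' output.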

Discussion very welcome

proycon commented 2 years ago

There are 2 major issues here:

  1. The already mentioned incorrect handling of the empty text for the 'eng' case
  2. the dubious handling of the <p> nodes as paragraphs in the context of the document. IMHO they should be handled as a local element inside the cell. But this might be very tricky...

I see the issues, yeah. The table rendering indeed has several weak spots and could use a redesign. The question, though, is to what extent that's worth it. Representing tables in plain text is problematic anyway and purely based on convention. You'd at least need to adhere to something like ReStructuredText or Markdown if you want it to be usable (but then you'd use a more specific FoLiA converter rather than one to plain text); for more advanced table representation you'd go for something like LaTeX or HTML. The existing folia2html converter should already handle these tables okay, even with empty cells and multiple paragraphs.

The table representation in FoLiA has its limitations too, it's deliberately simple and there's no multicell support for example. I'm not really willing to extend FoLiA with elaborate table support at this point.

I'm also not in favour of an isempty attribute on cells, it's kind of redundant. It's more up to the text rendering logic (or rather table rendering logic) to determine and draw it correctly; but this would of course complicate that logic and force it to look more at the context. Technically an empty cell is still that, a cell without text, for which it is fine if a NoSuchText error is raised when called.

kosloot commented 2 years ago

I agree that it is complicated :)

I'm also not in favour of an isempty attribute on cells, it's kind of redundant. It's more up to the text rendering logic (or rather table rendering logic) to determine and draw it correctly;

In fact this is what my provisional solution does. No 'isempty' or such is needed. I just catch the "NoSuchText" and replace it with a space. The already existing logic in text rendering will then automagically add the '|' separator (which is imho needed). SIDENOTE: I think those non-space delimiters are a very weak point in general, regarding text consistency, and probably also when offsets are used. What's the meaning of a text offset in a cell? Scary.
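
The catch-and-replace idea can be illustrated in a few lines of Python. Everything here is a stand-in: the `NoSuchText` exception and the `Cell` class are defined locally for the sketch and are not the real libfolia or FoLiAPy classes, and `cell_text_or_space` is a hypothetical helper name.

```python
class NoSuchText(Exception):
    """Local stand-in for the error a FoLiA library raises for a textless element."""


class Cell:
    """Toy cell that raises NoSuchText when it has no text."""

    def __init__(self, text=None):
        self._text = text

    def text(self):
        if self._text is None:
            raise NoSuchText("cell has no text")
        return self._text


def cell_text_or_space(cell):
    """Return the cell's text, or a single space when the cell is textless."""
    try:
        return cell.text()
    except NoSuchText:
        # The placeholder lets the existing join logic still emit a delimiter.
        return " "


row = [Cell(), Cell("Kop 2"), Cell("Kop 3")]
print(" | ".join(cell_text_or_space(c) for c in row))
```

The textless cell itself still raises when asked for its text directly; only the row-level rendering substitutes the placeholder.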

The other change I attempted, ignoring the default \n\n delimiter of <p> is just a matter of taste, as text consistency is checked already with newlines filtered out.

So, I still advocate to consider this small adaptation as it is a nice feature for the average user. In the current situation the result of folia2txt/FoLiA-2text is rather useless. IMHO this is just a somewhat better implementation of the convention.

kosloot commented 2 years ago

As this discussion is getting confusing, I created a separate issue about enhancing offset handling: https://github.com/proycon/foliapy/issues/29

Let's focus here on the BUG about representing empty cells.

Later on I intend also to create an issue on a cleaner display of structured cell content.

kosloot commented 2 years ago

Regarding the original issue, how to handle an empty Cell, I implemented the following solution in libfolia:

This seems a simple and workable solution to me, and it should IMNSHO also be implemented in FoLiAPy.

kosloot commented 2 years ago

In libfolia I also implemented handling of "empty" <row> elements.

We collect Text in the correct class from the children in the Row, honoring their delimiters.

proycon commented 5 months ago

This should, rather belatedly, be fixed by https://github.com/proycon/foliapy/commit/e981d19eccf81183b80561af139e5fbf20c3e3b2