mattkerlogue / tidyods

Read cells from ODS files in R
https://mattkerlogue.github.io/tidyods/
Other
4 stars 1 forks source link

Cliffhanger sentence in README #5

Closed TimTaylor closed 1 year ago

TimTaylor commented 1 year ago

There's a bit of a cliffhanger in the README when explaining what quick argument does

https://github.com/mattkerlogue/tidyods/blob/589704d106483d31476f584fcbfafa5d1791d7a7/README.Rmd#L105

mattkerlogue commented 1 year ago

Thanks Tim. Have corrected, but a more detailed explainer below that I probably need to document somewhere else in due course.

Long-story-short: XML is generally awful, ODS is specifically awful XML.

quick essentially does two things: (i) extracts a minimal set of cell info (sheet, row, col, value_type and value), (ii) it also bypasses complex text processing.

The main efficiency comes from (i) as it's extracting only a small number of specific named attributes from a cell which can be used directly in the construction of the output tibble. Whereas in the "full" process the code extracts all attributes associated with a cell, which is returned as a list and there is a processing overhead both from that extraction by {xml2} and the coercion of that list into a tibble.

On (ii), I think my refactored approach to text processing is relatively quick, but I've not tried it with sheets with a lot of complex text yet. There are three things at play here: (a) Microsoft Office and LibreOffice save repeated whitespace in different ways (I've not checked Google), (b) cell content with explicit new line breaks, and (c) comments/annotations.

{xml2} has a function xml_text() which allows you to extract the text content from within an XML tag, quick uses this function on the extracted cell XML nodes as-is. However, there are several oddities that arise through a combination of the specific peculiarities of {xml2} and the ODF XML standard. Below are four examples of ODS XML, a multi-space and multi-paragraph cell and a cell with a comment from Excel and LibreOffice.

<!--MS Excel: multi-space and multi-paragraph text-->
<table:table-cell office:value-type="string" table:style-name="ce2">
  <text:p>Cells with <text:s/>repeated <text:s text:c="2"/>spaces</text:p>
  <text:p>And multiple <text:s text:c="3"/>lines</text:p>
</table:table-cell>

<!--LibreOffice: multi-space and multi-paragraph text-->
<table:table-cell office:value-type="string" calcext:value-type="string">
  <text:p>Cells with  repeated   spaces</text:p>
  <text:p>And multiple    lines</text:p>
</table:table-cell>

<!--MS Excel: cell with comment-->
<table:table-cell office:value-type="string" table:style-name="ce2">
  <office:annotation draw:style-name="a14" svg:x="2.13541666666667in" 
  svg:y="2.15625in" svg:width="1.08333333333333in" svg:height="0.75in">
    <dc:creator>Microsoft Office User</dc:creator>
    <text:p>
      <text:span text:style-name="T1">Test comment</text:span>
    </text:p>
  </office:annotation>
  <text:p>Cell with new Excel comment</text:p>
</table:table-cell>

<!--LibreOffice: cell with comment-->
<table:table-cell office:value-type="string" calcext:value-type="string">
  <office:annotation draw:style-name="gr1" draw:text-style-name="P2"
  svg:width="2.899cm" svg:height="1.799cm" svg:x="6.62cm" svg:y="0.451cm"
  draw:caption-point-x="-0.61cm" draw:caption-point-y="0.462cm">
    <dc:date>2022-06-16T00:00:00</dc:date>
    <text:p text:style-name="P1">
      <text:span text:style-name="T1">Test comment</text:span>
    </text:p>
  </office:annotation>
  <text:p>Cell with comment</text:p>
</table:table-cell>

Applying xml2::xml_text() to these cells gives the following output

[1] "Cells with repeated spacesAnd multiple lines"         # MS
[2] "Cells with repeated   spacesAnd multiple   lines"     # LO
[3] "Microsoft Office UserTest commentCell with comment"   # MS
[4] "2022-06-16T00:00:00Test commentCell with comment"     # LO

Firstly, Microsoft have implemented the <text:s> tag of the ODF specification for handling white space replication whereas LibreOffice have not. This is not something that {xml2} is programmed for handling as its a peculiarity of the ODF specification, the first space is included as a standard white space character, but further white space characters are covered by the <text:s> tag and if more than one additional space its c attribute. Though this only becomes storage efficient once you're handling more than 23 repeated white space characters. LibreOffice have ignored this and encode repeating spaces as a standard whitespace character.

Secondly, {xml2} simply merges all text tags within a cell together without any separating space, or in the case of paragraphs a new line character, while there is an argument to xml2::xml_text() to trim trailing whitespace there is no argument that allows you to insert a joining separator (such as the sep or collapse arguments of paste0).

Finally, the text contained within the annotations are also captured as they are children of the cell, and are subject to the same peculiarities above. This could be improved by extracting just the immediate text:p children of a cell, but this requires a call to {xml2} to extract the relevant paths and as such will add a performance bottleneck.

The "full" process takes a more complex approach to text processing. Having identified the immediate text:p children of a cell it decomposes the XML into its various fragments, generates repeated whitespace and stitches together individual paragraphs and then stitches paragraphs together with a newline character to produce the cell content as expected.

[1] "Cells with  repeated   spaces\nAnd multiple    lines" # MS
[2] "Cells with  repeated   spaces\nAnd multiple    lines" # LO
[3] "Cell with comment" # MS
[4] "Cell with comment" # LO

In essence if you know that your sheet(s) meet the following criteria then quick is going to be ok:

mattkerlogue commented 1 year ago

Adding a sep argument to {xml2} has already been determined as out of scope.