tafia / calamine

A pure Rust Excel/OpenDocument SpreadSheets file reader: rust on metal sheets
MIT License
1.61k stars 155 forks source link

fixes #261 by supporting missing r attributes in XLSX files #330

Closed forgemo closed 1 year ago

forgemo commented 1 year ago

This PR closes #261

Problem location

Every xlsx file is basically a renamed zip archive. If you unpack the zip archive, you will find numerous files. The issue causing #261 is located in sheet xml files within the xl directory of such an archive.

grafik

Problem description

As the original poster of #261 mentioned, there are xlsx files out there that have missing r attributes. This attribute is usually found on every <row> element and <c> elements (which are cells) within a sheet xml.

It is critical to understand that it's also possible that only some r attributes are missing.

Example with r-attributes on every row and c element.

       <row r="1" spans="1:8">
            <c r="A1" s="24" t="s">
                <v>0</v>
            </c>
            <c r="B1" s="25"/>
            <c r="C1" s="25"/>
            <c r="D1" s="25"/>
            <c r="E1" s="25"/>
            <c r="F1" s="30"/>
            <c r="G1" s="30"/>
            <c r="H1" s="31"/>
        </row>
        <row r="2" spans="1:8">
            <c r="A2" s="26"/>
            <c r="B2" s="27"/>
            <c r="C2" s="27"/>
            <c r="D2" s="27"/>
            <c r="E2" s="27"/>
            <c r="F2" s="32"/>
            <c r="G2" s="32"/>
            <c r="H2" s="33"/>
        </row>

Example with missing r-attributes

       <row>
            <c s="19" t="s">
                <v>32</v>
            </c>
            <c s="9"/>
            <c s="9"/>
            <c s="9"/>
            <c s="9"/>
            <c s="10"/>
            <c s="10"/>
            <c s="36"/>
        </row>
        <row>
            <c s="31"/>
            <c s="8"/>
            <c s="8"/>
            <c s="8"/>
            <c s="8"/>
            <c s="16"/>
            <c s="16"/>
            <c s="48"/>
        </row>

Example row where only one cell has an r attribute.

        <row ht="12.75" customHeight="1">
            <c r="E61" t="s">
                <v>10</v>
            </c>
            <c s="28"/>
            <c s="44">
                <v>939</v>
            </c>
            <c s="49"/>
        </row>

The calamine error

Calamine currently isn't concerned with the row elements, and thus doesn't care if the rows don't have the r attribute. However, only one single missing r attribute on a c element, and it returns the following error.

XlsxError::CellRAttribute => write!(f, "Cell missing 'r' attribute"),

How to get a xlsx file with missing r-attributes?

It seems like xlsx files with missing r attributes aren't very common. Tools like Excel and LibreOffice can open them without any issues, though.

As the original poster of #261 suggested, that file was generated by Power BI.

In our case, it was also an automatically generated xlsx report from another tool. I, however, don't know what system or library was used to generate that report.

The test case

To understand how to handle missing r attributes correctly is not easy, since we couldn't find any documentation on it. We, however, had a little breakthrough when we found out that Excel does introduce the missing r attributes when opening and exporting the xlsx file again. (We used the online version of Excel)

This allowed us to compare the same sheet xml, with and without missing r-attributes, side by side.

To be able to share our xlsx file with you, we had to remove some confidential information. This was, however, not as easy as it sounds. We couldn't just edit and remove the information with Excel (online version), since Excel reintroduced the r-attributes when saving the file, rendering it useless for test purposes.

Therefore, we had to manually replace all confidential values with generic values in the sharedStrings.xml and also change some numeric values in the sheet1.xml.

Apart from that, the provided test xlsx file is in its layout and row/cell structure untouched.

We provided two test files:

Our issue_261 test reads both files with calamine and ensures that they both contain the same value correct values.

The requirements

Based on our analysis from comparing both files, we could derive the following requirements.

  1. The r-attribute of <row> elements contains the row index, starting with 1.
  2. If there is no r-attribute on a <row> element, its implicit index is set to the index of the previous <row> element, incremented by 1.
  3. If there is no previous <row> element for a row, the current row index is assumed to be 1.
  4. If there is no r-attribute on a <c> element, its implicit coordinate is set to the coordinate of the previous cell in the same row, incremented by one letter (e.g., A1 → B1, or AB4 → AC4 ).
  5. If there is no previous <c> element in a row for a cell, its coordinate is assumed to be A.., (e.g., A3 for first cell in a row with the index 3).
  6. If there is an existing r attribute on a <row> element, its row index is set to the r-attribute value.
  7. If there is an existing r attribute on a <c> element, its cell coordinate is set to the r-attribute value.

Fast forwarding

Based on the rules above, it is possible to fast-forward to rows / cells.

The following example shows, where a mix of implicit and explicit row indexes is used.

<row>(...)</row>              <!-- no r-attribute, no prevevious row: row index = 1 -->
<row>(...)</row>              <!-- no r-attribute, prevevious row index = 1: row index = 2 -->
<row>(...)</row>              <!-- no r-attribute, prevevious row index = 2: row index = 3 -->
<row>(...)</row>              <!-- no r-attribute, prevevious row index = 3: row index = 4 -->
<row>(...)</row>              <!-- no r-attribute, prevevious row index = 4: row index = 5 -->
<row r="7">(...)</row>    <!-- r-attribute exists, fast forward row to 7 : row index = 7 -->
<row>(...)</row>              <!-- no r-attribute, prevevious row index = 7: row index = 8 -->
<row r="10">(...)</row>  <!-- r-attribute exists, fast forward row to 10 : row index = 10 -->

The following example shows, where a mix of implicit and explicit cell coordinates is used.

<row r="66">
    <c> (...) </c>                  <!-- no r-attribute, no prevevious cell in row: cell coord = A66 -->
    <c/>                                <!-- no r-attribute, prevevious cell in row = A66: cell coord = B66 -->
    <c r="D66">(...)</c>     <!-- r-attribute exists, fast forward cell to D : cell coord = D66 -->
    <c r="F66" />                <!-- r-attribute exists, fast forward cell to F : cell coord = F66 -->
    <c/>                               <!-- no r-attribute, prevevious cell in row = F66: cell coord = G66 -->
</row>

Notable code changes in the PR

Parsing row r-attributes

The current implementation only contains a way to parse r-attributes of <c> tags, which always contain an alphabetical column component like the A in A12. The r attributes of <row> tags, however, only contain a numeric value for the row index.

Therefore, we refactored get_row_column to look like this.


/// Converts a text range name into its position (row, column) (0 based index).
/// If the row or column component in the range is missing, an Error is returned.
fn get_row_column(range: &[u8]) -> Result<(u32, u32), XlsxError> {
  /// calls `get_row_and_optional_column`
}

/// Converts a text row name into its position (0 based index).
/// If the row component in the range is missing, an Error is returned.
/// If the text row name also contains a column component, it is ignored.
fn get_row(range: &[u8]) -> Result<u32, XlsxError> {
 /// calls `get_row_and_optional_column`
}

/// Converts a text range name into its position (row, column) (0 based index).
/// If the row component in the range is missing, an Error is returned.
/// If the column component in the range is missing, an None is returned for the column.
fn get_row_and_optional_column(range: &[u8]) -> Result<(u32, Option<u32>), XlsxError> {
 /// (...)
}

This allowed us to reuse most of the existing, quite efficient, parsing code for r-attributes.

Underflow handling

While we were already refactoring that code part, we took the opportunity to improve the underflow handling by replacing saturating_sub with checked_sub. This allows us to handle underflows as actual errors, instead of just “hiding them”.

/// old
Ok((row.saturating_sub(1), col - 1))

/// new
 let row = row
        .checked_sub(1)
        .ok_or(XlsxError::RangeWithoutRowCompontent)?;
    Ok((row, col.checked_sub(1)))

Errors

Since the whole point of this PR is to handle missing r-attributes, the XlsxError::CellRAttribute error is no longer needed. Instead, we introduced two new errors for situations when parsing the r-attributes value leads to unexpected results.

When parsing the r-attribute of a <row> element, there must be a numeric row component. When parsing the r-attribute of a <c> element, there must be a numeric row and alphabetical column component.

XlsxError::RangeWithoutColumnComponent => {
    write!(f, "Range is missing the expected column component.")
}

XlsxError::RangeWithoutRowCompontent => {
    write!(f, "Range is missing the expected row component.")
}

Tracking columns / rows

The most important code change happened in the read_sheet function. We introduced two variables to keep track of the current row / column and adapted the loop according to the reverse-engineered requirements, listed above.

You can review the changed code in the last commit of this PR to see the implementation details. :)

Further notes

Sponsoring

The work on this PR was sponsored by Comsysto Reply.

Comsysto Reply - LOGO RGB_small

tafia commented 1 year ago

Thanks a lot for writing all this up!