hungrybluedev / xlsx

V library to add support for Microsoft Excel files.
MIT License
18 stars 2 forks source link

The module fails parsing XLSX files, produced by LibreOffice #5

Closed spytheman closed 9 months ago

spytheman commented 9 months ago

I opened LibreOffice Calc, typed abc in the first cell, then saved the file as: image

It produced this file: abc.xlsx

I then tried to open it with a V program containing:

import os
import xlsx

fn main() {
        path := os.args[1]!
        workbook := xlsx.Document.from_file(path)!
        println('[info] Successfully loaded workbook with ${workbook.sheets.len} worksheets.')

        println('\nAvailable sheets:')
        // sheets are stored as a map, so we can iterate over the keys.
        for index, key in workbook.sheets.keys() {
                println('${index + 1}: "${key}"')
        }

        // Excel uses 1-based indexing for sheets.
        sheet1 := workbook.sheets[1]

        // Note that the Cell struct is able to the CellType.
        // So we can have an idea of what to expect before getting all
        // the data as a dataset with just string data.
        dataset := sheet1.get_all_data()!
        count := dataset.row_count()
        println('\n[info] Sheet 1 has ${count} rows.')
        headers := dataset.raw_data[0]

        println('\nThe headers are:')
        for index, header in headers {
                println('${index + 1}. ${header}')
        }
}

The result was:

V panic: result not set (Failed to parse sheet file: /tmp/xlsx-9e6dea3823/xl/worksheets/sheet1.xml)
v hash: 4f742ad
/tmp/v_1000/open_xlsx.01HPS4CXG8VMVQ5PENTVR2YZ4Z.tmp.c:9911: at _v_panic: Backtrace
/tmp/v_1000/open_xlsx.01HPS4CXG8VMVQ5PENTVR2YZ4Z.tmp.c:9876: by panic_result_not_set
/tmp/v_1000/open_xlsx.01HPS4CXG8VMVQ5PENTVR2YZ4Z.tmp.c:25835: by main__main
/tmp/v_1000/open_xlsx.01HPS4CXG8VMVQ5PENTVR2YZ4Z.tmp.c:26018: by main
#1 16:16:19 ᛋ master /v/oo❱

The xml in that tmp folder is:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:xdr="http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing" xmlns:x14="http://schemas.microsoft.com/office/spreadsheetml/2009/9/main" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"><sheetPr filterMode="false"><pageSetUpPr fitToPage="false"/></sheetPr><dimension ref="A1"/><sheetViews><sheetView showFormulas="false" showGridLines="true" showRowColHeaders="true" showZeros="true" rightToLeft="false" tabSelected="true" showOutlineSymbols="true" defaultGridColor="true" view="normal" topLeftCell="A1" colorId="64" zoomScale="100" zoomScaleNormal="100" zoomScalePageLayoutView="100" workbookViewId="0"><selection pane="topLeft" activeCell="A2" activeCellId="0" sqref="A2"/></sheetView></sheetViews><sheetFormatPr defaultColWidth="11.53515625" defaultRowHeight="12.8" zeroHeight="false" outlineLevelRow="0" outlineLevelCol="0"></sheetFormatPr><sheetData><row r="1" customFormat="false" ht="12.8" hidden="false" customHeight="false" outlineLevel="0" collapsed="false"><c r="A1" s="0" t="s"><v>0</v></c></row></sheetData><printOptions headings="false" gridLines="false" gridLinesSet="true" horizontalCentered="false" verticalCentered="false"/><pageMargins left="0.7875" right="0.7875" top="1.05277777777778" bottom="1.05277777777778" header="0.7875" footer="0.7875"/><pageSetup paperSize="9" scale="100" firstPageNumber="1" fitToWidth="1" fitToHeight="1" pageOrder="downThenOver" orientation="portrait" blackAndWhite="false" draft="false" cellComments="none" useFirstPageNumber="true" horizontalDpi="300" verticalDpi="300" copies="1"/><headerFooter differentFirst="false" differentOddEven="false"><oddHeader>&amp;C&amp;&quot;Times New Roman,Regular&quot;&amp;12&amp;A</oddHeader><oddFooter>&amp;C&amp;&quot;Times New Roman,Regular&quot;&amp;12Page &amp;P</oddFooter></headerFooter></worksheet>
MedLabs commented 9 months ago

first I tried with an Excel file from a gov administration, I'm pretty sure they use Microsoft Excel, here is the link of the file: https://spot.tn/media/articles/LISTE%20DES%20MEDICAMENTS%20CLASSES%20EN%20V%20E%20I%20COUVERTS%20PAR%20LE%20REGIME%20DE%20BASE%2022-01-2024.xls

(notice it's a .xls)

Then I tested with this file that I created in libreOffice Calc and saved as Excel2007-365 (.xlsx) same problem test.xlsx

hungrybluedev commented 9 months ago

I'll take a look at it soon. I was working on the write feature anyway.

@spytheman Can I ask you to see if the encoding.xml parser is able to parse the XML?

JalonSolov commented 9 months ago

I just tried it with spy's file and this tiny example V file, which I named xml.v:

import encoding.xml

doc := xml.XMLDocument.from_file('abc.xlsx')!
println(doc)

Result was

[jalon@7950x ~]$ v run xml.v
V panic: result not set (Expecting a prolog or root node starting with "<".)
v hash: c256013
/tmp/v_1000/xml.01HPSFBRMYX9Q01QSGS387957J.tmp.c:8856: at _v_panic: Backtrace
/tmp/v_1000/xml.01HPSFBRMYX9Q01QSGS387957J.tmp.c:8821: by panic_result_not_set
/tmp/v_1000/xml.01HPSFBRMYX9Q01QSGS387957J.tmp.c:20743: by main__main
/tmp/v_1000/xml.01HPSFBRMYX9Q01QSGS387957J.tmp.c:20856: by main
[jalon@7950x ~]$

However, in looking at the file - it's compressed. First 2 bytes of the file are PK, meaning it is zip'd.

If I try it with the uncompressed XML, it works fine.

spytheman commented 9 months ago

Yes, parsing the uncompressed XML is ok:

import encoding.xml

content := '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:xdr="http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing" xmlns:x14="http://schemas.microsoft.com/office/spreadsheetml/2009/9/main" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"><sheetPr filterMode="false"><pageSetUpPr fitToPage="false"/></sheetPr><dimension ref="A1"/><sheetViews><sheetView showFormulas="false" showGridLines="true" showRowColHeaders="true" showZeros="true" rightToLeft="false" tabSelected="true" showOutlineSymbols="true" defaultGridColor="true" view="normal" topLeftCell="A1" colorId="64" zoomScale="100" zoomScaleNormal="100" zoomScalePageLayoutView="100" workbookViewId="0"><selection pane="topLeft" activeCell="A2" activeCellId="0" sqref="A2"/></sheetView></sheetViews><sheetFormatPr defaultColWidth="11.53515625" defaultRowHeight="12.8" zeroHeight="false" outlineLevelRow="0" outlineLevelCol="0"></sheetFormatPr><sheetData><row r="1" customFormat="false" ht="12.8" hidden="false" customHeight="false" outlineLevel="0" collapsed="false"><c r="A1" s="0" t="s"><v>0</v></c></row></sheetData><printOptions headings="false" gridLines="false" gridLinesSet="true" horizontalCentered="false" verticalCentered="false"/><pageMargins left="0.7875" right="0.7875" top="1.05277777777778" bottom="1.05277777777778" header="0.7875" footer="0.7875"/><pageSetup paperSize="9" scale="100" firstPageNumber="1" fitToWidth="1" fitToHeight="1" pageOrder="downThenOver" orientation="portrait" blackAndWhite="false" draft="false" cellComments="none" useFirstPageNumber="true" horizontalDpi="300" verticalDpi="300" copies="1"/><headerFooter differentFirst="false" differentOddEven="false"><oddHeader>&amp;C&amp;&quot;Times New Roman,Regular&quot;&amp;12&amp;A</oddHeader><oddFooter>&amp;C&amp;&quot;Times New Roman,Regular&quot;&amp;12Page &amp;P</oddFooter></headerFooter></worksheet>'
dump(content.len)
x := xml.XMLDocument.from_string(content)!
y := x.validate()!
assert y.str().len > 0

prints [z.v:5] content.len: 2042, and the assertion passes.

spytheman commented 9 months ago

The XLSX file's content is:

#0 19:46:40 ᛋ master /v/oo❱unzip -l ~/Documents/abc.xlsx
Archive:  /home/delian/Documents/abc.xlsx
  Length      Date    Time    Name
---------  ---------- -----   ----
      571  2024-02-16 14:10   _rels/.rels
      878  2024-02-16 14:10   xl/workbook.xml
     4451  2024-02-16 14:10   xl/styles.xml
     2042  2024-02-16 14:10   xl/worksheets/sheet1.xml
      549  2024-02-16 14:10   xl/_rels/workbook.xml.rels
      199  2024-02-16 14:10   xl/sharedStrings.xml
      731  2024-02-16 14:10   docProps/core.xml
      378  2024-02-16 14:10   docProps/app.xml
     1367  2024-02-16 14:10   [Content_Types].xml
---------                     -------
    11166                     9 files
#0 19:48:31 ᛋ master /v/oo❱
spytheman commented 9 months ago

For comparison, the content of the .xlsx file in the examples (that works) is:

#0 19:48:31 ᛋ master /v/oo❱unzip -l ~/.vmodules/xlsx/examples/01_marksheet/data.xlsx
Archive:  /home/delian/.vmodules/xlsx/examples/01_marksheet/data.xlsx
  Length      Date    Time    Name
---------  ---------- -----   ----
     1296  1980-01-01 00:00   [Content_Types].xml
      588  1980-01-01 00:00   _rels/.rels
     2295  1980-01-01 00:00   xl/workbook.xml
      831  1980-01-01 00:00   xl/_rels/workbook.xml.rels
     5088  1980-01-01 00:00   xl/worksheets/sheet1.xml
     8390  1980-01-01 00:00   xl/theme/theme1.xml
     2264  1980-01-01 00:00   xl/styles.xml
      822  1980-01-01 00:00   xl/sharedStrings.xml
      496  1980-01-01 00:00   xl/calcChain.xml
      623  1980-01-01 00:00   docProps/core.xml
      784  1980-01-01 00:00   docProps/app.xml
---------                     -------
    23477                     11 files
#0 19:49:09 ᛋ master /v/oo❱
spytheman commented 9 months ago

What fails is the sheet := Sheet.from_doc(sheet_name, sheet_doc, shared_strings) call. The error in the panic is misleading (it duplicates the error for parsing the XML file).

The call above sheet_doc := xml.XMLDocument.from_file(sheet_path) or { passes fine @hungrybluedev . I think that at the very least, the errors should be more detailed and different (including the err from the lower layers, not silently dropping it, like now).

spytheman commented 9 months ago

[/home/delian/.vmodules/xlsx/src/parser.v:104] err: Row does not include span.

spytheman commented 9 months ago

[/home/delian/.vmodules/xlsx/src/parser.v:142] row.attributes: {'r': '1', 'customFormat': 'false', 'ht': '12.8', 'hidden': 'false', 'customHeight': 'false', 'outlineLevel': '0', 'collapsed': 'false'}

spytheman commented 9 months ago

The "row" in question seems to be from this piece of XML:

  <sheetData>
    <row r="1" customFormat="false" ht="12.8" hidden="false" customHeight="false" outlineLevel="0" collapsed="false">
      <c r="A1" s="0" t="s">
        <v>
          0
        </v>
      </c>
    </row>
  </sheetData>
spytheman commented 9 months ago

Rows from the working file look like this:

  <sheetData>
    <row r="1" spans="1:9" s="1" customFormat="1" x14ac:dyDescent="0.3">
      <c r="A1" s="1" t="s">
        <v>
          0
        </v>
      </c>
      ...
    </row>
    ...
  <sheetData>
spytheman commented 9 months ago

I could get it to work with this local patch:

#0 20:23:09 ᛋ fix_example ~/.vmodules/xlsx❱git diff
diff --git src/parser.v src/parser.v
index 79ae736..2ead1ce 100644
--- src/parser.v
+++ src/parser.v
@@ -135,7 +135,7 @@ fn Sheet.from_doc(name string, doc xml.XMLDocument, shared_strings []string) !Sh
                row_label := row.attributes['r'] or { return error('Row does not include location.') }
                row_index := row_label.int() - 1

-               span_string := row.attributes['spans'] or { return error('Row does not include span.') }
+               span_string := row.attributes['spans'] or { '1:1' }

                span := span_string.split(':').map(it.int())
                cell_count := span[1] - span[0] + 1
@@ -185,7 +185,6 @@ fn Sheet.from_doc(name string, doc xml.XMLDocument, shared_strings []string) !Sh
                        cells: cells
                }
        }
-
        return Sheet{
                name: name
                rows: rows
diff --git src/query.v src/query.v
index f3cfc5f..3b4de77 100644
--- src/query.v
+++ src/query.v
@@ -16,9 +16,6 @@ pub fn (sheet Sheet) get_all_data() !DataFrame {
 }

 pub fn (sheet Sheet) get_data(top_left Location, bottom_right Location) !DataFrame {
-       if top_left == bottom_right {
-               return DataFrame{}
-       }
        if top_left.row >= sheet.rows.len {
                return error('top_left.row out of range')
        }
#0 20:23:10 ᛋ fix_example ~/.vmodules/xlsx❱

... and then:

import xlsx

fn main() {
    workbook := xlsx.Document.from_file('abc.xlsx')!
    println('[info] Successfully loaded workbook with ${workbook.sheets.len} worksheets.')
    println('\nAvailable sheets:')
    for index, key in workbook.sheets.keys() {
        println('   sheet ${index+1} has key: "${key}"')
    }
    sheet1 := workbook.sheets[1]
    dataset := sheet1.get_all_data()!
    count := dataset.row_count()
    println('\n[info] Sheet 1 has ${count} rows.')
    headers := dataset.raw_data[0]
    println('\nThe headers are:')
    for index, header in headers {
        println('${index + 1}. ${header}')
    }
}

This however makes one of the tests of the module to fail, and I do not know the module enough to fix the error:

 FAIL  [2/5] C:  752.3 ms, R:     6.664 ms /home/delian/.vmodules/xlsx/spec/01_empty_file/empty_test.v
         retry: 1
      comp_cmd: '/space/v/oo/v' -skip-running   -o '/tmp/v_1000/tsession_7f6a3f911b80_201668685110535/01HPSJM820B18X879M4Y3BJTY8/empty_test' '/home/delian/.vmodules/xlsx/spec/01_empty_file/empty_test.v'
       run_cmd: '/tmp/v_1000/tsession_7f6a3f911b80_201668685110535/01HPSJM820B18X879M4Y3BJTY8/empty_test'
failure code: 1; foutput.len: 393; failure output:
---------------------------------------------------------------------------------------------------- retry: 0 ; max_retry: 0 ; r.exit_code: 1 ; trimmed_output.len: 225
/home/delian/.vmodules/xlsx/spec/01_empty_file/empty_test.v:10: ✗ fn test_empty failed propagation with error: top_left.row out of range
   10 |         assert sheet.get_all_data()! == xlsx.DataFrame{}

To me, it seems logical, that a sheet, that has a single cell filled, should be supported 🤔 . Perhaps it is just an off by one error in one of the checks?

spytheman commented 9 months ago

This patch seems to work better (the tests pass too):

#0 20:33:56 ᛋ fix_example ~/.vmodules/xlsx❱git diff
diff --git src/parser.v src/parser.v
index 79ae736..d4da2ad 100644
--- src/parser.v
+++ src/parser.v
@@ -135,7 +135,7 @@ fn Sheet.from_doc(name string, doc xml.XMLDocument, shared_strings []string) !Sh
                row_label := row.attributes['r'] or { return error('Row does not include location.') }
                row_index := row_label.int() - 1

-               span_string := row.attributes['spans'] or { return error('Row does not include span.') }
+               span_string := row.attributes['spans'] or { '0:0' }

                span := span_string.split(':').map(it.int())
                cell_count := span[1] - span[0] + 1
@@ -185,7 +185,6 @@ fn Sheet.from_doc(name string, doc xml.XMLDocument, shared_strings []string) !Sh
                        cells: cells
                }
        }
-
        return Sheet{
                name: name
                rows: rows
diff --git src/query.v src/query.v
index f3cfc5f..b53c996 100644
--- src/query.v
+++ src/query.v
@@ -16,7 +16,7 @@ pub fn (sheet Sheet) get_all_data() !DataFrame {
 }

 pub fn (sheet Sheet) get_data(top_left Location, bottom_right Location) !DataFrame {
-       if top_left == bottom_right {
+       if bottom_right.row == 0 && bottom_right.row == 0 && sheet.rows.len == 0 {
                return DataFrame{}
        }
        if top_left.row >= sheet.rows.len {
#0 20:33:59 ᛋ fix_example ~/.vmodules/xlsx❱
spytheman commented 9 months ago

The fix is in https://github.com/hungrybluedev/xlsx/pull/6 .

spytheman commented 9 months ago

The PR also fixes opening the test.xlsx file, that @MedLabs uploaded.

#0 20:42:58 ᛋ master /v/oo❱v run open_xlsx.v ~/Downloads/test.xlsx
[info] Successfully loaded workbook with 1 worksheets.

Available sheets:
   sheet 1 has key: "1"

[info] Sheet 1 has 4 rows.

The headers are:
1. name
2. description
#0 20:43:10 ᛋ master /v/oo❱
spytheman commented 9 months ago

The .xls file is imho out of scope, my file program identifies it as: Composite Document File V2 Document, Little Endian, Os: Windows, Version 5.1, Code page: 1256, while for the .xlsx, it shows Microsoft Excel 2007+.

It is definitely not a ZIP archive, containing XML.

JalonSolov commented 9 months ago

xls and xlsx are 2 very different formats...

"The primary difference between XLS and XLSX files lies in the underlying file format.

XLS files are based on the Binary Interchange File Format (BIFF) and store information in binary format.

On the other hand, XLSX files are XML-based with a more open structure, making them compatible with not only Microsoft Windows but also MacOS, iOS, and Android platforms."

https://spreadsheetplanet.com/xls-vs-xlsx-files/