Closed spytheman closed 9 months ago
first I tried with an Excel file from a gov administration, I'm pretty sure they use Microsoft Excel, here is the link of the file: https://spot.tn/media/articles/LISTE%20DES%20MEDICAMENTS%20CLASSES%20EN%20V%20E%20I%20COUVERTS%20PAR%20LE%20REGIME%20DE%20BASE%2022-01-2024.xls
(notice it's a .xls)
Then I tested with this file that I created in libreOffice Calc and saved as Excel2007-365 (.xlsx) same problem test.xlsx
I'll take a look at it soon. I was working on the write feature anyway.
@spytheman Can I ask you to see if the encoding.xml
parser is able to parse the XML?
I just tried it with spy's file and this tiny example V file, which I named xml.v
:
import encoding.xml
doc := xml.XMLDocument.from_file('abc.xlsx')!
println(doc)
Result was
[jalon@7950x ~]$ v run xml.v
V panic: result not set (Expecting a prolog or root node starting with "<".)
v hash: c256013
/tmp/v_1000/xml.01HPSFBRMYX9Q01QSGS387957J.tmp.c:8856: at _v_panic: Backtrace
/tmp/v_1000/xml.01HPSFBRMYX9Q01QSGS387957J.tmp.c:8821: by panic_result_not_set
/tmp/v_1000/xml.01HPSFBRMYX9Q01QSGS387957J.tmp.c:20743: by main__main
/tmp/v_1000/xml.01HPSFBRMYX9Q01QSGS387957J.tmp.c:20856: by main
[jalon@7950x ~]$
However, in looking at the file - it's compressed. First 2 bytes of the file are PK
, meaning it is zip'd.
If I try it with the uncompressed XML, it works fine.
Yes, parsing the uncompressed XML is ok:
import encoding.xml
content := '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:xdr="http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing" xmlns:x14="http://schemas.microsoft.com/office/spreadsheetml/2009/9/main" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"><sheetPr filterMode="false"><pageSetUpPr fitToPage="false"/></sheetPr><dimension ref="A1"/><sheetViews><sheetView showFormulas="false" showGridLines="true" showRowColHeaders="true" showZeros="true" rightToLeft="false" tabSelected="true" showOutlineSymbols="true" defaultGridColor="true" view="normal" topLeftCell="A1" colorId="64" zoomScale="100" zoomScaleNormal="100" zoomScalePageLayoutView="100" workbookViewId="0"><selection pane="topLeft" activeCell="A2" activeCellId="0" sqref="A2"/></sheetView></sheetViews><sheetFormatPr defaultColWidth="11.53515625" defaultRowHeight="12.8" zeroHeight="false" outlineLevelRow="0" outlineLevelCol="0"></sheetFormatPr><sheetData><row r="1" customFormat="false" ht="12.8" hidden="false" customHeight="false" outlineLevel="0" collapsed="false"><c r="A1" s="0" t="s"><v>0</v></c></row></sheetData><printOptions headings="false" gridLines="false" gridLinesSet="true" horizontalCentered="false" verticalCentered="false"/><pageMargins left="0.7875" right="0.7875" top="1.05277777777778" bottom="1.05277777777778" header="0.7875" footer="0.7875"/><pageSetup paperSize="9" scale="100" firstPageNumber="1" fitToWidth="1" fitToHeight="1" pageOrder="downThenOver" orientation="portrait" blackAndWhite="false" draft="false" cellComments="none" useFirstPageNumber="true" horizontalDpi="300" verticalDpi="300" copies="1"/><headerFooter differentFirst="false" differentOddEven="false"><oddHeader>&C&"Times New Roman,Regular"&12&A</oddHeader><oddFooter>&C&"Times New Roman,Regular"&12Page &P</oddFooter></headerFooter></worksheet>'
dump(content.len)
x := xml.XMLDocument.from_string(content)!
y := x.validate()!
assert y.str().len > 0
prints [z.v:5] content.len: 2042
, and the assertion passes.
The XLSX file's content is:
#0 19:46:40 ᛋ master /v/oo❱unzip -l ~/Documents/abc.xlsx
Archive: /home/delian/Documents/abc.xlsx
Length Date Time Name
--------- ---------- ----- ----
571 2024-02-16 14:10 _rels/.rels
878 2024-02-16 14:10 xl/workbook.xml
4451 2024-02-16 14:10 xl/styles.xml
2042 2024-02-16 14:10 xl/worksheets/sheet1.xml
549 2024-02-16 14:10 xl/_rels/workbook.xml.rels
199 2024-02-16 14:10 xl/sharedStrings.xml
731 2024-02-16 14:10 docProps/core.xml
378 2024-02-16 14:10 docProps/app.xml
1367 2024-02-16 14:10 [Content_Types].xml
--------- -------
11166 9 files
#0 19:48:31 ᛋ master /v/oo❱
For comparison, the content of the .xlsx file in the examples (that works) is:
#0 19:48:31 ᛋ master /v/oo❱unzip -l ~/.vmodules/xlsx/examples/01_marksheet/data.xlsx
Archive: /home/delian/.vmodules/xlsx/examples/01_marksheet/data.xlsx
Length Date Time Name
--------- ---------- ----- ----
1296 1980-01-01 00:00 [Content_Types].xml
588 1980-01-01 00:00 _rels/.rels
2295 1980-01-01 00:00 xl/workbook.xml
831 1980-01-01 00:00 xl/_rels/workbook.xml.rels
5088 1980-01-01 00:00 xl/worksheets/sheet1.xml
8390 1980-01-01 00:00 xl/theme/theme1.xml
2264 1980-01-01 00:00 xl/styles.xml
822 1980-01-01 00:00 xl/sharedStrings.xml
496 1980-01-01 00:00 xl/calcChain.xml
623 1980-01-01 00:00 docProps/core.xml
784 1980-01-01 00:00 docProps/app.xml
--------- -------
23477 11 files
#0 19:49:09 ᛋ master /v/oo❱
What fails is the sheet := Sheet.from_doc(sheet_name, sheet_doc, shared_strings)
call.
The error in the panic is misleading (it duplicates the error for parsing the XML file).
The call above sheet_doc := xml.XMLDocument.from_file(sheet_path) or {
passes fine @hungrybluedev . I think that at the very least, the errors should be more detailed and different (including the err
from the lower layers, not silently dropping it, like now).
[/home/delian/.vmodules/xlsx/src/parser.v:104] err: Row does not include span.
[/home/delian/.vmodules/xlsx/src/parser.v:142] row.attributes: {'r': '1', 'customFormat': 'false', 'ht': '12.8', 'hidden': 'false', 'customHeight': 'false', 'outlineLevel': '0', 'collapsed': 'false'}
The "row" in question seems to be from this piece of XML:
<sheetData>
<row r="1" customFormat="false" ht="12.8" hidden="false" customHeight="false" outlineLevel="0" collapsed="false">
<c r="A1" s="0" t="s">
<v>
0
</v>
</c>
</row>
</sheetData>
Rows from the working file look like this:
<sheetData>
<row r="1" spans="1:9" s="1" customFormat="1" x14ac:dyDescent="0.3">
<c r="A1" s="1" t="s">
<v>
0
</v>
</c>
...
</row>
...
<sheetData>
I could get it to work with this local patch:
#0 20:23:09 ᛋ fix_example ~/.vmodules/xlsx❱git diff
diff --git src/parser.v src/parser.v
index 79ae736..2ead1ce 100644
--- src/parser.v
+++ src/parser.v
@@ -135,7 +135,7 @@ fn Sheet.from_doc(name string, doc xml.XMLDocument, shared_strings []string) !Sh
row_label := row.attributes['r'] or { return error('Row does not include location.') }
row_index := row_label.int() - 1
- span_string := row.attributes['spans'] or { return error('Row does not include span.') }
+ span_string := row.attributes['spans'] or { '1:1' }
span := span_string.split(':').map(it.int())
cell_count := span[1] - span[0] + 1
@@ -185,7 +185,6 @@ fn Sheet.from_doc(name string, doc xml.XMLDocument, shared_strings []string) !Sh
cells: cells
}
}
-
return Sheet{
name: name
rows: rows
diff --git src/query.v src/query.v
index f3cfc5f..3b4de77 100644
--- src/query.v
+++ src/query.v
@@ -16,9 +16,6 @@ pub fn (sheet Sheet) get_all_data() !DataFrame {
}
pub fn (sheet Sheet) get_data(top_left Location, bottom_right Location) !DataFrame {
- if top_left == bottom_right {
- return DataFrame{}
- }
if top_left.row >= sheet.rows.len {
return error('top_left.row out of range')
}
#0 20:23:10 ᛋ fix_example ~/.vmodules/xlsx❱
... and then:
import xlsx
fn main() {
workbook := xlsx.Document.from_file('abc.xlsx')!
println('[info] Successfully loaded workbook with ${workbook.sheets.len} worksheets.')
println('\nAvailable sheets:')
for index, key in workbook.sheets.keys() {
println(' sheet ${index+1} has key: "${key}"')
}
sheet1 := workbook.sheets[1]
dataset := sheet1.get_all_data()!
count := dataset.row_count()
println('\n[info] Sheet 1 has ${count} rows.')
headers := dataset.raw_data[0]
println('\nThe headers are:')
for index, header in headers {
println('${index + 1}. ${header}')
}
}
This however makes one of the tests of the module to fail, and I do not know the module enough to fix the error:
FAIL [2/5] C: 752.3 ms, R: 6.664 ms /home/delian/.vmodules/xlsx/spec/01_empty_file/empty_test.v
retry: 1
comp_cmd: '/space/v/oo/v' -skip-running -o '/tmp/v_1000/tsession_7f6a3f911b80_201668685110535/01HPSJM820B18X879M4Y3BJTY8/empty_test' '/home/delian/.vmodules/xlsx/spec/01_empty_file/empty_test.v'
run_cmd: '/tmp/v_1000/tsession_7f6a3f911b80_201668685110535/01HPSJM820B18X879M4Y3BJTY8/empty_test'
failure code: 1; foutput.len: 393; failure output:
---------------------------------------------------------------------------------------------------- retry: 0 ; max_retry: 0 ; r.exit_code: 1 ; trimmed_output.len: 225
/home/delian/.vmodules/xlsx/spec/01_empty_file/empty_test.v:10: ✗ fn test_empty failed propagation with error: top_left.row out of range
10 | assert sheet.get_all_data()! == xlsx.DataFrame{}
To me, it seems logical, that a sheet, that has a single cell filled, should be supported 🤔 . Perhaps it is just an off by one error in one of the checks?
This patch seems to work better (the tests pass too):
#0 20:33:56 ᛋ fix_example ~/.vmodules/xlsx❱git diff
diff --git src/parser.v src/parser.v
index 79ae736..d4da2ad 100644
--- src/parser.v
+++ src/parser.v
@@ -135,7 +135,7 @@ fn Sheet.from_doc(name string, doc xml.XMLDocument, shared_strings []string) !Sh
row_label := row.attributes['r'] or { return error('Row does not include location.') }
row_index := row_label.int() - 1
- span_string := row.attributes['spans'] or { return error('Row does not include span.') }
+ span_string := row.attributes['spans'] or { '0:0' }
span := span_string.split(':').map(it.int())
cell_count := span[1] - span[0] + 1
@@ -185,7 +185,6 @@ fn Sheet.from_doc(name string, doc xml.XMLDocument, shared_strings []string) !Sh
cells: cells
}
}
-
return Sheet{
name: name
rows: rows
diff --git src/query.v src/query.v
index f3cfc5f..b53c996 100644
--- src/query.v
+++ src/query.v
@@ -16,7 +16,7 @@ pub fn (sheet Sheet) get_all_data() !DataFrame {
}
pub fn (sheet Sheet) get_data(top_left Location, bottom_right Location) !DataFrame {
- if top_left == bottom_right {
+ if bottom_right.row == 0 && bottom_right.row == 0 && sheet.rows.len == 0 {
return DataFrame{}
}
if top_left.row >= sheet.rows.len {
#0 20:33:59 ᛋ fix_example ~/.vmodules/xlsx❱
The fix is in https://github.com/hungrybluedev/xlsx/pull/6 .
The PR also fixes opening the test.xlsx file, that @MedLabs uploaded.
#0 20:42:58 ᛋ master /v/oo❱v run open_xlsx.v ~/Downloads/test.xlsx
[info] Successfully loaded workbook with 1 worksheets.
Available sheets:
sheet 1 has key: "1"
[info] Sheet 1 has 4 rows.
The headers are:
1. name
2. description
#0 20:43:10 ᛋ master /v/oo❱
The .xls file is imho out of scope, my file
program identifies it as: Composite Document File V2 Document, Little Endian, Os: Windows, Version 5.1, Code page: 1256
, while for the .xlsx, it shows Microsoft Excel 2007+
.
It is definitely not a ZIP archive, containing XML.
xls and xlsx are 2 very different formats...
"The primary difference between XLS and XLSX files lies in the underlying file format.
XLS files are based on the Binary Interchange File Format (BIFF) and store information in binary format.
On the other hand, XLSX files are XML-based with a more open structure, making them compatible with not only Microsoft Windows but also MacOS, iOS, and Android platforms."
I opened LibreOffice Calc, typed
abc
in the first cell, then saved the file as:It produced this file: abc.xlsx
I then tried to open it with a V program containing:
The result was:
The xml in that tmp folder is: