tealeg / xlsx

Go library for reading and writing XLSX files.
Other
5.81k stars 810 forks source link

ValueOnly option truncating break xml #739

Closed horsley closed 2 years ago

horsley commented 2 years ago

Got an error, xml.Decoder.Decode: XML syntax error on line 2: element <row> closed by </c>

a file with dimension <dimension ref="A1:I1038378"/>

many rows with empty cells.

I'll try to fix that

horsley commented 2 years ago

Origin xml

<row r="3" spans="4:8">
<c r="D3" s="26" t="s"><v>2189</v></c>
<c r="E3" s="26" t="str"><f t="shared" si="0"/><v>sometext</v></c>
<c r="G3" s="26" t="s"><v>2186</v></c>
<c r="H3" s="26" t="s"><v>2190</v></c>
</row>

After truncateSheetXMLValueOnly

<row r="3" spans="4:8">
<c r="D3" s="26" t="s"><v>2189</v></c>
<v>sometext</v></c>
<c r="G3" s="26" t="s"><v>2186</v></c>
<c r="H3" s="26" t="s"><v>2190</v></c>
</row>

It seems that the bug was wrong regex in https://github.com/tealeg/xlsx/blob/840e9ee9b9adac9706540392005df460771b08cd/lib.go#L1154

It's matched with the tag f

horsley commented 2 years ago

regex can be change to (?s)<c [^>]*?/[>|c>]

horsley commented 2 years ago

btw, func truncateSheetXMLValueOnly was quite low efficiency because replace iteration by row and cell.

the logic was detecting rows or cells without value (v tag), why we just using regex to matched them, and replace all

It tasks 20 second on my mac to truncate a xml with 100,0000 rows