A complex question. In short: odfdo is inherently slow.
What does `traverse()` do?
odfdo is clearly slow because of its design: it directly accesses the underlying XML structure. The reason is to allow direct modification of any XML element in the document without using an intermediate format.
In ODF, there is no way to directly access a row or cell without traversing all the previous ones, because of the "number-rows-repeated" feature. So naive row iteration would be an O(n²) algorithm, and cell iteration an O(n⁴) algorithm.
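To illustrate with a toy model (this is not odfdo's actual data structure, just the shape of the problem):

```python
# Toy model of why naive indexed access is O(n^2): one <table:table-row>
# element may stand for many logical rows via "number-rows-repeated",
# so locating logical row i means scanning all preceding elements and
# summing their repeat counts.

rows = [
    {"repeated": 1, "cells": ["header"]},
    {"repeated": 1_048_563, "cells": []},  # one element covering ~1M blank rows
    {"repeated": 1, "cells": ["footer"]},
]

def get_row(rows, index):
    """Return the element covering logical row `index` (one O(n) scan)."""
    position = 0
    for elem in rows:
        position += elem["repeated"]
        if index < position:
            return elem
    raise IndexError(index)

# Iterating by calling get_row(rows, i) for each i performs n scans of up
# to n elements: O(n^2) overall. The same repeat trick exists for cells
# ("number-columns-repeated"), which compounds the cost further.
```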
The traverse() method (and the other get_row methods) uses and builds a cache: the second time an item is requested, it is no longer necessary to count all the previous rows to find it. So there is a little extra time the first time to build the cache.
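The idea of the cache, roughly (a sketch, not the real implementation; it reuses the toy `rows` shape from the sketch above):

```python
import bisect

# Sketch of the caching idea: record the cumulative row offsets once,
# then answer later lookups with a binary search instead of rescanning
# from the first row element.

class RowCache:
    def __init__(self, rows):
        self._rows = rows
        self._ends = []                 # cumulative end offset of each element
        total = 0
        for elem in rows:
            total += elem["repeated"]
            self._ends.append(total)

    def get_row(self, index):
        # O(log n) per lookup, after the one-time O(n) build above
        return self._rows[bisect.bisect_right(self._ends, index)]

cache = RowCache([
    {"repeated": 1, "cells": ["header"]},
    {"repeated": 1_048_563, "cells": []},
    {"repeated": 1, "cells": ["footer"]},
])
assert cache.get_row(500_000) == {"repeated": 1_048_563, "cells": []}
```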
Another very expensive operation is creating the Row() or Cell() objects when they are not already in the cache. So `number-rows-repeated = 1048563` is nearly a zip bomb: about a million underlying Python and XML objects are created.
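Simple arithmetic, using the 15-column layout of the example file discussed below:

```python
repeated_rows = 1_048_563                           # number-rows-repeated value
columns = 15                                        # columns in the example file below
print(f"{repeated_rows:,} Row objects")             # 1,048,563
print(f"{repeated_rows * columns:,} Cell objects")  # 15,728,445 if cells are built too
```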
I don't see much room for optimization. One could imagine stripping the document on opening (with optimize_width() or something), but that would change the actual content of the document, which is bad. In your example, all rows have a style, so maybe the author of the original document wants every row to have a specific appearance. That is how you can end up with such a big "number-rows-repeated": style a whole row or column ("blue background", say) and it applies to every repeated row. The optimize_width() method actually cuts the document: it removes the trailing rows when there is no real content below, only styles. But the document is modified, so this must be a deliberate choice of the user. And maybe you really do want to edit the 100,000th row.
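So any stripping stays an explicit, opt-in step, something like this (a hedged sketch; the file name is hypothetical):

```python
from odfdo import Document

doc = Document("styled_but_empty.ods")  # hypothetical file name
table = doc.body.tables[0]

# Deliberately opt in to cutting the trailing styled-but-empty rows.
# This mutates the in-memory document: don't do it if you need to
# preserve (or later edit) those rows.
table.optimize_width()
```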
> odfdo is clearly slow because of its design: it directly accesses the underlying XML structure. The reason is to allow direct modification of any XML element in the document without using an intermediate format.
Yeah, that's the sticking point, I think. The example in question was just a test document Save As'd from XLSX. The real-world documents I'm handling take about 10 minutes to run on a fairly reasonable number of records, though I'm wondering if that is affected by every cell including `table:number-columns-spanned="1" table:number-rows-spanned="1"`?
> I don't see much room for optimization. One could imagine stripping the document on opening (with optimize_width() or something), but that would change the actual content of the document, which is bad. In your example, all rows have a style, so maybe the author of the original document wants every row to have a specific appearance. That is how you can end up with such a big "number-rows-repeated": style a whole row or column ("blue background", say) and it applies to every repeated row. The optimize_width() method actually cuts the document: it removes the trailing rows when there is no real content below, only styles. But the document is modified, so this must be a deliberate choice of the user. And maybe you really do want to edit the 100,000th row.
Yeah, I'm only reading in files to get at their data, so I was trying to avoid this, but it looks like `Table.rstrip()` may be what I need to squeeze what I can out of things.
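Something like this sketch of a read-only flow (the path is illustrative, and I'm assuming `Row.get_values()` for reading out the cells):

```python
from odfdo import Document

doc = Document("records.ods")    # illustrative path
table = doc.body.tables[0]
table.rstrip()                   # cut trailing empty rows/cells, in memory only

for row in table.traverse():
    values = row.get_values()    # assumption: Row.get_values() returns a plain list
    ...                          # process the record
```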
For a concrete example:
The file will have 15 columns and ~83k rows, and will be ~6 MB in size.
In the way I'm using this library (I haven't tested it raw), it takes ~4 hours to read this file in, even though Excel takes seconds.
Thanks for the example file. It's very slow here too. Looking at the code, I see some possible optimisations, but it will depend on the type of query.
Any speed increase will be welcome!
Did you want to reopen this issue then?
I think the bug is solved in the latest release (v3.9.4): here it now takes 1.2 sec to parse your test file of ~83k rows.
>>> from odfdo import *
>>> from time import perf_counter
>>> print(__version__)
3.9.4
>>> doc = Document('~/Desktop/big.ods')
>>> table = doc.body.tables[0]
>>> def test():
...     counter = 0
...     t0 = perf_counter()
...     for row in table.traverse():
...         counter += 1
...     print(counter, perf_counter() - t0)
...
>>> test()
83355 1.1783001250005327
>>>
That is so much better! Thank you!
I have an ODS file that includes the following silly row at the end (saved from Excel):
Attempting to run `Table.traverse()` on this file (or any of the row functions that rely on it) without first calling `Table.optimize_width()` takes an extremely long time for what is essentially just a bunch of blanks. Is this an inherent speed limitation of Python, or can the traverse repeat algorithm be optimized somehow?