A complex question. In short: odfdo is inherently slow.
What does `traverse()` do?
odfdo is clearly slow because of its design: it directly accesses the underlying XML structure. The reason is to allow direct modification of any XML element in the document without using an intermediate format.
In ODF, there is no way to directly access a row or cell without traversing all the previous ones, because of the "number-rows-repeated" feature. So naive row iteration would be an O(n²) algorithm, and cell iteration an O(n⁴) algorithm.
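To illustrate with a toy model (this is not odfdo's actual data structure, just the shape of the problem):

```python
# Toy model of why naive indexed access is O(n^2): one <table:table-row>
# element may stand for many logical rows via "number-rows-repeated",
# so locating logical row i means scanning all preceding elements and
# summing their repeat counts.

rows = [
    {"repeated": 1, "cells": ["header"]},
    {"repeated": 1_048_563, "cells": []},  # one element covering ~1M blank rows
    {"repeated": 1, "cells": ["footer"]},
]

def get_row(rows, index):
    """Return the element covering logical row `index` (one O(n) scan)."""
    position = 0
    for elem in rows:
        position += elem["repeated"]
        if index < position:
            return elem
    raise IndexError(index)

# Iterating by calling get_row(rows, i) for each i performs n scans of up
# to n elements: O(n^2) overall. The same repeat trick exists for cells
# ("number-columns-repeated"), which compounds the cost further.
```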
The traverse() method (and the other get_row methods) uses and builds a cache: the second time an item is requested, it is no longer necessary to count all the previous rows to find it. So there is a little extra time the first time to build the cache.
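The idea of the cache, roughly (a sketch, not the real implementation; it reuses the toy `rows` shape from the sketch above):

```python
import bisect

# Sketch of the caching idea: record the cumulative row offsets once,
# then answer later lookups with a binary search instead of rescanning
# from the first row element.

class RowCache:
    def __init__(self, rows):
        self._rows = rows
        self._ends = []                 # cumulative end offset of each element
        total = 0
        for elem in rows:
            total += elem["repeated"]
            self._ends.append(total)

    def get_row(self, index):
        # O(log n) per lookup, after the one-time O(n) build above
        return self._rows[bisect.bisect_right(self._ends, index)]

cache = RowCache([
    {"repeated": 1, "cells": ["header"]},
    {"repeated": 1_048_563, "cells": []},
    {"repeated": 1, "cells": ["footer"]},
])
assert cache.get_row(500_000) == {"repeated": 1_048_563, "cells": []}
```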
Another very expensive operation is creating the Row() or Cell() objects when they are not already in the cache. So `number-rows-repeated = 1048563` is nearly a zip bomb: about a million underlying Python and XML objects are created.
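Simple arithmetic, using the 15-column layout of the example file discussed below:

```python
repeated_rows = 1_048_563                           # number-rows-repeated value
columns = 15                                        # columns in the example file below
print(f"{repeated_rows:,} Row objects")             # 1,048,563
print(f"{repeated_rows * columns:,} Cell objects")  # 15,728,445 if cells are built too
```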
I don't see much room for optimization. One could imagine stripping the document on opening (with optimize_width() or something), but that would change the actual content of the document, which is bad. In your example, all rows have a style, so maybe the author of the original document wants every row to have a specific appearance. That is how you can end up with such a big "number-rows-repeated": style a whole row or column ("blue background", say) and it applies to every repeated row. The optimize_width() method actually cuts the document: it removes the trailing rows when there is no real content below, only styles. But the document is modified, so this must be a deliberate choice of the user. And maybe you really do want to edit the 100,000th row.
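So any stripping stays an explicit, opt-in step, something like this (a hedged sketch; the file name is hypothetical):

```python
from odfdo import Document

doc = Document("styled_but_empty.ods")  # hypothetical file name
table = doc.body.tables[0]

# Deliberately opt in to cutting the trailing styled-but-empty rows.
# This mutates the in-memory document: don't do it if you need to
# preserve (or later edit) those rows.
table.optimize_width()
```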
> odfdo is clearly slow because of its design: it directly accesses the underlying XML structure. The reason is to allow direct modification of any XML element in the document without using an intermediate format.
Yeah, that's the sticking point, I think. The example in question was just a test document Save As'd from XLSX. The real-world documents I'm handling take about 10 minutes to run on a fairly reasonable number of records, though I'm wondering if that is affected by every cell including `table:number-columns-spanned="1" table:number-rows-spanned="1"`?
> I don't see much room for optimization. One could imagine stripping the document on opening (with optimize_width() or something), but that would change the actual content of the document, which is bad. In your example, all rows have a style, so maybe the author of the original document wants every row to have a specific appearance. That is how you can end up with such a big "number-rows-repeated": style a whole row or column ("blue background", say) and it applies to every repeated row. The optimize_width() method actually cuts the document: it removes the trailing rows when there is no real content below, only styles. But the document is modified, so this must be a deliberate choice of the user. And maybe you really do want to edit the 100,000th row.
Yeah, I'm only reading in files to get at their data, so I was trying to avoid this, but it looks like `Table.rstrip()` may be what I need to squeeze what I can out of things.
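Something like this sketch of a read-only flow (the path is illustrative, and I'm assuming `Row.get_values()` for reading out the cells):

```python
from odfdo import Document

doc = Document("records.ods")    # illustrative path
table = doc.body.tables[0]
table.rstrip()                   # cut trailing empty rows/cells, in memory only

for row in table.traverse():
    values = row.get_values()    # assumption: Row.get_values() returns a plain list
    ...                          # process the record
```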
For a concrete example:
The file will have 15 columns and ~83k rows, and will be ~6 MB in size.
In the way I'm using this library (I haven't tested it raw), it takes ~4 hours to read this file in, even though Excel takes seconds.
Thanks for the example file. It's very slow here too. Looking at the code, I see some possible optimisations, but it will depend on the type of query.
Any speed increase will be welcome!
Did you want to reopen this issue then?
I think the bug is solved in the latest release (v3.9.4): here it now takes 1.2 sec to parse your test file of ~83k rows.
>>> from odfdo import *
>>> from time import perf_counter
>>> print(__version__)
3.9.4
>>> doc = Document('~/Desktop/big.ods')
>>> table = doc.body.tables[0]
>>> def test():
...     counter = 0
...     t0 = perf_counter()
...     for row in table.traverse():
...         counter += 1
...     print(counter, perf_counter() - t0)
...
>>> test()
83355 1.1783001250005327
>>>
That is so much better! Thank you!
I have an ODS file that includes the following silly row at the end (saved from Excel):
Attempting to run `Table.traverse()` on this file (or any of the row functions that rely on it) without first calling `Table.optimize_width()` takes an extremely long time for what is essentially just a bunch of blanks. Is this an inherent speed limitation of Python, or can the traverse repeat algorithm be optimized somehow?