jdum / odfdo

Python library for OpenDocument format (ODF)

`Table.traverse()` is extremely slow when table has many blanks #46

Closed: GPHemsley-RELX closed this issue 3 weeks ago

GPHemsley-RELX commented 3 months ago

I have an ODS file that includes the following silly row at the end (saved from Excel):

<table:table-row table:number-rows-repeated="1048563" table:style-name="ro1"><table:table-cell table:number-columns-repeated="16384"/></table:table-row>

Attempting to run Table.traverse() on this file (or any of the row functions that rely on it) without first calling Table.optimize_width() takes an extremely long time for what is essentially just a bunch of blanks.
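
For reference, an ODS file is just a ZIP archive whose table data lives in content.xml, so runaway repeats can be spotted before any odfdo parsing. A stdlib-only sketch (the file name is hypothetical; the attribute names are the ODF ones from the snippet above):

from zipfile import ZipFile
import xml.etree.ElementTree as ET

TABLE = "{urn:oasis:names:tc:opendocument:xmlns:table:1.0}"  # standard ODF table namespace

with ZipFile("big.ods") as ods:  # hypothetical file name
    content = ET.fromstring(ods.read("content.xml"))

# Report rows whose number-rows-repeated is suspiciously large.
for row in content.iter(TABLE + "table-row"):
    repeated = int(row.get(TABLE + "number-rows-repeated", "1"))
    if repeated > 10_000:
        print("row repeated", repeated, "times, style:", row.get(TABLE + "style-name"))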

Is this an inherent speed limitation of Python, or can the traverse repeat algorithm be optimized somehow?
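
One direction such an optimization could take, purely as an illustration and not odfdo's actual code: a repeat-aware iterator can yield each stored row once, together with its repeat count, instead of materializing every repetition. A toy sketch on a raw element tree:

import xml.etree.ElementTree as ET

TABLE = "{urn:oasis:names:tc:opendocument:xmlns:table:1.0}"

def traverse_collapsed(table_elem):
    """Yield (row_element, repeat_count) pairs, leaving repeats collapsed.

    A naive traversal instead yields each repetition individually, which is
    O(sum of repeat counts): over a million steps for the Excel row above.
    """
    for row in table_elem.iter(TABLE + "table-row"):
        yield row, int(row.get(TABLE + "number-rows-repeated", "1"))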

jdum commented 3 months ago

A complex question. In short: odfdo is inherently slow.

  • odfdo is clearly slow because of its design: it directly accesses the underlying XML structure. The reason is to allow direct modification of any XML element in the document without using an intermediate format.

What does traverse() do: it walks the table and yields every row in order, expanding each number-rows-repeated run into individual rows, so a row repeated 1,048,563 times is yielded 1,048,563 times.

I don't see much room for optimization. One could imagine stripping the document on opening (with optimize_width() or something), but that would change the actual content of the document, which is bad. In your example, all rows have a style, so maybe the author of the original document wants every row to have a specific appearance: just by styling a row or column with a blue background, you can obtain such a big number-rows-repeated. The optimize_width() method actually cuts the document when there is no real content below except the styles. But the document is modified: this must be a deliberate choice of the user. And maybe you really do want to edit the 100,000th row.
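
As a sketch of what that deliberate choice looks like in practice (method names as used in this thread; the no-argument call and the save step are assumptions):

from odfdo import Document

doc = Document("styled.ods")      # hypothetical file
table = doc.body.tables[0]
table.optimize_width()            # deliberately cuts trailing style-only rows/columns
rows = list(table.traverse())     # fast now: the huge repeated blank row is gone
# doc.save("styled.ods")          # only if you accept losing the trailing styled rows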

GPHemsley-RELX commented 3 months ago
> odfdo is clearly slow because of its design: it directly accesses the underlying XML structure. The reason is to allow direct modification of any XML element in the document without using an intermediate format.

Yeah, that's the sticking point, I think. The example in question was just a test document Save As'd from XLSX. The real-world documents I'm handling take about 10 minutes to run on a fairly reasonable number of records, though I'm wondering whether that is affected by every cell including table:number-columns-spanned="1" table:number-rows-spanned="1"?

> I don't see much room for optimization. One could imagine stripping the document on opening (with optimize_width() or something), but that would change the actual content of the document, which is bad. In your example, all rows have a style, so maybe the author of the original document wants every row to have a specific appearance: just by styling a row or column with a blue background, you can obtain such a big number-rows-repeated. The optimize_width() method actually cuts the document when there is no real content below except the styles. But the document is modified: this must be a deliberate choice of the user. And maybe you really do want to edit the 100,000th row.

Yeah, I'm only reading in files to get at their data, so I was trying to avoid this, but it looks like Table.rstrip() may be what I need to squeeze what I can out of things.
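
For that kind of read-only extraction, the pattern might be as simple as the sketch below; rstrip() is taken from this thread, and the assumption is that it only trims trailing blank rows and cells in memory:

from odfdo import Document

doc = Document("export.ods")      # hypothetical file
table = doc.body.tables[0]
table.rstrip()                    # trim trailing blanks in memory; the file on disk is untouched
for row in table.traverse():      # now bounded by the real data, not the repeat counts
    pass                          # read cell values here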

GPHemsley-RELX commented 3 weeks ago

For a concrete example:

  1. Go to https://edopl.idaho.gov/OnlineServices/ and click on "Search for Licenses".
  2. Set Search Type to List, set Board to "Board of Nursing", and hit Search.
  3. Once the search completes (it will take a while), hit "Export" to download the ODS file.

The file will have 15 columns and ~83k rows, and will be ~6 MB in size.

In the way I'm using this library (I haven't tested it raw), it takes ~4 hours to read this file in, even though Excel takes seconds.

jdum commented 3 weeks ago

Thanks for the example file. It's very slow here too. Looking at the code, I see some possible optimisations, but the gain will depend on the type of query.

GPHemsley-RELX commented 3 weeks ago

Any speed increase will be welcome!

Did you want to reopen this issue then?

jdum commented 2 weeks ago

I think the bug is solved in the latest release (v3.9.4): parsing your test file of ~83k rows now takes about 1.2 s here.

>>> from odfdo import *
>>> from time import perf_counter
>>> print(__version__)
3.9.4
>>> doc = Document('~/Desktop/big.ods')
>>> table = doc.body.tables[0]
>>> def test():
...     counter = 0
...     t0 = perf_counter()
...     for row in table.traverse():
...         counter += 1
...     print(counter, perf_counter() - t0)
...     
>>> test()
83355 1.1783001250005327
>>> 
GPHemsley-RELX commented 2 weeks ago

That is so much better! Thank you!