feat: add buffered operations for bulk table read/write

dashingdove commented 1 year ago

Table.cell performs very poorly on large tables because it appears to rebuild the cell array on every call.

I ran into this when using the htmldocx package. When it adds a table, Table.cell is called once for each cell, which degrades performance significantly.

I have been able to work around this by reading _cells once and then indexing into that array inside the loop. However, _cells is a private property, and it would be good to have an "official" way to do this without hacking around.
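
For anyone hitting the same problem, the workaround looks roughly like the sketch below. It assumes the private Table._cells and Table._column_count attributes, which are not part of the public API and may change between releases. fill_table is a hypothetical helper name, not something python-docx provides.

```python
def fill_table(table, rows):
    """Write `rows` (a list of lists) into `table` using one cell lookup.

    Assumption: `table` exposes the private python-docx attributes
    `_cells` (flat, row-major list of cells) and `_column_count`.
    """
    cells = table._cells          # private; fetched once instead of per cell
    n_cols = table._column_count  # private; row stride of the flat list
    for row_idx, row in enumerate(rows):
        for col_idx, value in enumerate(row):
            # Same index arithmetic that Table.cell(row_idx, col_idx) relies on.
            cells[col_idx + row_idx * n_cols].text = str(value)
```

With a real document this would be called as fill_table(document.add_table(rows=len(data), cols=len(data[0])), data). The cached list goes stale if rows or columns are added afterwards, so it should be fetched again after any structural change.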

toxicphreAK commented 1 year ago

You may want to have a look here: https://github.com/toxicphreAK/python-docx-ng

I tried to implement this in https://github.com/toxicphreAK/python-docx-ng/pull/1, https://github.com/toxicphreAK/python-docx-ng/pull/8, and https://github.com/python-openxml/python-docx/pull/1196. Hopefully it helps.

scanny commented 1 year ago

Yeah, tables in Word are complicated because they are so flexible. So you really need at least all prior rows as context to compute whether a particular cell is merged and so forth.

Add to this the characteristic of python-docx that all document state is (necessarily) stored in the XML, plus the fact that we can't tell whether you've mutated the table between two Table.cell calls, and you get this situation.

I think what's called for here is two alternatives (three if you count the current functionality), depending on whether you're reading or writing:

In the reading case, there could be a table "snapshot" that you can read as much as you want without anything needing to be recomputed. The contract would be that any table mutations are ignored by the snapshot, but as long as folks know about that and know they aren't going to be changing anything, they can get much higher read performance.
In the writing case, there could be a table "buffer" object that you could write to performantly, but that doesn't actually mutate the document until you call its .save() or maybe .sync() method.
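
To make the two contracts concrete, here is a rough sketch of what they might look like. None of this exists in python-docx: TableSnapshot, TableBuffer, cell_text, and set_text are hypothetical names, and the sketch leans on the private Table._cells and Table._column_count attributes purely for illustration. It only handles cell text, not formatting or merges.

```python
class TableSnapshot:
    """Hypothetical read-only view; table mutations after creation are ignored."""

    def __init__(self, table):
        n_cols = table._column_count                  # private attribute (assumption)
        texts = [cell.text for cell in table._cells]  # computed once, up front
        step = n_cols or 1                            # avoid a zero step on empty tables
        self._rows = [tuple(texts[i:i + step]) for i in range(0, len(texts), step)]

    def cell_text(self, row_idx, col_idx):
        # Plain tuple lookup; nothing is recomputed from the XML.
        return self._rows[row_idx][col_idx]


class TableBuffer:
    """Hypothetical write buffer; the document is untouched until sync()."""

    def __init__(self, table):
        self._table = table
        self._pending = {}  # (row_idx, col_idx) -> text, last write wins

    def set_text(self, row_idx, col_idx, text):
        self._pending[(row_idx, col_idx)] = text

    def sync(self):
        cells = self._table._cells          # private; one lookup for all writes
        n_cols = self._table._column_count  # private
        for (row_idx, col_idx), text in self._pending.items():
            cells[col_idx + row_idx * n_cols].text = text
        self._pending.clear()
```

The snapshot trades freshness for speed, and the buffer trades immediacy for speed; in both cases the expensive cell computation happens once per object rather than once per cell access.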

I'm making this a feature request. Not sure when or if we'll get to it, but it would be a solid enhancement.

python-openxml / python-docx

feat: add buffered operations for bulk table read/write #1209