python-openxml / python-docx

Create and modify Word documents with Python
MIT License

feat: add buffered operations for bulk table read/write #1209

Open dashingdove opened 1 year ago

dashingdove commented 1 year ago

Table.cell performs very poorly on large tables because it appears to rebuild the cell array on every call.

I ran into this when using the htmldocx package. When it adds a table, it calls the cell function once for each cell, which degrades performance significantly.

I have been able to circumvent this issue by getting _cells once and then referring to that array inside of the loop instead. However, this is a private property and it might be good to have an "official" way to do this without hacking around.
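A minimal sketch of the workaround described above. It assumes the private `_cells` property returns a flat, row-major list of cells and that the private `_column_count` attribute gives the grid width; both are internal names that may change between versions, and `fill_table` is a hypothetical helper, not part of python-docx.

```python
def fill_table(table, rows):
    """Write `rows` (a sequence of sequences of strings) into `table`.

    Reads the private `_cells` list once instead of calling
    `table.cell()` for every cell. Assumes `_cells` is row-major and
    that the table is not restructured while this function runs.
    """
    cells = table._cells          # private: fetched once, reused below
    n_cols = table._column_count  # private: assumed grid width
    for r, row in enumerate(rows):
        for c, text in enumerate(row):
            # Same index arithmetic a per-cell lookup would need.
            cells[c + r * n_cols].text = text
```

The cached list is only valid while the table structure stays unchanged; after adding rows or merging cells it would need to be fetched again.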

toxicphreAK commented 1 year ago

You may have a look here: https://github.com/toxicphreAK/python-docx-ng

I tried to implement this in https://github.com/toxicphreAK/python-docx-ng/pull/1, https://github.com/toxicphreAK/python-docx-ng/pull/8 and https://github.com/python-openxml/python-docx/pull/1196

Hopefully it may help you.

scanny commented 1 year ago

Yeah, tables in Word are complicated because they are so flexible. So you really need at least all prior rows as context to compute whether a particular cell is merged and so forth.

Add to this the fact that python-docx (necessarily) stores all document state in the XML, and that we can't tell whether you've mutated the table between two Table.cell calls, and you get this situation.
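To illustrate why prior rows are needed as context, here is a simplified model, not the library's actual code. Each cell is a hypothetical `(label, grid_span, continues_above)` tuple; `build_grid` is an illustrative name, and the sketch assumes the first row contains no vertically merged continuation cells.

```python
def build_grid(rows):
    """Expand rows of (label, grid_span, continues_above) into a flat grid.

    A horizontally merged cell occupies `grid_span` grid positions.
    A cell that continues a vertical merge takes its value from the
    position directly above, so earlier rows must already be resolved.
    """
    width = sum(span for _, span, _ in rows[0])  # grid columns per row
    grid = []
    for row in rows:
        for label, span, continues_above in row:
            for _ in range(span):
                if continues_above:
                    grid.append(grid[-width])  # same column, previous row
                else:
                    grid.append(label)
    return grid


rows = [
    [("A", 2, False), ("B", 1, False)],
    [("", 2, True), ("C", 1, False)],
]
grid = build_grid(rows)
```

Any flat list built this way goes stale as soon as the underlying rows change, which is why a cached result cannot be reused safely without knowing the table was left untouched.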

I think what's called for here is two alternatives (three if you count the current functionality), depending on whether you're reading or writing:

I'm making this a feature request. Not sure when or if we'll get to it, but it would be a solid enhancement.