python-openxml / python-docx

Create and modify Word documents with Python
MIT License

Odd table shapes issue #1405

Closed aanastasiou closed 3 months ago

aanastasiou commented 3 months ago

I am pre-processing a large number of .docx documents with really oddly shaped tables containing text that has to be extracted verbatim.

As useful as python-docx has been for this task, a subset of those documents revealed a small bug in this line.

This PR fixes cases of odd table shapes where the strategy of populating a cell with the value of the previous cell (e.g. in the case of row/cell merges) fails, because there simply has not been a 'previous cell' yet.

I would be glad to contribute a test case as well, but this might take a bit more time, since it means tracking down the exact table (within the XML) that causes the bug and creating an "equivalent" test case.

Hope this helps.

scanny commented 3 months ago

@aanastasiou there was a recent update that addressed the "skipped-cells" condition, which is actually a legitimate (although relatively unusual) table state.

If you use Table.rows to get rows and then iterate _Row.cells to get each cell you shouldn't have a problem there.

Depending on your needs for column alignment you may want to use _Row.grid_cols_before and .grid_cols_after to discover the empty leading and trailing cells.
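A minimal sketch of the column-alignment idea, assuming `_Row.grid_cols_before` and `_Row.grid_cols_after` are integer counts of skipped grid columns, as the comment above suggests. The helper names `row_texts_padded` and `table_texts_padded` and the `filler` parameter are illustrative, not part of python-docx. The functions only read attributes, so they accept any row-like or table-like object.

```python
def row_texts_padded(row, filler=""):
    """Return the cell texts of one row, padded for skipped grid columns.

    `row` is expected to look like a python-docx `_Row`: it exposes `.cells`,
    `.grid_cols_before` and `.grid_cols_after`. `filler` is a hypothetical
    placeholder used for grid positions where the row has no cell.
    """
    leading = [filler] * row.grid_cols_before
    trailing = [filler] * row.grid_cols_after
    return leading + [cell.text for cell in row.cells] + trailing


def table_texts_padded(table, filler=""):
    """Apply `row_texts_padded` to every row reached through `table.rows`."""
    return [row_texts_padded(row, filler) for row in table.rows]
```

With a real document this might be called as `table_texts_padded(docx.Document("report.docx").tables[0])`, where `report.docx` is a placeholder path.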

There is also a new _Cell.grid_span property so you can tell how many grid-cells a horizontally-merged cell occupies.
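As a rough illustration of how `_Cell.grid_span` could be used, the sketch below lists each distinct cell of a row once, together with its span. It rests on an assumption worth checking against your installed version: that a horizontally merged cell appears in `_Row.cells` as the same object repeated once per grid column it covers. The helper name `distinct_cells_with_span` is hypothetical.

```python
def distinct_cells_with_span(row):
    """Return (text, grid_span) once per distinct cell in a row.

    Assumption: a horizontally merged cell shows up in `row.cells` as the same
    object in consecutive positions, so consecutive repeats are collapsed by
    identity rather than by comparing text.
    """
    result = []
    previous = None
    for cell in row.cells:
        if cell is previous:
            continue  # same merged cell seen again in the next grid column
        result.append((cell.text, cell.grid_span))
        previous = cell
    return result
```

Comparing by identity rather than by text keeps two adjacent, unmerged cells that happen to hold the same text from being collapsed into one.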

I'm not sure what we'll do with Table._cells. It's possible that collection will be deprecated or perhaps we'll reimplement it based on the new "skipped-cell-aware" code, but for now it is probably better to avoid it in favor of the new methods.

aanastasiou commented 3 months ago

@scanny thank you very much for the prompt response. I was using the latest python-docx from PyPI. Has this recent update only been applied to the version on GitHub rather than the one on PyPI? Thanks for the rest of the information, it's good to know for our next code revision.

scanny commented 3 months ago

This change appears in v1.1.2, which is the current PyPI version, released on May 1, 2024:
https://pypi.org/project/python-docx/
https://github.com/python-openxml/python-docx/commit/f4a48b5565a3a09087f541e3ac36a447693927b4

aanastasiou commented 3 months ago

@scanny This is the version that I used (and that eventually led me to file this PR).

scanny commented 3 months ago

Show me the client code that isn't working the way you want.

aanastasiou commented 2 months ago

@scanny The PR contains the exact problem that I dealt with (and how I dealt with it). What might take longer is locating the exact document that causes this behaviour.

scanny commented 2 months ago

@aanastasiou the idea there is not that this problem with table._cells is fixed for your case, but rather that you should no longer need to use table._cells and can use something like (c for row in table.rows for c in row.cells).
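The generator expression above can be wrapped in a small helper. This is only a sketch: `iter_cells` and `all_cell_texts` are hypothetical names, and the code relies solely on `table.rows`, `row.cells` and `cell.text`, so rows of different lengths are simply yielded as they come.

```python
def iter_cells(table):
    """Yield every cell of `table`, row by row, without touching `table._cells`."""
    for row in table.rows:
        yield from row.cells


def all_cell_texts(table):
    """Collect the text of every cell in reading order."""
    return [cell.text for cell in iter_cells(table)]
```

Because each row contributes only the cells it actually has, no row needs a "previous cell" to fall back on.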

If you can post the code you're using to traverse cells, the code that gives rise to the error you mention, I expect I'll be able to describe how to modify it to avoid any exceptions for uneven row lengths.