Closed HiGregSmith closed 3 years ago
What are the semantics of those elements?
Microsoft Word for Microsoft 365 MSO (16.0.13231.20372) 64-bit XML Excerpt:
I've added a small number of details here https://github.com/python-openxml/python-docx/pull/564#issuecomment-710678543
I thought about fixing this by adding support on the Row class for gridAfter and gridBefore. Then the issue is how to expose the row and cell indexing to the user. It seems like the table cells are in a linear list, repeated for each gridSpan. It might be plausible to repeat the first cell in a row a number of times equal to gridBefore + gridSpan (defaulting to 0 and 1 respectively) and repeat the last cell in a row a number of times equal to gridAfter + gridSpan (again, defaulting to 0 and 1 respectively).
However, I would prefer to iterate through the cells in a row and get a single cell element when that cell contains a gridSpan. Possibly exposing a gridSpan property on the Cell. If cell access was modified in this manner, then the Row could expose gridAfter and gridBefore properties and cell iteration would still only return a single Cell instance for each iteration, effectively "skipping" the cells that would otherwise exist for gridAfter, gridBefore, and any gridSpan values.
Yeah, the interface design for cell access is tricky and the current implementation is what we settled on after a lot of consideration and experimentation. Our primary concern is making it accessible for most folks for the most common cases, and keeping the cognitive load low.
I'm inclined to think that any alternative cell access scheme at least start out life as a parallel implementation. Then maybe if it's clearly simpler for folks to understand maybe we replace what's there, but I doubt we'd ever get there. Tables are just complicated things and most folks only care about the simple cases.
Anyway, you might want to create an alternate proxy object for tables, perhaps one that subclasses the existing Table
object, and give it access semantics that suit your use case. Looking at how the Microsoft API handles grid-offsets could possibly be interesting.
One possibility might be a AltTable.iter_cells()
method which generates one _AltCell
object for each actual <w:tc>
element in the table and then has properties for things like its grid-row, grid-column, horz-span, vert-span, etc. and then leaves the working out of what's happening to the caller without trying to fit them neatly into a matrix concept.
I was also thinking about a parallel access method as well so current users don't have to update code. I like your proposed implementation. I only need reading for the moment, so an iterator as you mention would be ideal. I have been exploring the object properties and getting access to the XML directly through the _tr and _tc objects. They seem to have access to the gridAfter XML element in my file and access to gridSpan element in the TcPr elements as well, so I think all the info is within the structure.
I'm having difficulty extracting text from the XML element. From a long time ago, I remember accessing text by looking at "pre" text, then descendants in order then "post" text. itertext() method on _tc seems to duplicate text in my case, but I'm still exploring. Thanks for your input!
I added the following definition to class Table in table.py:
def iter_row_cells(self, row_idx):
"""
iteration over row cells with ONE cell per actual table cell. This does not repeat cells with gridSpan and properly handles (by ignoring) gridAfter and gridBefore.
Use of this method results in a ragged array of cells:
A potentially different number of cells per row.
"""
# HiGregSmith added
for c in self._tbl.tr_lst[row_idx].tc_lst:
yield _Cell(c,self)
Then to use this call, I needed to get the max row index. Ideally, this would be abstracted, but this seems the most direct way to access by row number rather than an iterator for row and for cell within a row. Note also that this is more properly called a "generator" and only creates Cell objects when needed rather than creating them all at once for the row (or table, for that matter).
documentlist is a list of filenames each of which is suitable for using when calling Document(). This example prints out the text in each cell with a maximum horizontal spacing and a vertical line with all the cells in each row appearing on one line:
import docx
from pprint import pprint
for docname in documentlist:
document = docx.Document(docname)
for table in document.tables:
for row_idx in range(len(table._tbl.tr_lst)):
for c in table.iter_row_cells(row_idx):
print ('{!r: ^{fieldsize}.{fieldsize}}|'.format(c.text,fieldsize=19),end='')
print ('')
print('-'*40)
Edit: Here's another way to define iter_row_cells using map iterator instead of creating a generator with yield:
return map(lambda c: _Cell(c,self), self._tbl.tr_lst[row_idx].tc_lst)
Edit2: Here's how to get a row iterator of cell iterators:
rowiter = map(table.iter_row_cells,range(len(table._tbl.tr_lst)))
Or All In One (if iter_row_cells()
is not available):
rowiter = map(map(table.iter_row_cells,range(len(table._tbl.tr_lst))),range(len(table._tbl.tr_lst)))
@HiGregSmith Nice :) Thanks for posting what you came up with :)
Function for correct retrieve real cells from row:
def real_row_cells(row:'_Row') -> 'Sequence[_Cell]':
from docx.oxml.simpletypes import ST_Merge
from docx.table import _Cell
cells = []
table:'Table' = row.table
drows = getattr(table, '_real_rows', {})
if not drows:
table._real_rows = drows = {}
row_ind = row._index
if ocells := drows.get(row_ind):
return ocells
for i, tc in enumerate(row._tr.tc_lst):
for grid_span_idx in range(tc.grid_span):
if tc.vMerge == ST_Merge.CONTINUE:
prev_cels = real_row_cells(table.rows[row_ind - 1])
cells.append(prev_cels[i])
elif grid_span_idx > 0:
cells.append(cells[-1])
else:
cells.append(_Cell(tc, table))
ocells = tuple(cells)
drows[row_ind] = ocells
return ocells
Function for correct retrieve real cells from row:
This does not work for me.
I added the following definition to class Table in table.py:
def iter_row_cells(self, row_idx):
And here is a version that does not need monkey patching:
def table_itercells(table):
for row_idx in range(len(table._tbl.tr_lst)):
for cell in table._tbl.tr_lst[row_idx].tc_lst:
yield _Cell(cell, table)
The following function table_iterrows()
helped me to parse table rows correctly
from typing import Tuple
from dataclasses import dataclass
from docx.table import Table, _Cell
@dataclass
class _SimpleRow:
cells: Tuple[_Cell, ...]
def table_iterrows(table: Table) -> _SimpleRow:
for row_idx in range(len(table._tbl.tr_lst)):
cells = []
for cell in table._tbl.tr_lst[row_idx].tc_lst:
cells.append(_Cell(cell, table))
yield _SimpleRow(tuple(cells))
It returns a _SimpleRow
dataclass, so that you can simply replace code like:
for item in doc.iter_inner_content():
if isinstance(item, docx.table.Table):
for row in item.row:
for cell in row.cells:
print(cell.text)
with
for item in doc.iter_inner_content():
if isinstance(item, docx.table.Table):
for row in table_iterrows(item):
for cell in row.cells:
print(cell.text)
When iterating over cells within rows, the index calculations for cells and rows within the table are off by the accumulated w:gridBefore and w:gridAfter from prior rows.