python-openxml / python-docx

Create and modify Word documents with Python
MIT License
4.61k stars 1.13k forks source link

Does not correctly parse table rows with w:gridBefore or w:gridAfter elements #881

Closed HiGregSmith closed 3 years ago

HiGregSmith commented 4 years ago

When iterating over cells within rows, the index calculations for cells and rows within the table are off by the accumulated w:gridBefore and w:gridAfter from prior rows.

scanny commented 4 years ago

What are the semantics of those elements?

HiGregSmith commented 4 years ago

Microsoft Word for Microsoft 365 MSO (16.0.13231.20372) 64-bit XML Excerpt:

https://docs.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.gridafter?view=openxml-2.8.1 When the object is serialized out as xml, its qualified name is w:gridAfter. Remarks [ISO/IEC 29500-1 1st Edition] gridAfter (Grid Columns After Last Cell) This element specifies the number of grid columns in the parent table's table grid (§17.4.49; §17.4.48) which shall be left after the last cell in the table row. If this element conflicts with the remaining size of the document grid after all table cells in this row have been added to the grid, then it shall be ignored. If this element is not specified, then its value shall be assumed to be zero grid units. https://docs.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.gridbefore?view=openxml-2.8.1 When the object is serialized out as xml, its qualified name is w:gridBefore. Remarks [ISO/IEC 29500-1 1st Edition] gridBefore (Grid Columns Before First Cell) This element specifies the number of grid columns in the parent table's table grid (§17.4.49; §17.4.48) which must be skipped before the contents of this table row (its table cells) are added to the parent table. [Note: This property is used to specify tables whose leading edge (left for left-to-right tables, right for right-to-left tables) does not start at the first grid column (the same shared edge). end note] If this element is omitted, then its value shall be assumed to be zero grid units. If this element's value is larger than the size of the table grid, then the value shall be ignored and the first cell in the row can span the full table grid (i.e. the second cell, if one exists, should start at the last shared edge in the table).
HiGregSmith commented 4 years ago

I've added a small number of details here https://github.com/python-openxml/python-docx/pull/564#issuecomment-710678543

HiGregSmith commented 4 years ago

I thought about fixing this by adding support on the Row class for gridAfter and gridBefore. Then the issue is how to expose the row and cell indexing to the user. It seems like the table cells are in a linear list, repeated for each gridSpan. It might be plausible to repeat the first cell in a row a number of times equal to gridBefore + gridSpan (defaulting to 0 and 1 respectively) and repeat the last cell in a row a number of times equal to gridAfter + gridSpan (again, defaulting to 0 and 1 respectively).

However, I would prefer to iterate through the cells in a row and get a single cell element when that cell contains a gridSpan. Possibly exposing a gridSpan property on the Cell. If cell access was modified in this manner, then the Row could expose gridAfter and gridBefore properties and cell iteration would still only return a single Cell instance for each iteration, effectively "skipping" the cells that would otherwise exist for gridAfter, gridBefore, and any gridSpan values.

scanny commented 4 years ago

Yeah, the interface design for cell access is tricky and the current implementation is what we settled on after a lot of consideration and experimentation. Our primary concern is making it accessible for most folks for the most common cases, and keeping the cognitive load low.

I'm inclined to think that any alternative cell access scheme at least start out life as a parallel implementation. Then maybe if it's clearly simpler for folks to understand maybe we replace what's there, but I doubt we'd ever get there. Tables are just complicated things and most folks only care about the simple cases.

Anyway, you might want to create an alternate proxy object for tables, perhaps one that subclasses the existing Table object, and give it access semantics that suit your use case. Looking at how the Microsoft API handles grid-offsets could possibly be interesting.

One possibility might be a AltTable.iter_cells() method which generates one _AltCell object for each actual <w:tc> element in the table and then has properties for things like its grid-row, grid-column, horz-span, vert-span, etc. and then leaves the working out of what's happening to the caller without trying to fit them neatly into a matrix concept.

HiGregSmith commented 4 years ago

I was also thinking about a parallel access method as well so current users don't have to update code. I like your proposed implementation. I only need reading for the moment, so an iterator as you mention would be ideal. I have been exploring the object properties and getting access to the XML directly through the _tr and _tc objects. They seem to have access to the gridAfter XML element in my file and access to gridSpan element in the TcPr elements as well, so I think all the info is within the structure.

I'm having difficulty extracting text from the XML element. From a long time ago, I remember accessing text by looking at "pre" text, then descendants in order then "post" text. itertext() method on _tc seems to duplicate text in my case, but I'm still exploring. Thanks for your input!

HiGregSmith commented 4 years ago

I added the following definition to class Table in table.py:

    def iter_row_cells(self, row_idx):
        """
        iteration over row cells with ONE cell per actual table cell. This does not repeat cells with gridSpan and properly handles (by ignoring) gridAfter and gridBefore.
        Use of this method results in a ragged array of cells: 
        A potentially different number of cells per row.
        """
        # HiGregSmith added
        for c in self._tbl.tr_lst[row_idx].tc_lst:
            yield _Cell(c,self)

Then to use this call, I needed to get the max row index. Ideally, this would be abstracted, but this seems the most direct way to access by row number rather than an iterator for row and for cell within a row. Note also that this is more properly called a "generator" and only creates Cell objects when needed rather than creating them all at once for the row (or table, for that matter).

documentlist is a list of filenames each of which is suitable for using when calling Document(). This example prints out the text in each cell with a maximum horizontal spacing and a vertical line with all the cells in each row appearing on one line:

import docx
from pprint import pprint

for docname in documentlist:
    document = docx.Document(docname)
    for table in document.tables:
        for row_idx in range(len(table._tbl.tr_lst)):
            for c in table.iter_row_cells(row_idx):
                print ('{!r: ^{fieldsize}.{fieldsize}}|'.format(c.text,fieldsize=19),end='')
            print ('')
    print('-'*40)

Edit: Here's another way to define iter_row_cells using map iterator instead of creating a generator with yield:

return map(lambda c: _Cell(c,self), self._tbl.tr_lst[row_idx].tc_lst)

Edit2: Here's how to get a row iterator of cell iterators:

rowiter = map(table.iter_row_cells,range(len(table._tbl.tr_lst)))

Or All In One (if iter_row_cells() is not available):

rowiter = map(map(table.iter_row_cells,range(len(table._tbl.tr_lst))),range(len(table._tbl.tr_lst)))

scanny commented 3 years ago

@HiGregSmith Nice :) Thanks for posting what you came up with :)

tonal commented 1 year ago

Function for correct retrieve real cells from row:

def real_row_cells(row:'_Row') -> 'Sequence[_Cell]':
  from docx.oxml.simpletypes import ST_Merge
  from docx.table import _Cell

  cells = []
  table:'Table' = row.table

  drows = getattr(table, '_real_rows', {})
  if not drows:
    table._real_rows = drows = {}

  row_ind = row._index
  if ocells := drows.get(row_ind):
    return ocells

  for i, tc in enumerate(row._tr.tc_lst):
    for grid_span_idx in range(tc.grid_span):
      if tc.vMerge == ST_Merge.CONTINUE:
        prev_cels = real_row_cells(table.rows[row_ind - 1])
        cells.append(prev_cels[i])
      elif grid_span_idx > 0:
        cells.append(cells[-1])
      else:
        cells.append(_Cell(tc, table))

  ocells = tuple(cells)
  drows[row_ind] = ocells
  return ocells
caramdache commented 1 year ago

Function for correct retrieve real cells from row:

This does not work for me.

caramdache commented 1 year ago

I added the following definition to class Table in table.py:

    def iter_row_cells(self, row_idx):

And here is a version that does not need monkey patching:

def table_itercells(table):
    for row_idx in range(len(table._tbl.tr_lst)):
        for cell in table._tbl.tr_lst[row_idx].tc_lst:
            yield _Cell(cell, table)
asmaier commented 7 months ago

The following function table_iterrows() helped me to parse table rows correctly

from typing import Tuple
from dataclasses import dataclass

from docx.table import Table, _Cell

@dataclass
class _SimpleRow:
    cells: Tuple[_Cell, ...]

def table_iterrows(table: Table) -> _SimpleRow:
    for row_idx in range(len(table._tbl.tr_lst)):
        cells = []
        for cell in table._tbl.tr_lst[row_idx].tc_lst:
            cells.append(_Cell(cell, table))
        yield _SimpleRow(tuple(cells))

It returns a _SimpleRow dataclass, so that you can simply replace code like:

for item in doc.iter_inner_content():
    if isinstance(item, docx.table.Table):
        for row in item.row:
            for cell in row.cells:
                print(cell.text)

with

for item in doc.iter_inner_content():
    if isinstance(item, docx.table.Table):
        for row in table_iterrows(item):
            for cell in row.cells:
                print(cell.text)