Open ghost opened 3 years ago
Tables can be nested, which would be my first guess. Also, tables can be in headers and footers and would need to be accessed there separately, although perhaps much less often containing interesting information.
A Table cell is a block item container (can contain paragraphs and tables (block-level items)). So you need something recursive like this to get them all:
def iter_tables(block_item_container):
    """Recursively generate all tables in `block_item_container`."""
    for t in block_item_container.tables:
        yield t
        for row in t.rows:
            for cell in row.cells:
                yield from iter_tables(cell)


for t in iter_tables(document):
    do_table_thing(t)
I incorporated your suggestion:
import pandas as pd


def iter_tables(block_item_container):
    """Recursively generate all tables in `block_item_container`."""
    for t in block_item_container.tables:
        yield t
        for row in t.rows:
            for cell in row.cells:
                yield from iter_tables(cell)


dfs = []
for table in iter_tables(document):
    # Start from an empty grid sized to the table, then fill in the cell text.
    df = [['' for _ in range(len(table.columns))] for _ in range(len(table.rows))]
    for i, row in enumerate(table.rows):
        for j, cell in enumerate(row.cells):
            if cell.text:
                df[i][j] = cell.text.replace('\n', '')
    dfs.append(pd.DataFrame(df))
I hope I understood and applied it correctly. This does not change the output however. Based on a manual examination of the docx it does not seem that tables are in headers or footers.
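One way to double-check the header/footer observation without opening Word is to count the `<w:tbl>` elements in each part of the package directly. The following is a minimal sketch using only the standard library; the function name `count_tables_by_part` is hypothetical, and it assumes the conventional part names (`word/document.xml`, `word/header1.xml`, `word/footer1.xml`, and so on), which a given file is not guaranteed to follow.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# Fully qualified tag name of a WordprocessingML table element.
W_TBL = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}tbl"

# Assumption: the main document, headers and footers use these conventional names.
PART_RE = re.compile(r"word/(document|header\d*|footer\d*)\.xml")


def count_tables_by_part(docx_file):
    """Return {part name: number of <w:tbl> elements at any depth in that part}."""
    counts = {}
    with zipfile.ZipFile(docx_file) as zf:
        for name in zf.namelist():
            if PART_RE.fullmatch(name):
                root = ET.fromstring(zf.read(name))
                # root.iter() walks the whole tree, so nested tables are counted too.
                counts[name] = sum(1 for _ in root.iter(W_TBL))
    return counts
```

If the count for `word/document.xml` is larger than the number of tables yielded by `iter_tables(document)`, some tables are probably sitting inside elements that are not searched for tables.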
What patterns do you notice? Which tables are being skipped?
One possibility that occurs to me is that you have pending revisions. Any tables that are "inside" a pending revision will not be enumerated. Doing "Accept all revisions" and turning off revision marks in the document should take care of that possibility.
I manually opened the docx, then selected 'Review' and clicked on 'Accept all Changes' and also 'Accept All Changes and Stop Tracking'. Further I can see the docx contains 0 Revisions. The output did not change.
I did not notice any pattern; which tables are recognised seems pretty random.
I'd say the next step is to inspect the XML and look for elements that might be surrounding the <w:tbl> elements and thereby "hiding" them from document.tables.
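As a rough sketch of that inspection, the snippet below reads `word/document.xml` straight out of the .docx package and tallies the parent element of every `<w:tbl>`. It uses only the standard library; the function name `tbl_parent_counts` is hypothetical. The interpretation rests on an assumption about python-docx: that `document.tables` and `cell.tables` only see tables that are direct children of `w:body` or `w:tc`, so any other parent (for example `sdtContent` for a content control, or `txbxContent` for a text box) would point at a wrapper that hides the table.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# Fully qualified tag name of a WordprocessingML table element.
W_TBL = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}tbl"


def tbl_parent_counts(docx_file):
    """Count the local names of the parent elements of every <w:tbl>."""
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    # ElementTree elements have no parent pointer, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    counts = Counter()
    for tbl in root.iter(W_TBL):
        # Strip the "{namespace}" prefix to keep only the local name, e.g. "body".
        counts[parent_of[tbl].tag.split("}")[-1]] += 1
    return counts
```

Calling `tbl_parent_counts("your_file.docx")` (the path is a placeholder) returns something like a `Counter` keyed by `body`, `tc`, and whatever wrapper elements are present; the non-`body`, non-`tc` keys are the ones worth looking at in the XML.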
Thank you. I will try it
Hi guys,
Regarding this issue: have you solved the problem? I am having the same issue. I have a Word .docx file that contains several tables, and the library doesn't detect any of them. My function is:
import logging
import traceback

from docx import Document


def format_tables_and_paragraphs(self, docfinal):
    try:
        # Open the generated document
        doc = Document(self.docpath)
        # Iterate over all tables in the document
        for table in doc.tables:
            for row in table.rows:
                for cell in row.cells:
                    # Iterate over each paragraph inside the cell
                    for paragraph in cell.paragraphs:
                        for run in paragraph.runs:
                            apply_formatting(run)  # Apply formatting based on prefixes
        doc.save(docfinal)
    except Exception as e:
        tb = traceback.format_exc(1)
        logging.error('ERROR Exception in edit_word the error is: '+repr(e) + " - args - "+repr(tb))
Thanks
Hi there, I am using python-docx 0.8.11 and Python 3.8. I want to get ALL tables contained in a relatively large Word .docx file. I am using this code:
However it seems that a lot of tables are not extracted. Do you have any idea why that might be? Thank you very much!