python-openxml / python-docx

Create and modify Word documents with Python
MIT License

Python-docx does not recognise/extract all tables in docx #1015

Open ghost opened 3 years ago

ghost commented 3 years ago

Hi there, I am using python-docx 0.8.11 and Python 3.8. I want to get ALL tables contained in a relatively large Word .docx file. I am using this code:

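import pandas as pd

# `document` is a docx.Document that has already been opened.
dfs = []
for table in document.tables:
    # Pre-fill a rows x columns grid with empty strings.
    df = [['' for _ in range(len(table.columns))] for _ in range(len(table.rows))]
    for i, row in enumerate(table.rows):
        for j, cell in enumerate(row.cells):
            if cell.text:
                # Flatten multi-paragraph cells onto a single line.
                df[i][j] = cell.text.replace('\n', '')
    dfs.append(pd.DataFrame(df))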

However it seems that a lot of tables are not extracted. Do you have any idea why that might be? Thank you very much!

scanny commented 3 years ago

Tables can be nested, which would be my first guess. Tables can also appear in headers and footers, where they have to be accessed separately, although those probably contain interesting information much less often.

A table cell is a block-item container: it can contain block-level items, meaning both paragraphs and tables. So you need something recursive like this to get them all:

def iter_tables(block_item_container):
    """Recursively generate all tables in `block_item_container`."""
    for t in block_item_container.tables:
        yield t
        for row in t.rows:
            for cell in row.cells:
                yield from iter_tables(cell)

for t in iter_tables(document):
    do_table_thing(t)

ghost commented 3 years ago

I incorporated your suggestion:

def iter_tables(block_item_container):
    """Recursively generate all tables in `block_item_container`."""
    for t in block_item_container.tables:
        yield t
        for row in t.rows:
            for cell in row.cells:
                yield from iter_tables(cell)

dfs = []
for table in iter_tables(document):
    df = [['' for _ in range(len(table.columns))] for _ in range(len(table.rows))]
    for i, row in enumerate(table.rows):
        for j, cell in enumerate(row.cells):
            if cell.text:
                df[i][j] = cell.text.replace('\n', '')
    dfs.append(pd.DataFrame(df))

I hope I understood and applied it correctly. This does not change the output however. Based on a manual examination of the docx it does not seem that tables are in headers or footers.
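The header and footer check described above can also be done in code rather than by eye. The following is only a sketch: `iter_all_tables` is a hypothetical helper name, and it assumes a python-docx version in which each section exposes `header` and `footer` objects with `tables` and `is_linked_to_previous` attributes. Headers and footers that are linked to the previous section are skipped so that nothing is yielded twice.

```python
def iter_tables(block_item_container):
    """Recursively generate all tables in `block_item_container`."""
    for t in block_item_container.tables:
        yield t
        for row in t.rows:
            for cell in row.cells:
                yield from iter_tables(cell)


def iter_all_tables(document):
    """Generate body tables, then header/footer tables of each section.

    Hypothetical helper: assumes `section.header` and `section.footer`
    are block-item containers, as in recent python-docx releases.
    """
    yield from iter_tables(document)
    for section in document.sections:
        for part in (section.header, section.footer):
            # A linked header/footer only repeats the previous section's
            # content, so skip it to avoid duplicates.
            if part.is_linked_to_previous:
                continue
            yield from iter_tables(part)
```

If `len(list(iter_all_tables(document)))` equals `len(list(iter_tables(document)))`, the missing tables are not in a header or footer and the cause lies elsewhere.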

scanny commented 3 years ago

What patterns do you notice? Which tables are being skipped?

One possibility that occurs to me is that you have pending revisions. Any tables that are "inside" a pending revision will not be enumerated. Doing "Accept all revisions" and turning off revision marks in the document should take care of that possibility.

ghost commented 3 years ago

I manually opened the docx, then selected 'Review' and clicked on 'Accept all Changes' and also 'Accept All Changes and Stop Tracking'. Further I can see the docx contains 0 Revisions. The output did not change.

I did not notice any pattern; which tables are recognised seems pretty random.

scanny commented 3 years ago

I'd say the next step is to inspect the XML and look for elements that might be surrounding the <w:tbl> elements and thereby "hiding" them from document.tables.
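One way to do that inspection without opening the XML by hand is sketched below. It uses only the standard library and does not rely on python-docx at all: it reads `word/document.xml` out of the .docx archive and counts, for every `<w:tbl>`, the chain of elements that encloses it. The function name `table_ancestor_paths` is made up for this illustration, and the approach assumes the tables of interest live in the main document part rather than in a header or footer part.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# WordprocessingML main namespace (the `w:` prefix in the XML).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def _local(tag):
    """Strip the '{namespace}' part from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def table_ancestor_paths(docx_file):
    """Count the ancestor path of every <w:tbl> in word/document.xml.

    `docx_file` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    paths = Counter()

    def walk(elem, trail):
        trail = trail + [_local(elem.tag)]
        if elem.tag == f"{{{W_NS}}}tbl":
            # Record everything above the table, not the table itself.
            paths["/".join(trail[:-1])] += 1
        for child in elem:
            walk(child, trail)

    walk(root, [])
    return paths
```

Printing the result, for example with `for path, n in table_ancestor_paths("report.docx").items(): print(n, path)` (the file name is a placeholder), shows where each table sits. Tables under `document/body` are the ones `document.tables` enumerates, and paths ending in `tbl/tr/tc` are nested tables. Any other path names the wrapper element that is hiding the table; a content control (`sdt/sdtContent`) is one plausible candidate, though nothing in this thread confirms that is the cause here.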

ghost commented 3 years ago

Thank you. I will try it

srPuebla commented 2 months ago

Hi Guys,

Regarding this comment:

Thank you. I will try it

Have you solved the problem? I am having the same issue. I have a Word .docx file that contains several tables, and the library doesn't detect any of them. My function is:


import logging
import traceback

from docx import Document

def format_tables_and_paragraphs(self, docfinal):
    try:
        # Open the generated document
        doc = Document(self.docpath)

        # Iterate over all tables in the document
        for table in doc.tables:
            for row in table.rows:
                for cell in row.cells:
                    # Iterate over each paragraph inside the cell
                    for paragraph in cell.paragraphs:
                        for run in paragraph.runs:
                            apply_formatting(run)  # Apply formatting based on prefixes

        doc.save(docfinal)

    except Exception as e:
        tb = traceback.format_exc(1)
        logging.error('ERROR Exception in edit_word the error is: ' + repr(e) + " - args - " + repr(tb))

Thanks