The .docx parser cannot handle images inside tables, and merged cells may produce duplicate content.

Self Checks

[X] This is only for bug reports; if you would like to ask a question, please head to Discussions.
[X] I have searched for existing issues, including closed ones.
[X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[X] [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
[X] Please do not modify this template :) and fill in all the required fields.

Dify version

0.6.15

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

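Upload a .docx file containing a table that has an image in a cell and cells merged across columns. The image inside the table is not parsed, and the content of a cell merged across columns is parsed repeatedly, once per column it spans.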

✔️ Expected Behavior

No response

❌ Actual Behavior

No response

@yanghx-git Hello there! I'm here to assist you with any bugs, questions, or contributions while we wait for a human maintainer. Let me know how I can help you today.

To resolve the issue with the .docx file parser in Dify version 0.6.15, where images within table cells are not parsed and content in merged cells is duplicated, you need to make the following changes:

Ensure images within table cells are parsed: Modify the _table_to_markdown method to include image extraction logic.
Handle merged cells properly: Update the _table_to_markdown method to avoid duplicating content in merged cells.

Here is the updated code for the _table_to_markdown method:

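def _table_to_markdown(self, table, image_map):
    markdown = ""
    # deal with table headers
    header_row = table.rows[0]
    headers = self._parse_row(header_row, image_map)
    markdown += "| " + " | ".join(headers) + " |\n"
    markdown += "| " + " | ".join(["---"] * len(headers)) + " |\n"
    # deal with table rows
    for row in table.rows[1:]:
        row_cells = self._parse_row(row, image_map)
        markdown += "| " + " | ".join(row_cells) + " |\n"

    return markdown

def _parse_row(self, row, image_map):
    # python-docx reports a horizontally merged cell once per grid column it
    # spans, so emit its content only the first time and leave the remaining
    # slots empty to keep the column count stable. Comparing the private `_tc`
    # element relies on python-docx internals, not on a documented API.
    # Vertically merged cells are not handled here.
    parsed = []
    previous_tc = None
    for cell in row.cells:
        if cell._tc is previous_tc:
            parsed.append("")
        else:
            parsed.append(self._parse_cell(cell, image_map))
        previous_tc = cell._tc
    return parsed

def _parse_cell(self, cell, image_map):
    # NOTE: `_parse_paragraph` is assumed to be a method equivalent of the
    # nested `parse_paragraph` helper in `parse_docx`; it is not defined here.
    cell_content = []
    for paragraph in cell.paragraphs:
        cell_content.append(self._parse_paragraph(paragraph, image_map))
    return ' '.join(cell_content).strip()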
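
The merged-cell handling can be illustrated in isolation, without python-docx. The sketch below is only an illustration under assumptions: each cell is modelled as a hypothetical `(identity, text)` pair, where a merged cell shows up several times in a row with the same identity. The names `dedupe_row` and `rows_to_markdown` are made up for this example and are not part of Dify.

```python
from typing import Hashable, List, Sequence, Tuple

# Hypothetical stand-in for a parsed cell: (identity, text). With python-docx
# the identity could be the cell's underlying XML element; that mapping is an
# assumption and is not exercised by this sketch.
Cell = Tuple[Hashable, str]


def dedupe_row(cells: Sequence[Cell]) -> List[str]:
    """Keep a merged cell's text once and blank the repeated slots."""
    texts: List[str] = []
    previous_id = object()  # sentinel that never equals a real identity
    for cell_id, text in cells:
        texts.append("" if cell_id == previous_id else text)
        previous_id = cell_id
    return texts


def rows_to_markdown(rows: Sequence[Sequence[Cell]]) -> str:
    """Render rows as a Markdown table, treating the first row as the header."""
    lines = []
    for index, row in enumerate(rows):
        texts = dedupe_row(row)
        lines.append("| " + " | ".join(texts) + " |")
        if index == 0:
            lines.append("| " + " | ".join(["---"] * len(texts)) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    table = [
        [("h1", "Name"), ("h2", "Value"), ("h3", "Note")],
        # "m" spans the first two columns, so it appears twice in the row.
        [("m", "merged"), ("m", "merged"), ("c", "ok")],
    ]
    print(rows_to_markdown(table))
```

Blanking the repeated slots, rather than dropping them, keeps every row at the same column count, which Markdown tables need in order to render consistently.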

Additionally, ensure that the parse_docx method calls _table_to_markdown with the image_map parameter:

import os

from docx import Document as DocxDocument  # assumed alias for python-docx's Document

def parse_docx(self, docx_path, image_folder):
    doc = DocxDocument(docx_path)
    os.makedirs(image_folder, exist_ok=True)

    content = []

    image_map = self._extract_images_from_docx(doc, image_folder)

    def parse_paragraph(paragraph):
        paragraph_content = []
        for run in paragraph.runs:
            if run.element.tag.endswith('r'):
                drawing_elements = run.element.findall(
                    './/{http://schemas.openxmlformats.org/wordprocessingml/2006/main}drawing')
                for drawing in drawing_elements:
                    blip_elements = drawing.findall(
                        './/{http://schemas.openxmlformats.org/drawingml/2006/main}blip')
                    for blip in blip_elements:
                        embed_id = blip.get(
                            '{http://schemas.openxmlformats.org/officeDocument/2006/relationships}embed')
                        if embed_id:
                            image_part = doc.part.related_parts.get(embed_id)
                            if image_part in image_map:
                                paragraph_content.append(image_map[image_part])
            if run.text.strip():
                paragraph_content.append(run.text.strip())
        return ''.join(paragraph_content) if paragraph_content else ''

    paragraphs = doc.paragraphs.copy()
    tables = doc.tables.copy()
    for element in doc.element.body:
        if element.tag.endswith('p'):  # paragraph
            para = paragraphs.pop(0)
            parsed_paragraph = parse_paragraph(para)
            if parsed_paragraph:
                content.append(parsed_paragraph)
        elif element.tag.endswith('tbl'):  # table
            table = tables.pop(0)
            content.append(self._table_to_markdown(table, image_map))
    return '\n'.join(content)
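
The namespace-qualified lookups used in `parse_paragraph` can be tried on a hand-written XML fragment with only the standard library. This is a minimal sketch under assumptions: the cell XML is heavily simplified, the relationship ID `rId7` is made up, and `embedded_image_ids` is a hypothetical helper, not a Dify or python-docx function.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as they appear in the parser code above.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

# Hand-written, simplified table cell. Real documents nest more elements
# between w:drawing and a:blip. "rId7" is a made-up relationship ID.
CELL_XML = f"""
<w:tc xmlns:w="{W}" xmlns:a="{A}" xmlns:r="{R}">
  <w:p>
    <w:r><w:t>before</w:t></w:r>
    <w:r>
      <w:drawing>
        <a:graphic><a:blip r:embed="rId7"/></a:graphic>
      </w:drawing>
    </w:r>
  </w:p>
</w:tc>
"""


def embedded_image_ids(cell_xml: str) -> list:
    """Return the r:embed IDs of all images found inside w:drawing elements."""
    root = ET.fromstring(cell_xml)
    ids = []
    for drawing in root.iter(f"{{{W}}}drawing"):
        for blip in drawing.iter(f"{{{A}}}blip"):
            embed_id = blip.get(f"{{{R}}}embed")
            if embed_id:
                ids.append(embed_id)
    return ids


if __name__ == "__main__":
    print(embedded_image_ids(CELL_XML))
```

In the parser, each ID found this way would then be resolved through the document's relationships and looked up in `image_map`; that step is left out here because it needs a real .docx package.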

These changes should address the issues with images within table cells not being parsed and content in merged cells being duplicated ^[1].


langgenius / dify