langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

The .docx file parser cannot handle images within tables, and merged cells may produce duplicate content. #6902

Closed. yanghx-git closed this issue 1 month ago.

yanghx-git commented 1 month ago

Dify version

0.6.15

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Images inside table cells are not parsed, and the content of cells merged across columns is parsed repeatedly.

✔️ Expected Behavior

No response

❌ Actual Behavior

No response

dosubot[bot] commented 1 month ago

@yanghx-git Hello there! I'm here to assist you with any bugs, questions, or contributions while we wait for a human maintainer. Let me know how I can help you today.

To resolve the issue with the .docx file parser in Dify version 0.6.15, where images within table cells are not parsed and content in merged cells is duplicated, you need to make the following changes:

  1. Ensure images within table cells are parsed: Modify the _table_to_markdown method to include image extraction logic.
  2. Handle merged cells properly: Update the _table_to_markdown method to avoid duplicating content in merged cells.

Here is the updated code for the _table_to_markdown method:

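def _table_to_markdown(self, table, image_map):
    markdown = ""
    # Header row: the first row of the table
    header_row = table.rows[0]
    headers = self._parse_row(header_row, image_map)
    markdown += "| " + " | ".join(headers) + " |\n"
    markdown += "| " + " | ".join(["---"] * len(headers)) + " |\n"
    # Remaining rows
    for row in table.rows[1:]:
        row_cells = self._parse_row(row, image_map)
        markdown += "| " + " | ".join(row_cells) + " |\n"

    return markdown

def _parse_row(self, row, image_map):
    # python-docx reports a horizontally merged cell once per grid column it
    # spans. Parse it once and emit "" for the repeats so columns stay aligned.
    # Vertical merges are not handled here. Note: _tc is a private attribute.
    parsed = []
    previous_tc = None
    for cell in row.cells:
        if cell._tc is previous_tc:
            parsed.append("")
        else:
            parsed.append(self._parse_cell(cell, image_map))
        previous_tc = cell._tc
    return parsed

def _parse_cell(self, cell, image_map):
    # Assumes _parse_paragraph(paragraph, image_map) is available as a method
    # that returns the paragraph text with image references from image_map.
    cell_content = []
    for paragraph in cell.paragraphs:
        cell_content.append(self._parse_paragraph(paragraph, image_map))
    return ' '.join(cell_content).strip()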
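
For background on the merged-cell duplication: a cell merged across columns is stored once in the .docx XML with a gridSpan value, and a reader that expands it to one entry per grid column will see the same cell repeatedly. The following is a minimal sketch using only the standard library; the sample row, `grid_cells`, `cell_text`, and `row_texts` are hypothetical names for illustration, not Dify or python-docx code.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical two-column row whose only cell spans both columns.
ROW_XML = f"""
<w:tr xmlns:w="{W}">
  <w:tc>
    <w:tcPr><w:gridSpan w:val="2"/></w:tcPr>
    <w:p><w:r><w:t>Total</w:t></w:r></w:p>
  </w:tc>
</w:tr>
""".strip()


def grid_cells(tr):
    """Return one entry per grid column, repeating a cell across its span."""
    cells = []
    for tc in tr.findall(f"{{{W}}}tc"):
        span_el = tc.find(f"{{{W}}}tcPr/{{{W}}}gridSpan")
        span = int(span_el.get(f"{{{W}}}val")) if span_el is not None else 1
        cells.extend([tc] * span)
    return cells


def cell_text(tc):
    """Concatenate the text of every w:t element inside the cell."""
    return "".join(t.text or "" for t in tc.iter(f"{{{W}}}t"))


def row_texts(tr, dedupe):
    """Collect cell texts; with dedupe, blank out repeats of the same cell."""
    texts, previous = [], None
    for tc in grid_cells(tr):
        repeated = dedupe and tc is previous
        texts.append("" if repeated else cell_text(tc))
        previous = tc
    return texts


row = ET.fromstring(ROW_XML)
naive = row_texts(row, dedupe=False)
fixed = row_texts(row, dedupe=True)
print(naive)  # ['Total', 'Total']
print(fixed)  # ['Total', '']
```

Emitting an empty string for a repeated cell, instead of dropping it, keeps every Markdown row at the same column count as the header.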

Additionally, ensure that the parse_docx method calls _table_to_markdown with the image_map parameter:

# Module-level imports needed by parse_docx
import os

from docx import Document as DocxDocument

def parse_docx(self, docx_path, image_folder):
    doc = DocxDocument(docx_path)
    os.makedirs(image_folder, exist_ok=True)

    content = []

    image_map = self._extract_images_from_docx(doc, image_folder)

    # doc.paragraphs and doc.tables hold only top-level blocks, in document
    # order, so popping from the front keeps them in step with the body walk.
    paragraphs = doc.paragraphs.copy()
    tables = doc.tables.copy()
    for element in doc.element.body:
        if element.tag.endswith('}p'):  # paragraph
            para = paragraphs.pop(0)
            parsed_paragraph = self._parse_paragraph(para, image_map)
            if parsed_paragraph:
                content.append(parsed_paragraph)
        elif element.tag.endswith('}tbl'):  # table
            table = tables.pop(0)
            content.append(self._table_to_markdown(table, image_map))
    return '\n'.join(content)

def _parse_paragraph(self, paragraph, image_map):
    # Defined as a method (not a closure inside parse_docx) so that
    # _parse_cell can reuse it for paragraphs inside table cells.
    paragraph_content = []
    for run in paragraph.runs:
        if run.element.tag.endswith('r'):
            drawing_elements = run.element.findall(
                './/{http://schemas.openxmlformats.org/wordprocessingml/2006/main}drawing')
            for drawing in drawing_elements:
                blip_elements = drawing.findall(
                    './/{http://schemas.openxmlformats.org/drawingml/2006/main}blip')
                for blip in blip_elements:
                    embed_id = blip.get(
                        '{http://schemas.openxmlformats.org/officeDocument/2006/relationships}embed')
                    if embed_id:
                        # Resolve the relationship ID through the part that owns the paragraph
                        image_part = paragraph.part.related_parts.get(embed_id)
                        if image_part in image_map:
                            paragraph_content.append(image_map[image_part])
        if run.text.strip():
            # Keep the run's own whitespace so words in adjacent runs do not fuse
            paragraph_content.append(run.text)
    return ''.join(paragraph_content).strip()
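
To illustrate why running the same drawing lookup on cell paragraphs picks up table images, here is a small sketch that uses only the standard library. The XML is a simplified, hypothetical document body (real files nest the blip more deeply), and `embed_ids` and `classify_blocks` are illustrative names, not Dify functions.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

# Simplified body: one text paragraph, then a table whose cell holds an image.
BODY_XML = f"""
<w:body xmlns:w="{W}" xmlns:a="{A}" xmlns:r="{R}">
  <w:p><w:r><w:t>Intro</w:t></w:r></w:p>
  <w:tbl><w:tr><w:tc><w:p><w:r>
    <w:drawing><a:graphic><a:graphicData>
      <a:blip r:embed="rId7"/>
    </a:graphicData></a:graphic></w:drawing>
  </w:r></w:p></w:tc></w:tr></w:tbl>
</w:body>
""".strip()


def embed_ids(element):
    """Return the relationship IDs of every image referenced under element."""
    return [blip.get(f"{{{R}}}embed") for blip in element.iter(f"{{{A}}}blip")]


def classify_blocks(body):
    """Walk top-level blocks in order; report each tag and its image IDs."""
    blocks = []
    for child in body:
        local_name = child.tag.rsplit("}", 1)[-1]
        if local_name in ("p", "tbl"):
            blocks.append((local_name, embed_ids(child)))
    return blocks


body = ET.fromstring(BODY_XML)
blocks = classify_blocks(body)
print(blocks)  # [('p', []), ('tbl', ['rId7'])]
```

The relationship ID found inside the table is the same kind of key that the paragraph parser resolves through related_parts and image_map.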

These changes should allow images within table cells to be parsed. Duplicated content in merged cells is avoided only if repeated cells are skipped when each row is built [1].
