Text Extraction from docx and pptx files

simonschoe commented 3 months ago

Hi there, on the website you state that text can be extracted from all sorts of documents (e.g., docx and pptx): https://pymupdf4llm.readthedocs.io/en/latest/. Are there any examples of how best to proceed if I have docx and/or pptx files instead of PDF files?

hewliyang commented 3 months ago

Convert them to PDFs.

JorjMcKie commented 3 months ago

You can use them directly by passing their filenames, like "document.docx". The catch is that page sizes in these formats are fluid ("reflowable"), and no tables or text columns are recognized. It is therefore recommended to treat the full document as one large page, which is the default: height is set to None.
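
A minimal sketch of the suggestion above, assuming the installed pymupdf4llm accepts a `page_height` keyword in `to_markdown` (the comment only says "height"). The helper names `is_reflowable` and `office_to_markdown` and the optional `convert` argument are hypothetical and exist only for illustration; they are not part of the library.

```python
from pathlib import Path

# Suffixes mentioned in this thread; other reflowable formats are not covered.
REFLOWABLE_SUFFIXES = {".docx", ".pptx"}


def is_reflowable(filename: str) -> bool:
    """Return True if the file suffix is one of the formats from this thread."""
    return Path(filename).suffix.lower() in REFLOWABLE_SUFFIXES


def office_to_markdown(filename: str, convert=None) -> str:
    """Convert a document to Markdown, treating reflowable files as one page.

    `convert` is a hypothetical hook for swapping in another callable;
    by default pymupdf4llm.to_markdown is used.
    """
    if convert is None:
        import pymupdf4llm  # imported lazily so the helpers load without it

        convert = pymupdf4llm.to_markdown
    if is_reflowable(filename):
        # Assumption: page_height=None means "one large page". The comment
        # above says this is already the default, so it is passed explicitly
        # only to make the intent visible.
        return convert(filename, page_height=None)
    return convert(filename)
```

With pymupdf4llm installed, `office_to_markdown("document.docx")` would hand the filename straight to `to_markdown`, with no intermediate PDF step.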

simonschoe commented 3 months ago

> You can use them directly by passing their filenames, like "document.docx". The catch is that page sizes in these formats are fluid ("reflowable"), and no tables or text columns are recognized. It is therefore recommended to treat the full document as one large page, which is the default: height is set to None.

When I try reading the docx file directly, I currently obtain an AssertionError:

import pymupdf4llm
elements = pymupdf4llm.to_markdown("testfile.docx")

...

~\AppData\Roaming\Python\Python311\site-packages\pymupdf\__init__.py in ?(page, required)
    333         return page
    334     elif isinstance(page, mupdf.FzPage):
    335         ret = mupdf.pdf_page_from_fz_page(page)
    336         if required:
--> 337             assert ret.m_internal
    338         return ret
    339     elif page is None:
    340         assert 0, f'page is None'

AssertionError:

JorjMcKie commented 3 months ago

I just confirmed correct behavior using pymupdf4llm v0.0.10 and pymupdf v1.24.9. Transferring this thread to the Discussions tab.
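
One way to compare a local installation against the versions named above is sketched below. The thread only states that these two versions worked; treating them as a lower bound is an assumption, and the helper names (`parse_version`, `meets_minimum`, `installed_version`) are hypothetical.

```python
import re
from importlib import metadata

# Versions reported as working in this thread. Using them as a minimum
# is an assumption, not something the maintainers stated.
CONFIRMED = {"pymupdf4llm": "0.0.10", "pymupdf": "1.24.9"}


def parse_version(version: str) -> tuple:
    """Turn '1.24.9' into (1, 24, 9); stop at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare versions numerically, so that 0.0.10 ranks above 0.0.9."""
    return parse_version(installed) >= parse_version(minimum)


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name, minimum in CONFIRMED.items():
        found = installed_version(name)
        ok = found is not None and meets_minimum(found, minimum)
        print(f"{name}: installed={found}, confirmed={minimum}, ok={ok}")
```

The numeric comparison matters here because a plain string comparison would rank "0.0.9" above "0.0.10".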

pymupdf / RAG

Text Extraction from docx and pptx files #91