Closed simonschoe closed 3 months ago
convert them to PDFs
You can pass them directly by filename, for example "document.docx". The issue is that page sizes in these formats are fluid ("reflowable"), and no tables or text columns are recognized. It is therefore recommended to treat the full document as one large page, which is the default: height is set to None.
When I try reading the docx file directly, I currently obtain an AssertionError:
import pymupdf4llm
elements = pymupdf4llm.to_markdown("testfile.docx")
...
~\AppData\Roaming\Python\Python311\site-packages\pymupdf\__init__.py in ?(page, required)
333 return page
334 elif isinstance(page, mupdf.FzPage):
335 ret = mupdf.pdf_page_from_fz_page(page)
336 if required:
--> 337 assert ret.m_internal
338 return ret
339 elif page is None:
340 assert 0, f'page is None'
AssertionError:
I just confirmed correct behavior using pymupdf4llm v0.0.10 and pymupdf v1.24.9. Transferring this thread to Discussions tab.
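For anyone hitting the same AssertionError, here is a minimal sketch of how the installed versions could be compared against the ones confirmed above before calling to_markdown. It assumes the error is tied to older releases, which this thread does not actually establish. The helper names (parse_version, outdated_packages, docx_to_markdown) are made up for illustration and are not part of pymupdf4llm; only pymupdf4llm.to_markdown(filename) is taken from the posts here.

```python
import warnings
from importlib import metadata

# Versions reported as working in this thread (treated here as a baseline; an assumption).
KNOWN_GOOD = {"pymupdf4llm": (0, 0, 10), "pymupdf": (1, 24, 9)}


def parse_version(text):
    """Turn '1.24.9' into (1, 24, 9); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def outdated_packages(installed=None):
    """Return the packages that are older than the versions confirmed above.

    `installed` maps package name -> version string. If omitted, the
    current environment is queried via importlib.metadata.
    """
    if installed is None:
        installed = {}
        for name in KNOWN_GOOD:
            try:
                installed[name] = metadata.version(name)
            except metadata.PackageNotFoundError:
                installed[name] = "0"  # not installed counts as outdated
    return [
        name
        for name, baseline in KNOWN_GOOD.items()
        if parse_version(installed.get(name, "0")) < baseline
    ]


def docx_to_markdown(path):
    """Hypothetical wrapper: warn about old versions, then convert the file."""
    import pymupdf4llm  # imported lazily so the helpers above work without it

    stale = outdated_packages()
    if stale:
        warnings.warn(f"Older than the versions confirmed in this thread: {stale}")
    return pymupdf4llm.to_markdown(path)
```

Calling docx_to_markdown("testfile.docx") would then either convert the file or at least flag which package is older than the confirmed combination.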
Hi there, on the website you state that text can be extracted from all sorts of documents (e.g., docx and pptx): https://pymupdf4llm.readthedocs.io/en/latest/. Are there any examples of how best to proceed if I have docx and/or pptx files instead of PDF files?