pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF
https://pymupdf.readthedocs.io/en/latest/pymupdf4llm
GNU Affero General Public License v3.0

Table formatting / Table format extraction issue #69

Closed mk-docenty closed 2 months ago

mk-docenty commented 2 months ago

Hi, first of all, thank you for this wonderful tool. There is one issue, though: when extracting some complex tables, the table format is not preserved.

I am attaching my code, extracted markdown text and sample pages as well.

import pathlib
import pymupdf4llm

md_text = pymupdf4llm.to_markdown("sample-11.pdf")
pathlib.Path("sample-11.md").write_bytes(md_text.encode())

Please let me know how to solve this.

Thank you

sample-8.md sample-8.pdf sample-11.md sample-11.pdf

JorjMcKie commented 2 months ago

You can influence the table detection strategy like this:

import pymupdf4llm, sys
from pathlib import Path

filename = sys.argv[1]
mdtext = pymupdf4llm.to_markdown(filename, table_strategy="lines")
Path(filename + ".md").write_bytes(mdtext.encode())

The default strategy in this package is "lines_strict", which ignores background colors that have no defined border. My modification above switches to "lines", which is PyMuPDF's default and leads to better results in your case.
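To check which of the two strategies works better for a given file, one option is to run the conversion once per strategy and count the lines that look like Markdown table rows. The following is only a minimal sketch: `count_table_rows` and `compare_strategies` are hypothetical helper names, the row-counting heuristic is an assumption rather than anything the package provides, and the `convert` parameter exists only so the logic can be tried without a PDF at hand.

```python
from typing import Callable, Dict, Iterable, Optional


def count_table_rows(md_text: str) -> int:
    """Count lines that start and end with '|' (a rough Markdown table-row heuristic)."""
    rows = 0
    for line in md_text.splitlines():
        stripped = line.strip()
        if len(stripped) > 1 and stripped.startswith("|") and stripped.endswith("|"):
            rows += 1
    return rows


def compare_strategies(
    filename: str,
    strategies: Iterable[str] = ("lines_strict", "lines"),
    convert: Optional[Callable[..., str]] = None,
) -> Dict[str, int]:
    """Return {strategy: number of table-like rows} for each strategy tried."""
    if convert is None:
        # Imported lazily so the helper above stays usable without the package.
        import pymupdf4llm

        convert = pymupdf4llm.to_markdown
    return {
        strategy: count_table_rows(convert(filename, table_strategy=strategy))
        for strategy in strategies
    }
```

Calling `compare_strategies("sample-11.pdf")` would return one count per strategy. A higher count is only a hint that more table structure was detected, so the Markdown output should still be inspected by eye.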

Please keep in mind that table detection is never perfect. There is no "silver bullet": even when employing AI / ML tools, there will be failures ...

mk-docenty commented 2 months ago

Thank you so much for the prompt response.