You can influence the table detection strategy like this:
import pymupdf4llm, sys
from pathlib import Path
filename = sys.argv[1]
mdtext = pymupdf4llm.to_markdown(filename, table_strategy="lines")
Path(filename + ".md").write_bytes(mdtext.encode())
The default strategy in this package is "lines_strict", which ignores background-color fills that have no drawn border. My modification above switches to "lines", PyMuPDF's default, which should lead to better results in your case.
Please keep in mind that table detection is never perfect. There is no "silver bullet": even AI / ML tools will produce failures ...
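If it is unclear which strategy suits a given file, one option is to convert the PDF once per strategy and compare the results side by side. The following is only a minimal sketch: the helper names (`output_path`, `count_table_rows`, `compare_strategies`) and the output naming scheme are made up for illustration, the strategy list assumes the names that PyMuPDF's table finder accepts ("lines_strict", "lines", "text"), and counting lines that start with `|` is a rough heuristic, not a quality measure.

```python
from pathlib import Path

# Assumption: these are the strategy names understood by PyMuPDF's table finder.
STRATEGIES = ("lines_strict", "lines", "text")


def output_path(pdf_path, strategy):
    """Hypothetical naming scheme: 'sample-11.pdf' -> 'sample-11.lines.md'."""
    return Path(pdf_path).with_suffix(f".{strategy}.md")


def count_table_rows(md_text):
    """Count lines that look like Markdown table rows (they start with '|')."""
    return sum(1 for line in md_text.splitlines() if line.lstrip().startswith("|"))


def compare_strategies(pdf_path, strategies=STRATEGIES):
    """Convert the PDF once per strategy, write each result to its own file,
    and return a dict mapping strategy name to the number of table-like rows."""
    import pymupdf4llm  # imported lazily so the helpers above work without it

    counts = {}
    for strategy in strategies:
        md_text = pymupdf4llm.to_markdown(pdf_path, table_strategy=strategy)
        output_path(pdf_path, strategy).write_bytes(md_text.encode())
        counts[strategy] = count_table_rows(md_text)
    return counts
```

Calling `compare_strategies("sample-11.pdf")` would write one Markdown file per strategy next to the PDF. The row counts only hint at where tables were detected, so the files still need to be checked by eye.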
Thank you so much for the prompt response.
Hi, first of all, thank you for this wonderful tool. There is one issue, though: when extracting some complex tables, the format is not preserved.
I am attaching my code, extracted markdown text and sample pages as well.
`import pathlib
import pymupdf4llm
md_text = pymupdf4llm.to_markdown("sample-11.pdf")
pathlib.Path("sample-11.md").write_bytes(md_text.encode())`
Please let me know how to solve this.
Thank you
sample-8.md sample-8.pdf sample-11.md sample-11.pdf