Closed yewool0818 closed 4 months ago
Please provide the example PDF. Otherwise there is no way to deal with the problem.
@JorjMcKie
sample_pdf_data.pdf is the sample PDF file.
It looks like this:
But after converting to Markdown, the code lines have disappeared:
# convert PDF to Markdown Test
This is my code.
result
but code is not recognized.
I used this code for the conversion:
import re

import pymupdf4llm


def pdf_to_markdown(pdf_file, t_md_file):
    md_text = pymupdf4llm.to_markdown(pdf_file)
    print(md_text)
    # Split the text into lines
    lines = md_text.split('\n')
    # Patterns that must match an entire line
    patterns = [r"^\*\*(\d+) ([^\*]+)\*\* \*\*(\d+)\*\*$", r"^-----$"]
    # Drop every line that matches one of the patterns
    filtered_lines = [line for line in lines if not any(re.match(pattern, line) for pattern in patterns)]
    # Join the filtered lines back into a single string
    filtered_text = '\n'.join(filtered_lines)
    with open(t_md_file, 'w', encoding='utf-8') as f:
        f.write(filtered_text + '\n')
Thank you for the file. This is an interesting situation.
The new version has improved logic to differentiate between vector graphics that are simply background shading for text (and thus insignificant) and something potentially significant, like Gantt charts.
This is done by checking whether only "fill" graphics are present. If there are "stroke" type graphics, they are assumed to be potentially significant drawings.
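As a rough illustration of that check (a sketch only, not the library's actual implementation), the fill/stroke distinction can be inspected with PyMuPDF's `Page.get_drawings()`, where each path dictionary has a `"type"` of `"f"` (fill), `"s"` (stroke) or `"fs"` (both). The helper names `classify_drawings` and `classify_page` are made up for this example, and `import pymupdf` assumes a recent PyMuPDF release (older ones are imported as `fitz`).

```python
def classify_drawings(drawings):
    """Return 'none', 'fill-only' or 'has-stroke' for a list of path dicts.

    Each dict is expected to carry a "type" key as returned by PyMuPDF's
    Page.get_drawings(): "f" (fill), "s" (stroke) or "fs" (fill and stroke).
    """
    if not drawings:
        return "none"
    # Any stroked path (a border, a line, ...) makes the page's graphics
    # potentially significant in the sense described above.
    if any("s" in d.get("type", "") for d in drawings):
        return "has-stroke"
    return "fill-only"


def classify_page(pdf_path, page_number=0):
    """Classify the vector graphics on one page of a PDF file."""
    import pymupdf  # imported lazily so classify_drawings works without it

    with pymupdf.open(pdf_path) as doc:
        return classify_drawings(doc[page_number].get_drawings())
```

Running `classify_page("test.pdf")` on a file with bordered code boxes would be expected to report `"has-stroke"`, while pure background shading would come out as `"fill-only"`.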
Your example contains shaded boxes that also have stroked lines (the borders around your code pieces) with rounded corners. Because of the borders, there is no way to determine that these drawings are insignificant and can really be ignored.
If you remove the borders of the shaded code pieces, things will work.
What will also work is this:
import pathlib
import pymupdf4llm
print(f"{pymupdf4llm.version=}")
md = pymupdf4llm.to_markdown("test.pdf", write_images=True)
pathlib.Path("test.md").write_bytes(md.encode())
The resulting markdown looks like this:
# convert PDF to Markdown Test
This is my code.
![test.pdf-0-0.png](test.pdf-0-0.png)
result
![test.pdf-0-1.png](test.pdf-0-1.png)
but code is not recognized.
-----
In other words: The code pieces are interpreted as significant graphics that must be converted to images. When rendering the markdown file, it will look like this:
Hello pymupdf4llm Maintainers,
I've recently upgraded to version 0.0.6 of the pymupdf4llm library and encountered an issue where the library fails to recognize source code within PDF files. In the previous version (0.0.5), the source code in PDFs was recognized and processed without any issues.
Issue Details
Environment
Steps to Reproduce
Expected Behavior
The library should recognize and extract the source code from the PDF.
Actual Behavior
The library does not recognize the source code in the PDF.
Additional Information
The issue was not present in version 0.0.5. No changes were made to the PDF files between tests.
I would appreciate any guidance on this issue or any updates regarding a potential fix. Thank you for your attention to this matter.