Source Code Not Recognized in PDF Files in Version 0.0.6

yewool0818 commented 4 months ago

Hello pymupdf4llm Maintainers,

I've recently upgraded to version 0.0.6 of the pymupdf4llm library and encountered an issue where the library fails to recognize source code within PDF files. In the previous version (0.0.5), the source code in PDFs was recognized and processed without any issues.

Issue Details

Environment

Windows
Python 3.12

Steps to Reproduce

Upgrade pymupdf4llm to version 0.0.6.
Load a PDF file containing source code.
Attempt to process the file using the library's relevant functions.

Expected Behavior

The library should recognize and extract the source code from the PDF.

Actual Behavior

The library does not recognize the source code in the PDF.

Additional Information

The issue was not present in version 0.0.5. No changes were made to the PDF files between tests.

I would appreciate any guidance on this issue or any updates regarding a potential fix. Thank you for your attention to this matter.

JorjMcKie commented 4 months ago

Please provide the example PDF. Otherwise there is no way to deal with the problem.

yewool0818 commented 4 months ago

@JorjMcKie

sample_pdf_data.pdf This is the sample pdf file.

It looks like this:

But after converting to Markdown, the code lines have disappeared:

# convert PDF to Markdown Test

This is my code.

result

but code is not recognized.

I used this code for the conversion:

import re

import pymupdf4llm


def pdf_to_markdown(pdf_file, t_md_file):
    md_text = pymupdf4llm.to_markdown(pdf_file)
    print(md_text)
    # Split the text into lines
    lines = md_text.split('\n')

    # Patterns to match entire line
    patterns = [r"^\*\*(\d+) ([^\*]+)\*\* \*\*(\d+)\*\*$", r"^-----$"]

    # Filter lines
    filtered_lines = [line for line in lines if not any(re.match(pattern, line) for pattern in patterns)]

    # Join the filtered lines back into a single string
    filtered_text = '\n'.join(filtered_lines)
    with open(t_md_file, 'w', encoding='utf-8') as f:
        f.write(filtered_text + '\n')
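
To make clear what the two patterns in the function above filter out, here is a small self-contained sketch that applies the same filter to a made-up Markdown string. The sample text and the helper name `drop_matching_lines` are assumptions for illustration only. The first pattern appears to target bold header lines of the form `**<number> <title>** **<number>**`, and the second the `-----` page separator.

```python
import re

# Same two patterns as in the function above.
PATTERNS = [r"^\*\*(\d+) ([^\*]+)\*\* \*\*(\d+)\*\*$", r"^-----$"]


def drop_matching_lines(md_text: str) -> str:
    """Remove every line that fully matches one of PATTERNS."""
    kept = [
        line
        for line in md_text.split("\n")
        if not any(re.match(p, line) for p in PATTERNS)
    ]
    return "\n".join(kept)


# Hypothetical input: a header line, body text, and a page separator.
sample = "**3 Introduction** **12**\nThis is my code.\n-----\nresult"
print(drop_matching_lines(sample))
```

Note that this filter only removes header and separator lines. It has no effect on whether the code in the PDF is recognized.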

JorjMcKie commented 4 months ago

Thank you for the file. This is an interesting situation.

The new version has improved logic to differentiate between vector graphics that are simply background shading for text (and thus insignificant) and something potentially significant, like Gantt charts.

This is done by checking whether only "fill" graphics are present. If there are "stroke" type graphics, they are assumed to be potentially significant drawings.

Your example contains shaded boxes that also have stroked lines (the borders around your code pieces) with rounded corners. Because of the borders, there is no way to determine that these drawings are insignificant and can really be ignored.
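
As a rough illustration of the fill-versus-stroke check described above, here is a minimal sketch. It is not the actual pymupdf4llm implementation. It assumes drawing records shaped like the dictionaries returned by PyMuPDF's `Page.get_drawings()`, where the `"type"` key is `"f"` for fill, `"s"` for stroke, and `"fs"` for both. The function name `is_background_only` and the sample records are hypothetical.

```python
from typing import Iterable, Mapping


def is_background_only(drawings: Iterable[Mapping]) -> bool:
    """Return True if every drawing is fill-only, i.e. has no stroked path."""
    # Any "s" or "fs" entry means a stroked line is present, so the
    # drawings are treated as potentially significant.
    return all(d.get("type") == "f" for d in drawings)


# Hypothetical records, reduced to the one key this sketch needs.
plain_shading = [{"type": "f"}]   # shaded box without a border
bordered_box = [{"type": "fs"}]   # shaded box with a stroked border

print(is_background_only(plain_shading))  # True
print(is_background_only(bordered_box))   # False
```

Under this reading, a shaded code box with a border falls into the second case, which matches the behavior reported for the sample PDF.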

If you remove the borders of the shaded code pieces, things will work.

The following will also work:

import pathlib

import pymupdf4llm

print(f"{pymupdf4llm.version=}")
md = pymupdf4llm.to_markdown("test.pdf", write_images=True)
pathlib.Path("test.md").write_bytes(md.encode())

The resulting markdown looks like this:

# convert PDF to Markdown Test

This is my code.

![test.pdf-0-0.png](test.pdf-0-0.png)

result

![test.pdf-0-1.png](test.pdf-0-1.png)

but code is not recognized.

-----

In other words: The code pieces are interpreted as significant graphics that must be converted to images. When rendering the markdown file, it will look like this:
