multi column pdf file text extraction

sanketpatel91 commented 1 month ago

Hello, I am reaching out regarding my recent experience with pymupdf4llm. I have a PDF file that was created from a PowerPoint presentation, and I am attempting to extract specific text elements from it.

PDF content:

Text 1

sub text 1.1
sub text 1.2

Text 2

sub text 2.1
sub text 2.2

I am currently using the following code to read the PDF file:

import pymupdf4llm

# filename is the path to the PDF; page_chunks=True returns one dict per page
all_pages_pdf = pymupdf4llm.to_markdown(filename, page_chunks=True)
for page in all_pages_pdf:
    page_number = page['metadata']['page']
    page_content = page['text']
    print(page_number)
    print(page_content)

Actual output with v0.0.10:

Text 1 Text 2

sub text 1.1
sub text 1.2
sub text 2.1
sub text 2.2

However, I am aiming for the following output:

Text 1

sub text 1.1
sub text 1.2

Text 2

sub text 2.1
sub text 2.2
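To make the ordering I am after concrete, here is a minimal sketch in plain Python. It is not part of pymupdf4llm: `column_order` and its `gap` threshold are hypothetical names, and the coordinates are made up to mimic two side-by-side text boxes on a slide. Each tuple is `(x0, y0, x1, y1, text)`, similar in shape to the leading fields of the blocks that PyMuPDF's `page.get_text("blocks")` returns.

```python
from typing import List, Tuple

# (x0, y0, x1, y1, text): hypothetical bounding boxes, not taken from a real PDF
Block = Tuple[float, float, float, float, str]


def column_order(blocks: List[Block], gap: float = 50.0) -> List[str]:
    """Group blocks into columns by left edge, then read each column top to bottom."""
    columns: List[List[Block]] = []
    for block in sorted(blocks, key=lambda b: b[0]):
        # Same column if the left edge is within `gap` of the column's first block.
        if columns and block[0] - columns[-1][0][0] <= gap:
            columns[-1].append(block)
        else:
            columns.append([block])

    ordered: List[str] = []
    for column in columns:
        # Within one column, read from top (small y0) to bottom.
        ordered.extend(b[4] for b in sorted(column, key=lambda b: b[1]))
    return ordered


blocks = [
    (50, 40, 300, 70, "Text 1"),
    (400, 40, 650, 70, "Text 2"),
    (60, 90, 300, 110, "sub text 1.1"),
    (60, 120, 300, 140, "sub text 1.2"),
    (410, 90, 650, 110, "sub text 2.1"),
    (410, 120, 650, 140, "sub text 2.2"),
]

for text in column_order(blocks):
    print(text)
```

A fixed `gap` is only an assumption that happens to suit this simple two-column layout; real documents with headers spanning both columns would need something more careful.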

I would appreciate any guidance or assistance in achieving the desired output. Thank you for your attention to this matter.

JorjMcKie commented 1 month ago

Without an example file there is no way to deal with this issue.

ITHealer commented 1 month ago

> Without an example file there is no way to deal with this issue.

paper02.pdf

When I use pymupdf4llm to extract text from the attached paper, the output is not correct. Do you have any idea how to fix this problem?

Thanks!

ITHealer commented 3 weeks ago

@JorjMcKie

afonsoguerra commented 3 weeks ago

I'm also seeing issues with multi-column PDFs. Just try to extract anything from Science Magazine (for copyright reasons I can't paste any here). It works great for simple one-column PDFs though, so thanks for that.

JorjMcKie commented 3 weeks ago

@ITHealer Why do you append your issue to another one that has nothing to do with your topic?

Please open a separate issue so I can deal appropriately with the original post.

pymupdf / RAG

multi column pdf file text extraction #78