pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

How to extract pdf page text line by line? #3550

Closed mikejokic closed 3 weeks ago

mikejokic commented 4 weeks ago

I am trying to extract PDF text line by line.

I have tried the following on this file:

UMNwriteup.pdf

import fitz  # PyMuPDF

doc = fitz.open("UMNwriteup.pdf")
page = doc.load_page(0)  # first page

Option 1: page.get_text('text').split("\n")

This breaks some lines into chunks: when the spacing between words in one sentence is too large, a newline character is inserted between them.

Option 2: page.get_text('blocks')

This is closer to what I'm looking for, but some chunks (multi-line sentences) are grouped together into a single block.

Option 3


dictionary_elements = page.get_text('dict')
for block in dictionary_elements['blocks']:
    # line_text is reset once per block, so it accumulates every line of the block
    line_text = ''
    for line in block['lines']:
        for span in line['spans']:
            line_text += ' ' + span['text']

This results in output similar to option 2.
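For comparison, here is a minimal sketch of what Option 3 might look like if the text were collected once per line entry instead of once per block. The helper name `lines_from_dict` and the sample dictionary are hypothetical, and joining spans without a separator assumes the spaces are already part of each span's text. Since this follows the library's own line segmentation, it would probably still split widely spaced words the same way Option 1 does.

```python
def lines_from_dict(page_dict):
    """Return one string per line entry of a get_text('dict')-style dictionary."""
    result = []
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so fall back to an empty list.
        for line in block.get("lines", []):
            # Assumption: spans already contain their own spaces.
            result.append("".join(span["text"] for span in line["spans"]))
    return result


# Hypothetical sample shaped like the output of page.get_text('dict').
sample = {
    "blocks": [
        {"type": 0, "lines": [
            {"spans": [{"text": "Hello "}, {"text": "world"}]},
            {"spans": [{"text": "Second line"}]},
        ]},
        {"type": 1},  # image block without text lines
    ]
}

for text in lines_from_dict(sample):
    print(text)
```

With a real page, the call would be `lines_from_dict(page.get_text('dict'))`.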

So how do I extract text line by line, without any chunking into blocks behind the scenes?

If I could prevent newline characters from being inserted between two words that are separated by a wide gap but sit at the same bbox height, that would solve this for me.

Hi @JorjMcKie, thanks for any help.

JorjMcKie commented 3 weeks ago

This is not a bug, but there is a way to get correct results. Please continue in the Discussions tab.
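The thread does not say which approach was meant. One possibility, sketched here as an assumption rather than the maintainer's answer, is to take the word tuples from `page.get_text("words")` (each starts with `x0, y0, x1, y1, text`) and group words whose bottom coordinates are close together, regardless of horizontal gaps. The helper names `group_words_into_lines` and `page_lines` and the `y_tolerance` value are hypothetical and would need tuning per document.

```python
def group_words_into_lines(words, y_tolerance=3.0):
    """Group word tuples (x0, y0, x1, y1, text, ...) into visual lines.

    Words whose bottom coordinate (y1) lies within y_tolerance of the
    first word of the current row are treated as being on the same line.
    """
    rows = []  # each entry: [reference_y1, [word tuples]]
    for w in sorted(words, key=lambda w: (w[3], w[0])):
        if rows and abs(w[3] - rows[-1][0]) <= y_tolerance:
            rows[-1][1].append(w)
        else:
            rows.append([w[3], [w]])
    # Within each row, order words left to right and join them with spaces.
    return [
        " ".join(w[4] for w in sorted(row_words, key=lambda w: w[0]))
        for _, row_words in rows
    ]


def page_lines(path, page_number=0, y_tolerance=3.0):
    """Hypothetical wrapper: read one page with PyMuPDF and return its lines."""
    import fitz  # PyMuPDF, imported lazily so the grouping helper works without it

    doc = fitz.open(path)
    page = doc.load_page(page_number)
    return group_words_into_lines(page.get_text("words"), y_tolerance)


# Made-up word tuples: "Hello" and "world" share a line despite a wide gap.
sample_words = [
    (200, 100.5, 240, 112.4, "world", 1, 0, 0),
    (50, 100, 90, 112, "Hello", 0, 0, 0),
    (50, 120, 90, 132, "Next", 2, 0, 0),
]
print(group_words_into_lines(sample_words))
```

For the file in the question, the call would be `page_lines("UMNwriteup.pdf")`. Grouping by a fixed tolerance is a simplification: multi-column layouts, rotated text, or superscripts may need extra handling.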