pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

Wrong PDF Text Order #138

Closed: WtfJoke closed this issue 6 years ago

WtfJoke commented 6 years ago

I want to extract the text of this pdf: Wochenkarte-KW-15-Neu.pdf

The extracted text comes out mixed up, with lines ending up in the wrong places.

Current Result:

Montag 09.04.2018 
Menü 1 

Kl. Salat 

Menü 2 

Kl. Salat 

Seelachs-Spinat-Türmchen mit Spinat-
Masalla-Sauce und Reis 
Currywurst mit Pommes 

Expected Result:

Montag 09.04.2018 
Menü 1 

Kl. Salat 
Seelachs-Spinat-Türmchen mit Spinat-
Masalla-Sauce und Reis 

Menü 2 

Kl. Salat 
Currywurst mit Pommes 

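I use the following code to get the text of the PDF:

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    resource_manager = PDFResourceManager()
    device = None
    try:
        with StringIO() as string_writer, open(path, 'rb') as pdf_file:
            # LAParams() uses the default layout analysis settings
            device = TextConverter(resource_manager, string_writer, codec='utf-8', laparams=LAParams())
            interpreter = PDFPageInterpreter(resource_manager, device)

            # Only the first page is processed
            for page in PDFPage.get_pages(pdf_file, maxpages=1):
                interpreter.process_page(page)

            pdf_text = string_writer.getvalue()
    finally:
        if device:
            device.close()
    return pdf_text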

Do you have any idea how I can get the text in the right order?

timb07 commented 6 years ago

I've been looking at the layout code recently, so while I'm no expert on pdfminer, I think I've found a solution to your issue. See this post for a similar issue; note that the code changes proposed in the response seem to have been made already.

Using the command line pdf2txt.py wrapper script, I think the option -L 0.1 gives what you're looking for.

For your code, you'd need to provide an LAParams object with the desired setting when creating device; something like: laparams = pdfminer.layout.LAParams(line_margin=0.1). (I'm not sure where create_text_converter is defined, but it would probably need to take laparams as a parameter.)
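A minimal sketch of that suggestion, assuming the pdfminer.six package is installed. The function name extract_first_page_text is hypothetical and only used for illustration; the value 0.1 is the one suggested above and may need tuning for other documents.

```python
from io import StringIO


def extract_first_page_text(path, line_margin=0.1):
    """Return the text of the first page, using a custom line_margin."""
    # Imported inside the function so this sketch can be loaded even when
    # pdfminer.six is not installed; normally these go at module level.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    resource_manager = PDFResourceManager()
    # Pass the desired layout setting when creating the device.
    laparams = LAParams(line_margin=line_margin)

    with StringIO() as string_writer, open(path, "rb") as pdf_file:
        device = TextConverter(resource_manager, string_writer, laparams=laparams)
        try:
            interpreter = PDFPageInterpreter(resource_manager, device)
            for page in PDFPage.get_pages(pdf_file, maxpages=1):
                interpreter.process_page(page)
            return string_writer.getvalue()
        finally:
            device.close()
```

Calling extract_first_page_text('Wochenkarte-KW-15-Neu.pdf') would then be the code equivalent of running pdf2txt.py with -L 0.1 on the first page.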

WtfJoke commented 6 years ago

Hi Tim

Thank you so much for your feedback/answer. 😄 It worked like that! 🎉 Yeah, I forgot to include the create_text_converter method in my post; I've edited it. That's where I used laparams.

Thanks a lot!

TahirHameed74 commented 4 years ago

LAParams didn't work for me. I tried changing char_margin, word_margin, line_margin, and line_overlap, but the text order remains the same.

pietermarsman commented 4 years ago

Hi @TahirHameed74,

If you have a question, please open a new issue. Don't forget to include the pdf you are using, and the code / commands you are using.

uday-allu commented 4 years ago

The order of the text is mixed up, and lines end up in the wrong places:

Screenshot 2020-12-02 at 4 11 05 PM

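I'm using the following code:

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()
with open('/Users/udayallu/similarity_search_training/Pol_ProcHdbk1_23.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    # LAParams() uses the default layout analysis settings
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())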

OS: macOS
Python version: 3.7

Below is the PDF file:

Pol_ProcHdbk1_23.pdf