jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License
6.57k stars 659 forks

Between Two Words --bug #322

Closed 12jakubpavel closed 3 years ago

12jakubpavel commented 3 years ago

Describe the bug

When I convert a PDF file to text with pdfplumber, a piece of text that sits between two other pieces of text is attached to one or the other of them at random.

Code to reproduce the problem

${PDFText}=    Convert PDF To Text By Plumber    ${PDFText}
${PDFText}=    Join Spaces    ${PDFText}

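import pdfplumber

def get_pdf_text_by_plumber(path):
    # Concatenate the extracted text of every page
    text = ""
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            text += page.extract_text()
    return text

def join_spaces(text):
    # Collapse runs of spaces into a single space; newlines are left untouched
    return ' '.join(filter(None, text.split(' ')))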

PDF file

Plumber-Bug.pdf

Expected behavior

The text should be either "YYYY-MM-DD Planet Express" or "YYYY-MM-DD By their Authorized Signatory"

Actual behavior

The text is converted as either "YYYY-MM-DD Planet Express" or "YYYY-MM-DD By their Authorized Signatory"

Screenshots

[image]

Environment

samkit-jain commented 3 years ago

Hi @12jakubpavel, thanks for your interest in the library. Could you update the actual and expected behaviour sections? They are currently identical (the text in quotes). Also, running extract_text() on the first page results in

1. Execution(s) 
The transferor(s) accept(s) the above consideration and understand(s) that the instrument operates to transfer the freehold estate in the land described above to the 
transferee(s). 

Witnessing Officer Signature  Execution Date  Transferor Signature(s) 
  Planet Express 
YYYY-MM-DD 
By their Authorized Signatory 

__________________________________ 
2020-09-01 

Fred   Lawyer 
Barrister & Solicitor  __________________________________ 
123 Fred Street  Hubert
Fredsville BC V8W 9W6 

additionalInfo 

The

  Planet Express 
YYYY-MM-DD 
By their Authorized Signatory 

section looks correct to me. What would you expect it to be?

12jakubpavel commented 3 years ago

Hi @samkit-jain. The point is that the text is converted to either "YYYY-MM-DD Planet Express" or "YYYY-MM-DD By their Authorized Signatory". I need it to always produce the same result, only one of the two. It often happens that the first run of a test passes, but a second run of the same test with the same PDF fails.

samkit-jain commented 3 years ago

That's weird and unexpected. To clarify, you are saying that running the same piece of code on the same PDF multiple times yields different results? If so, could you share the .py file so that I can run it and reproduce the behaviour on my machine? The code you shared above is not properly formatted.

jsvine commented 3 years ago

Running print(pdfplumber.open("Plumber-Bug.pdf").pages[0].extract_text()), repeatedly on the same PDF, produces consistent results, as it did for you, @samkit-jain:

[...]
Witnessing Officer Signature  Execution Date  Transferor Signature(s)
  Planet Express
YYYY-MM-DD
By their Authorized Signatory
[...]

Perhaps this is an issue with the particular code/logic you're using @12jakubpavel? If not, feel free to reopen this issue with code that reproduces the problem. Thanks!

12jakubpavel commented 3 years ago

This weekend I will try converting the same PDF 100 times and send the results.
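
A minimal sketch of such a repeated-run check, in case it is useful to anyone reproducing this. It assumes the attachment is saved locally as Plumber-Bug.pdf; the helper names extract_first_page_text and tally_outputs are made up for illustration and are not part of pdfplumber. The extractor is passed in as a callable so that the same tally can be pointed at other code paths (for example, a Robot Framework keyword implementation).

```python
from collections import Counter
from typing import Callable


def extract_first_page_text(path: str) -> str:
    """Open the PDF from scratch and return the text of its first page."""
    import pdfplumber  # imported lazily so tally_outputs works without it

    with pdfplumber.open(path) as pdf:
        # Some pdfplumber versions return None for a page without text,
        # so fall back to an empty string.
        return pdf.pages[0].extract_text() or ""


def tally_outputs(extract: Callable[[], str], runs: int = 100) -> Counter:
    """Call `extract` `runs` times and count how often each result occurs."""
    return Counter(extract() for _ in range(runs))


# Example usage (requires pdfplumber and the PDF on disk):
#   counts = tally_outputs(lambda: extract_first_page_text("Plumber-Bug.pdf"))
#   print(len(counts), "distinct output(s) in", sum(counts.values()), "runs")
```

If extraction is deterministic, the counter holds exactly one key; more than one key would mean the same input produced different text across runs, and the keys themselves show which variants occurred.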

12jakubpavel commented 3 years ago

I used Robot Framework. Here is my code: [image]