Closed 12jakubpavel closed 3 years ago
Hi @12jakubpavel, I appreciate your interest in the library. Please update the actual and expected behaviour sections, as both currently contain the same text (the text in quotes). Also, running extract_text() on the first page results in
1. Execution(s)
The transferor(s) accept(s) the above consideration and understand(s) that the instrument operates to transfer the freehold estate in the land described above to the
transferee(s).
Witnessing Officer Signature Execution Date Transferor Signature(s)
Planet Express
YYYY-MM-DD
By their Authorized Signatory
__________________________________
2020-09-01
Fred Lawyer
Barrister & Solicitor __________________________________
123 Fred Street Hubert
Fredsville BC V8W 9W6
additionalInfo
The
Planet Express
YYYY-MM-DD
By their Authorized Signatory
section looks correct to me. What would you expect it to be?
Hi @samkit-jain. The point is that it sometimes converts the text to "YYYY-MM-DD Planet Express" and sometimes to "YYYY-MM-DD By their Authorized Signatory". I need it to always produce the same result, only one of them. It often happens that the first run of a test passes, but a second run of the same test with the same PDF fails.
That's weird and unexpected. To clarify, you are saying that running the same piece of code on the same PDF multiple times yields different results? If so, could you share the .py file so that I can run it and reproduce the behaviour on my machine? The code you shared above is not properly formatted.
Running print(pdfplumber.open("Plumber-Bug.pdf").pages[0].extract_text()) repeatedly on the same PDF produces consistent results, as it did for you, @samkit-jain:
[...]
Witnessing Officer Signature Execution Date Transferor Signature(s)
Planet Express
YYYY-MM-DD
By their Authorized Signatory
[...]
Perhaps this is an issue with the particular code/logic you're using @12jakubpavel? If not, feel free to reopen this issue with code that reproduces the problem. Thanks!
At the weekend I will convert the same PDF 100 times and send the results.
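A repeated-run check like the one described above could be sketched as follows. This is only a minimal illustration, not code from the thread: the helper names count_distinct_outputs and extract_first_page are hypothetical, and the only pdfplumber calls used are the ones already shown in this issue (pdfplumber.open, .pages, .extract_text()).

```python
from collections import Counter


def count_distinct_outputs(extract, runs=100):
    """Call extract() `runs` times and count how often each distinct result occurs."""
    counts = Counter()
    for _ in range(runs):
        counts[extract()] += 1
    return counts


def extract_first_page(path):
    """Open the PDF afresh and return the extracted text of its first page."""
    import pdfplumber  # imported lazily so the counting helper works without it

    with pdfplumber.open(path) as pdf:
        return pdf.pages[0].extract_text()
```

Calling count_distinct_outputs(lambda: extract_first_page("Plumber-Bug.pdf")) returns a Counter; if it has a single key, every run produced identical text. Note that this repeats the extraction inside one Python process, so a difference that only shows up between separate test runs would need the script itself to be launched several times.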
I used Robot Framework. Here is my code:
Describe the bug
When I convert a PDF file to text using the pdfplumber library, text that lies between two other words is converted inconsistently, seemingly at random.
Code to reproduce the problem
${PDFText}=    Convert PDF To Text By Plumber    ${PDFText}
${PDFText}=    Join Spaces    ${PDFText}
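import pdfplumber


def get_pdf_text_by_plumber(path):
    # Concatenate the extracted text of every page in the PDF.
    text = ""
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            text += page.extract_text()
    return text


def join_spaces(text):
    # Collapse runs of spaces into one; newlines are left untouched.
    return ' '.join(filter(None, text.split(' ')))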
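To illustrate what join_spaces does to the extracted lines quoted in this thread, here is a small sketch. The sample string is assembled by hand from the output shown above, with a double space added for demonstration, and normalize_whitespace is a hypothetical alternative that is not part of the reported code.

```python
def join_spaces(text):
    # Same logic as in the report: splits on the space character only.
    return ' '.join(filter(None, text.split(' ')))


def normalize_whitespace(text):
    # Hypothetical alternative: str.split() with no argument splits on any
    # run of whitespace, so newlines are collapsed into single spaces too.
    return ' '.join(text.split())


sample = "Planet Express\nYYYY-MM-DD\nBy their  Authorized Signatory"
print(repr(join_spaces(sample)))
print(repr(normalize_whitespace(sample)))
```

join_spaces removes the doubled space but keeps both newlines, whereas normalize_whitespace puts everything on one line. A comparison against a single-line string such as "YYYY-MM-DD Planet Express" therefore depends on which of the two is applied to the text first.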
PDF file
Plumber-Bug.pdf
Expected behavior
The text should be either "YYYY-MM-DD Planet Express" or "YYYY-MM-DD By their Authorized Signatory".
Actual behavior
The text is converted either as "YYYY-MM-DD Planet Express" or as "YYYY-MM-DD By their Authorized Signatory".
Screenshots
If applicable, add screenshots to help explain your problem.
Environment