Closed: WtfJoke closed this issue 6 years ago
I've been looking at the layout code recently, so while I'm no expert on pdfminer, I think I've found a solution to your issue. See this post for a similar issue; note that the code changes suggested in the response seem to have been made.
Using the command-line pdf2txt.py wrapper script, I think the option -L 0.1 gives what you're looking for.
For your code, you'd need to provide an LAParams object with the desired setting when creating device; something like: laparams = pdfminer.layout.LAParams(line_margin=0.1). (I'm not sure where create_text_converter is defined, but it would probably need to take laparams as a parameter.)
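The suggestion above can be sketched roughly as follows. This is only an illustration, assuming pdfminer.six is installed; the function name extract_text_with_line_margin and the argument check are hypothetical and not part of pdfminer or of the original poster's code. The pdfminer classes used are the same ones that appear elsewhere in this thread.

```python
from io import StringIO


def extract_text_with_line_margin(pdf_path, line_margin=0.1):
    """Return the text of pdf_path, laid out with the given line_margin.

    Hypothetical helper for illustration; not part of pdfminer itself.
    """
    if line_margin < 0:
        raise ValueError("line_margin must be non-negative")

    # pdfminer is imported here so that the function can be defined (and its
    # argument check exercised) even where pdfminer.six is not installed.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    # The LAParams object carries the layout settings; it is handed to the
    # device when the device is created.
    laparams = LAParams(line_margin=line_margin)

    output = StringIO()
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    with open(pdf_path, "rb") as in_file:
        for page in PDFPage.get_pages(in_file):
            interpreter.process_page(page)

    device.close()
    return output.getvalue()
```

Calling extract_text_with_line_margin("some.pdf") (with a path of your own) should be roughly comparable to running pdf2txt.py -L 0.1 on the same file, although whether 0.1 is the right value depends on the document.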
Hi Tim,
Thank you so much for your feedback and answer. 😄 It worked like that! 🎉 Yes, I forgot to include the create_text_converter method in my post; I've edited it in now. That is where I use laparams.
Thanks a lot!
LAParams didn't work for me. I tried changing char_margin, word_margin, line_margin, and line_overlap, but the text order remains the same.
Hi @TahirHameed74,
If you have a question, please open a new issue. Don't forget to include the PDF you are using and the code or commands you are running.
The order of the text is mixed up, and pieces of text end up in the wrong places.
I'm using the following code:
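from io import StringIO
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()
with open('/Users/udayallu/similarity_search_training/Pol_ProcHdbk1_23.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    # Default layout parameters; pass e.g. LAParams(line_margin=...) to tune them.
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)
print(output_string.getvalue())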
OS: macOS. Python version: 3.7. Below is the PDF file:
I want to extract the text of this PDF: Wochenkarte-KW-15-Neu.pdf
The text I get is mixed up, with pieces appearing in the wrong places. Current Result:
Expected Result:
I use the following code to get the text of the PDF:
Do you have any idea how I can get the text in the right order?