pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

**Order of the text is mixed up and text is found in the wrong places:** #557

Open uday-allu opened 3 years ago

uday-allu commented 3 years ago

Order of the text is mixed up and text is found in the wrong places:

Screenshot 2020-12-02 at 4 11 05 PM

I'm using the following code:

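from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()
with open('/Users/udayallu/similarity_search_training/Pol_ProcHdbk1_23.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    # Default LAParams() enables layout analysis, which decides the text order.
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())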

OS: macOS
Python version: 3.7

Below is the PDF file:

Pol_ProcHdbk1_23.pdf

Originally posted by @uday-allu in https://github.com/pdfminer/pdfminer.six/issues/138#issuecomment-737148670

pietermarsman commented 3 years ago

Hi!

I'm getting this output:

EMPLOYMENT 

directly impacted. The determination of when any of these events has occurred rests solely with 
the administration of the University. To view the entire Staff Reduction in Workforce Policy, 
please click here. For information on additional HR policies, please refer to the online University 
wide policy site at:  http://www.udayton.edu/policies/hr. 

Section 2 

Separation from Service 

Non-faculty employees of the University of Dayton are employed with an “at will” status. 

Employees are not employed for any definite term and either party for any reason, with or 
without cause, may terminate the employment relationship at any time. Only the President of the 
University (or the board of trustees) has authority to enter into any Agreement for employment 
for any specified period of time or to make any agreement contrary to the foregoing. 

Voluntary 

Employee Responsibility: 

 Upon resignation, all employees are requested to submit a written letter of resignation to
their immediate supervisor and the Office of Human Resources prior to their last day of
employment.

 Exempt Positions - at least four working weeks of notice prior to the date of separation

 Nonexempt Positions - at least two working weeks of notice prior to the date of

from service.

separation from service.

 The employee is asked to schedule a personal exit interview with his/her Human

Resources Generalist and complete an Exit Interview Questionnaire prior to leaving the
University.

o The terminating employee will need to bring the exit interview questionnaire to
the exit interview or mail the form to the following address: Office of Human
Resources, University of Dayton, 300 College Park, Dayton, Ohio 45469-1614,
Attn: Staffing Department.

Supervisor Responsibility: 

 Upon receiving written notification that an employee is leaving, the supervisor is

responsible for promptly completing the Personnel Action Form (PAF).

o Personal Action Form (PAF) and resignation letter are forwarded to the Office of

Human Resources to start the exit process.

 Before the employee separates from the University, the supervisor is responsible for
completing the Employee Separation Checklist (pdf), which requires collection of
University property, identification card, etc., and forwarding the completed form, along
with the separating employees ID card, to the Office of Human Resources as the final
step in the separation process.

Human Resources Responsibility: 

17 



That looks pretty good to me. Could you clarify what you would like to have improved?

uday-allu commented 3 years ago

Hi, I ran the code above and the first paragraph is missing from my output. The missing text is:

EMPLOYMENT directly impacted. The determination of when any of these events has occurred rests solely with the administration of the University. To view the entire Staff Reduction in Workforce Policy, please click here. For information on additional HR policies, please refer to the online University wide policy site at: http://www.udayton.edu/policies/hr.

pietermarsman commented 3 years ago

Which version of pdfminer.six are you using?

uday-allu commented 3 years ago

20201018

Hund commented 3 years ago

I just installed pdfminer.six-20201018 and sortedcontainers-2.4.0 via pip. I also have this issue:

PDF-document:

This chapter provides just enough information to edit a file with Vim. Not well or fast, but you can edit. Take some time to practice with these commands, they form the base for what follows.

Output:

This chapter provides just enough information to edit a file with Vim. Take some time to practice with these well or fast, but you can edit. commands, they form the base for what follows.

Not

It's from the document "VIM USER MANUAL" by Bram Moolenaar.
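The symptom above (pieces of a line attached to the wrong line) can be illustrated without a PDF. The following is a minimal sketch, not pdfminer.six's own algorithm: it assumes you already have text fragments with their left edge and baseline coordinates, groups fragments whose baselines are within a tolerance, and orders each group left to right. The function name `rebuild_lines`, the `y_tolerance` parameter, and all coordinates are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# (x0, y0, text); in PDF coordinates y grows upward.
Fragment = Tuple[float, float, str]


def rebuild_lines(fragments: List[Fragment], y_tolerance: float = 2.0) -> List[str]:
    """Group fragments that share a baseline, then order each group left to right."""
    lines: List[List[Fragment]] = []
    # Visit fragments from the top of the page down.
    for frag in sorted(fragments, key=lambda f: -f[1]):
        # Compare against the baseline of the first fragment in the current line.
        if lines and abs(lines[-1][0][1] - frag[1]) <= y_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Sorting the tuples orders each line by x0, i.e. left to right.
    return [" ".join(text for _, _, text in sorted(line)) for line in lines]


# Made-up coordinates that mimic the scrambled paragraph from this comment.
fragments = [
    (72.0, 700.0, "This chapter provides just enough information to edit a file with Vim."),
    (420.0, 700.5, "Not"),
    (72.0, 686.0, "well or fast, but you can edit."),
    (230.0, 686.0, "Take some time to practice with these"),
    (72.0, 672.0, "commands, they form the base for what follows."),
]

for line in rebuild_lines(fragments):
    print(line)
```

A tolerance that is too large would merge adjacent lines, and one that is too small would split a line whose fragments sit on slightly different baselines, so the value would need tuning per document.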

AnasAG commented 3 years ago

Similar issue, both in LTR and RTL (Arabic) languages.

ZafarShadman09 commented 3 years ago

laparams=LAParams(boxes_flow=None)

This doesn't fix the problem.
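For context, the pdfminer.six documentation describes `boxes_flow=None` as disabling the advanced layout analysis, so that text boxes are returned based on the position of their bottom-left corner. If that order is still wrong, one possible workaround is to collect the text boxes and sort them yourself. The sketch below assumes a single-column page where top-to-bottom, then left-to-right, is the intended reading order; it will not untangle multi-column layouts. The helper names `reading_order` and `extract_in_reading_order` are hypothetical, while `extract_pages`, `LAParams`, and `LTTextContainer` are existing pdfminer.six names.

```python
from typing import Iterable, List, Tuple

# (x0, y0, x1, y1); in PDF coordinates y grows upward.
BBox = Tuple[float, float, float, float]


def reading_order(boxes: Iterable[Tuple[BBox, str]]) -> List[str]:
    """Sort (bbox, text) pairs by top edge (highest first), then by left edge."""
    ordered = sorted(boxes, key=lambda item: (-item[0][3], item[0][0]))
    return [text for _, text in ordered]


def extract_in_reading_order(pdf_path: str) -> str:
    """Extract text page by page, re-sorting the text boxes on each page."""
    # Imported here so reading_order() can be tried without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTTextContainer

    chunks: List[str] = []
    for page_layout in extract_pages(pdf_path, laparams=LAParams(boxes_flow=None)):
        boxes = [
            (element.bbox, element.get_text())
            for element in page_layout
            if isinstance(element, LTTextContainer)
        ]
        chunks.extend(reading_order(boxes))
    return "".join(chunks)
```

Calling `extract_in_reading_order("Pol_ProcHdbk1_23.pdf")` would apply this to the file from the original report. Whether it recovers the missing first paragraph has not been verified here.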