jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Layout Detection similar to pdfminer.six #476

Closed jigsawcoder closed 3 years ago

jigsawcoder commented 3 years ago

I am working on a project to extract relevant information from resumes in digital PDF format. I tried pdfplumber and pdfminer.six separately and noticed a difference in how they handle the page layout.

Sample Resume :

KSAnandMurthy[12_0].pdf

While using pdfminer.six :-

Code : !python tools/pdf2txt.py /content/drive/MyDrive/resume_pdf_extractor/all_resumes_pdf/KSAnandMurthy[12_0].pdf --line-margin 0.35 #--word-margin 0.5

Output :

ANAND MURTHY

Anand is a great human being first and an Excellent Sales and Customer support professional. Finding the right mix of Customer support and Sales is difficult and Anand carries these skills well. Once a Sale or a Customer support query is given to him consider it done. His brilliant understanding of the market, his effort, market research and willingness to learn has earned him lot of accolades. I will like to work with him and strongly recommend him

Ashok Sr.Manager Head –Global Paid service, Farnell element14

I was quite amazed at the energy and "can do it" attitude displayed by Anand when given an additional new role during the Incubation of TobocBiz unit at Toboc International; he was always ready to go the extra mile, to make the new unit a big success without compromising his regular full time activities as head of CSR. Wishing Anand all the best in his future endeavours

SreejeshMadonandyPrincipal Program Manager at Amazon

“Planned, Organized and Hardworking are few qualities that wouldbest define Anand as a true professional. It a pleasure workingtogether to add value to the organization as a team. A great teamplayer ever ready to help the team whenever needed going thatextra mile. Wish him luck for all his professional milestones.

BAISAKHI SAHA, Campaign Management, Demand Generation, Digital Marketing, Marketing Automation, Analytics, and Events

Anand Murthi is a very committed and hardworking professional who always strives to complete tasks with utmost quality. He was punctual in his work and always respected the deadlines of projects. He is a great colleague and a team mate to work with and possessed avery kind and humble personality

MunniSankar, Director – Operation Lu Rui & Co

An accomplished result-driven sales professional with over 13+ years of sales experience focused on SME and Enterprise accounts. Proven ability to manage complex sales and worked as part of a cross-functional team, Develop new account 0 and efforts to achieve business development goals.

Key Skills: Cloud Sales, ERP Sales, Enterprise Sales, SaaS sales, software product sales, HCM sales, HRMS Sales, Channel Sales, Software Solution Sales, CRM Sales, Sales, software services sales, Field Sales. Compliance software

Avantis Regtech a division of Teamlease Company Oct 2019 to present (Till date)

• Responsible for revenue generation across Karnataka Market.

• Helping clients to automate their Regulatory Compliances.

Initiated contact with prospects and conducted follow-up calls to garner information and qualify leads.

• Expanded sales revenues by identifying.

Peopleworks–( a division of Crossdomain Solutions Pvt. Ltd) Manager Business Development, August 2013 – March 2019
▪ Selling SAAS based product on cloud platform
▪ Responsible for end to end sales in South India market
▪ Primary responsibilities include: Business Planning, Client Acquisition, Bulding Partnering Client requirement analysis, Process Planning, Product Training

▪ Establish PeopleWorks's brand image and credibility as a trusted aide and

Business Partner, within the client organisation.

▪ Developing and maintaining partner and prospective partner relationships ▪

Initiating and developing relationships with key decision makers / CXOs in target organizations to ensure a stronger and wider reach with the client.

▪ Setting team targets and helping them to drive their performance to achieve the

organisational goals

▪ Closely working with Digital Marketing team

TOBOC Internation, Team Leader, May 2010 – Mar 2013 ▪ Selling web application digital marketing service for SME customers
▪ Handling 4 people in Sales and 2 in Customer support team ▪ Selling Online membership and Website development and applications to B2B

▪ Handling 4 people in Sales and 2 in Customer support team ▪ Coordinate with cross functional teams including sales, engineering & support to

customers

generate new business.

Xora Software, Account Manager Representative, Oct 2007 – Mar 2009
▪ ERP selling for global market service to Sprint nextel & ATT customers
▪ Handling key accounts End to End (Sales to closure) ▪ Upselling to existing customer ▪ Provide client with the consolidated information master to quickly access

components, such as products, quotations, order confirmations, and samples, presented in an intelligible format

▪ Conducted awareness sessions and the importance of GPS Navigation system ▪ Support collaborative product development by sharing product information – in real

time

▪ Updating sales activity on NetSuite

PROTERON, Senior Sales Representative, Dec 2004 – 2007

▪ Selling Mobiles and Credit Card in International Market ▪ Maintain consistent QA parameters from client satisfaction, feedback and service

delivery

EDUCATION

▪ Bcom, Periyar University, Periyar University distance education ▪ XII – Commerce, S.M.C.P.U.C, Bangalore, Karnataka Pre University Board ▪ SSLC, Babuji .H.S, Bangalore, State Board

Anand Murthy Mobile: 9731600853

No. 2, Muniswamyappa layout Ulsoor, Bangalore - 560008.

ADDITIONAL Information refer my LinkedIn profile: Email Id: anand.murthy.ks@gmail.com in.linkedin.com/in/anandmurthyks/

While using pdfplumber :

Code :

import pdfplumber

# Path to the sample resume above; adjust to wherever your copy lives.
res_path = "KSAnandMurthy[12_0].pdf"

texts = []
table_list = []
with pdfplumber.open(res_path, laparams={"line_margin": 0.35, "word_margin": 0.05}) as pdf:
    for page in pdf.pages:
        texts.append(page.extract_text())
        # Use the page's rectangles as explicit cell boundaries for table detection.
        table = page.extract_table(table_settings={"explicit_vertical_lines": page.rects, "explicit_horizontal_lines": page.rects, "intersection_tolerance": 1})
        if table is not None:
            # Collect one table per page instead of overwriting the list.
            table_list.append(table)

# Flatten the per-page text into a single list of lines.
text_list = []
for text in texts:
    if text is not None:
        text_list.extend(text.split('\n'))

Output :

[' ', 'An accomplished result-driven sales professional with over 13+ years of sales ', ' ', 'ANAND MURTHY experience focused on SME and Enterprise accounts. Proven ability to manage ', 'complex sales and worked as part of a cross-functional team, Develop new account ', ' 0 ', 'and efforts to achieve business development goals. ', 'Anand is a great human being first ', 'and an Excellent Sales and ', ' ', 'Customer support professional. Key Skills: ', 'Finding the right mix of Customer Cloud Sales, ERP Sales, Enterprise Sales, SaaS sales, software product sales, HCM ', 'support and Sales is difficult and sales , HRMS Sales, Channel Sales, Software Solution Sales, CRM Sales, Sales, ', 'Anand carries these skills well. ', 'Once a Sale or a Customer software services sales, Field Sales. Compliance software ', 'support query is given to him ', 'consider it done. His brilliant Avantis Regtech a division of Teamlease Company ', 'understanding of the market, his ', 'effort, market research and Oct 2019 to present (Till date) ', 'willingness to learn has earned • Responsible for revenue generation across Karnataka Market. ', 'him lot of accolades. I will like to • Helping clients to automate their Regulatory Compliances. ', 'work with him and strongly ', '• Initiated contact with prospects and conducted follow-up calls to garner information ', 'recommend him ', ' and qualify leads. ', 'Ashok Sr.Manager Head –Global • Expanded sales revenues by identifying. ', 'Paid service, Farnell element14 ', ' ', 'I was quite amazed at the energy Peopleworks–( a division of Crossdomain Solutions Pvt. 
Ltd) Manager Business ', 'and "can do it" attitude displayed Development, August 2013 – March 2019 ', ' ', 'by Anand when given an ▪ Selling SAAS based product on cloud platform ', 'additional new role during the ▪ Responsible for end to end sales in South India market ', 'Incubation of TobocBiz unit at ▪ Primary responsibilities include: Business Planning, Client Acquisition, Bulding ', 'Toboc International; he was Partnering Client requirement analysis, Process Planning, Product Training ', "always ready to go the extra mile, ▪ Establish PeopleWorks's brand image and credibility as a trusted aide and ", 'to make the new unit a big Business Partner, within the client organisation. ', 'success without compromising his ▪ Developing and maintaining partner and prospective partner relationships ', 'regular full time activities as head ▪ Initiating and developing relationships with key decision makers / CXOs in target ', 'of CSR. Wishing Anand all the organizations to ensure a stronger and wider reach with the client. ', 'best in his future endeavours ▪ Setting team targets and helping them to drive their performance to achieve the ', 'organisational goals ', 'SreejeshMadonandyPrincipal ▪ Closely working with Digital Marketing team ', ' ', ' ', 'Program Manager at Amazon ', 'TOBOC Internation, Team Leader, May 2010 – Mar 2013 ', '“Planned, Organized and ▪ Selling web application digital marketing service for SME customers ', 'Hardworking are few qualities that ▪ Handling 4 people in Sales and 2 in Customer support team ', 'wouldbest define Anand as a true ▪ Selling Online membership and Website development and applications to B2B ', 'professional. It a pleasure customers ', 'workingtogether to add value to ▪ Handling 4 people in Sales and 2 in Customer support team ', 'the organization as a team. A ▪ Coordinate with cross functional teams including sales, engineering & support to ', 'generate new business. 
', 'great teamplayer ever ready to ', ' ', 'help the team whenever needed ', 'Xora Software, Account Manager Representative, Oct 2007 – Mar 2009 ', 'going thatextra mile. Wish him luck ', 'for all his professional milestones. ▪ ERP selling for global market service to Sprint nextel & ATT customers ', ' ', '▪ Handling key accounts End to End (Sales to closure) ', 'BAISAKHI SAHA, Campaign ▪ Upselling to existing customer ', ' ▪ Provide client with the consolidated information master to quickly access ', 'Management, Demand ', 'components, such as products, quotations, order confirmations, and samples, ', 'Generation, Digital Marketing, ', 'presented in an intelligible format ', 'Marketing Automation, Analytics, ', '▪ Conducted awareness sessions and the importance of GPS Navigation system ', 'and Events ', '▪ Support collaborative product development by sharing product information – in real ', ' time ', 'Anand Murthi is a very committed ', '▪ Updating sales activity on NetSuite ', 'and hardworking professional who ', ' ', 'always strives to complete tasks ', 'with utmost quality. He was ', 'PROTERON, Senior Sales Representative, Dec 2004 – 2007 ', 'punctual in his work and always ', ' ', 'respected the deadlines of ▪ Selling Mobiles and Credit Card in International Market ', 'projects. He is a great colleague ▪ Maintain consistent QA parameters from client satisfaction, feedback and service ', 'and a team mate to work with and delivery ', 'possessed avery kind and humble ', 'personality ', 'EDUCATION ', 'MunniSankar, Director – ▪ Bcom, Periyar University, Periyar University distance education ', 'Operation Lu Rui & Co ▪ XII – Commerce, S.M.C.P.U.C, Bangalore, Karnataka Pre University Board ', '▪ SSLC, Babuji .H.S, Bangalore, State Board ', ' Anan d Murthy Mobile: 9731600853 ', ' EAmDaDilI TIdIO: NanAaLn Idn.fmorumrathtiyo.nk sr@efegrm mayil .Lcionmke dIn profile: ', 'in.linkedin.com/in/anandmurthyks/ ', ' No. 2, Muniswamyappa layout Ulsoor, Bangalore - 560008. ']

Observation and Expectation :

I observed that pdfminer.six does a far better job of detecting the layout of the resume. I expected the same from pdfplumber, since it is built on top of pdfminer.six.

I think pdfplumber is parsing the text line by line across the full page width, rather than by layout as pdfminer.six does.
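To make the distinction concrete, here is a toy sketch of the two reading orders on a two-column page. It is not the actual algorithm of either library; the word data, the function names, and the `split_x` column boundary are all made up for illustration.

```python
from itertools import groupby

# Hypothetical words on a two-column page: (text, x0, top).
WORDS = [
    ("Left-1", 10, 100), ("Right-1", 300, 100),
    ("Left-2", 10, 120), ("Right-2", 300, 120),
]


def line_wise(words):
    """Read each row straight across the page, ignoring columns."""
    ordered = sorted(words, key=lambda w: (w[2], w[1]))
    return [
        " ".join(w[0] for w in row)
        for _, row in groupby(ordered, key=lambda w: w[2])
    ]


def layout_wise(words, split_x):
    """Read the left column top to bottom, then the right column."""
    left = sorted((w for w in words if w[1] < split_x), key=lambda w: w[2])
    right = sorted((w for w in words if w[1] >= split_x), key=lambda w: w[2])
    return [w[0] for w in left + right]


print(line_wise(WORDS))
print(layout_wise(WORDS, split_x=150))
```

The line-wise order interleaves the two columns, which resembles the pdfplumber output above; the layout-wise order keeps each column together, which resembles the pdfminer.six output.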

However, pdfplumber does a better job of detecting text and tables, so I can't just drop it and use pdfminer.six for all my resumes.

I am pretty sure I am missing something here. Is there a setting or approach you would suggest to get layout-aware output?

I want to use only pdfplumber for this project.
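One pdfplumber-only workaround that might be worth trying is to crop each page into its columns and extract the text of each crop separately. The sketch below is untested on this resume and assumes a two-column page whose boundary is known in advance; `extract_two_columns` and `split_x` are hypothetical names, while `bbox`, `crop()` and `extract_text()` are existing pdfplumber page members.

```python
def extract_two_columns(page, split_x):
    """Return the left-column text followed by the right-column text.

    `page` is expected to behave like a pdfplumber Page (bbox, crop(),
    extract_text()). `split_x` is a hand-picked x-coordinate that
    separates the two columns (an assumption, not detected automatically).
    """
    x0, top, x1, bottom = page.bbox
    left = page.crop((x0, top, split_x, bottom))
    right = page.crop((split_x, top, x1, bottom))
    parts = [left.extract_text(), right.extract_text()]
    # extract_text() can return None or an empty string for an empty region.
    return "\n".join(part for part in parts if part)


# Possible usage (the path and the value 200 are placeholders):
#   import pdfplumber
#   with pdfplumber.open("resume.pdf") as pdf:
#       text = extract_two_columns(pdf.pages[0], split_x=200)
```

This only helps when the column boundary is roughly the same on every page; resumes with varying layouts would need the boundary estimated per page.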

THANK YOU

samkit-jain commented 3 years ago

Hi @jigsawcoder, I appreciate your interest in the library, and yes, I can see how this would be a useful feature. I am closing this issue because an existing issue, #10, already covers it. Feel free to raise a PR if you have a solution in mind, or continue the discussion at #10.