jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

No space between words in extracted text #334

Closed sivakumar05 closed 3 years ago

sivakumar05 commented 3 years ago

Hi Jsvine\Others,

I'm using the pdfplumber library to extract text from PDF files. Extraction works correctly for all of my files except one. Details are below.

Issue: The extracted text has no spaces between words, although the spaces are present in the input file.

Code used to extract the text:

    import pdfplumber

    filename = 'Vishwa_Srivastava_CV_Sep15.pdf'
    with pdfplumber.open(filename) as pdf:
        first_page = pdf.pages[0]
        # Lowercase the string before splitting; a list has no .lower() method.
        text = first_page.extract_text().lower()
        lines = text.split('\n')

Output:

'vishwa srivastava\nentrepreneur|ex-managementconsultant\nvishwa.srivastava25@gmail.com +91-9560677151 bangalore,india\nǜ linkedin.com/in/vishwa-srivastava (cid:211) ‰\nfl\nprofessional experience exposure & skillsets\nco-founder&ceo capitalmarkets wealthmanagement\npvot.in|stockmarketadvisorsmarketplace e-commerce retail industrialgoods\n2018–2020 bengaluru,india metals railways oil&gas\n(cid:17) ‰\nbuiltfullybootstrappedbusinessfromscratch,definedrevenuemodel,go-to-\n•\nmarketstrategyfortheproduct,drovecustomerandpartneracquisition\ngtmstrategy fundraising\nsecuredaninvestmenttermsheetatusd1.06mnpre-moneyvaluation\n•\non-boarded50+expertsandpartneredwithleadingbrokeragesandp2p productmanagement programmgmt.\n•\nlendingcompaniesonarevenuesharingagreement\nmixpanel wireframing\n

Please suggest how I should change my code so that the extracted text keeps the spaces between words.

Let me know if you need any additional details.

Thanks & Regards, Siva

samkit-jain commented 3 years ago

Hi @sivakumar05, I appreciate your interest in the library. Could you please attach the PDF and the code you used to extract the text from it?

sivakumar05 commented 3 years ago

[Vishwa_Srivastava_CV_Sep15.pdf](https://github.com/jsvine/pdfplumber/files/5792537/Vishwa_Srivastava_CV_Sep15.pdf)

Code:

    import pdfplumber

    filename = 'Vishwa_Srivastava_CV_Sep15.pdf'
    with pdfplumber.open(filename) as pdf:
        first_page = pdf.pages[0]
        # Lowercase the string before splitting; a list has no .lower() method.
        text = first_page.extract_text().lower()
        lines = text.split('\n')

samkit-jain commented 3 years ago

Thanks for sharing the PDF, @sivakumar05. The `.extract_text(...)` method takes two optional arguments, `x_tolerance` and `y_tolerance`.

In your case, you can use a value smaller than the default of 3, such as 1, for `x_tolerance`. With `page.extract_text(x_tolerance=1)`, the output becomes:

VISHWA SRIVASTAVA
Entrepreneur | Ex-Management Consultant
vishwa.srivastava25@gmail.com +91-9560677151 Bangalore, India
[ linkedin.com/in/vishwa-srivastava (cid:211) ‰
fl
PROFESSIONAL EXPERIENCE EXPOSURE & SKILLSETS
Co-Founder & CEO Capital Markets Wealth Management
Pvot.in |Stock Market Advisors Marketplace E-commerce Retail Industrial Goods
2018 – 2020 Bengaluru, India Metals Railways Oil & Gas
(cid:17) ‰
Built fully bootstrapped business from scratch, defined revenue model, Go-To-
•
Market Strategy for the product, drove customer and partner acquisition
GTM Strategy Fund Raising
Secured an investment term sheet at USD 1.06 Mn Pre-money valuation
•
On-boarded 50+ Experts and partnered with leading brokerages and P2P Product Management Program Mgmt.
•
lending companies on a revenue sharing agreement
Mix Panel Wireframing
Achieved >60% DAU among experts by building high engagement features
•
Marketing Strategy Market Assessment
Built & managed a 11 member team on tech, marketing, UX and content
•
Led daily scrum meetings with developers to progress on product roadmap User Research ASO
•
EDUCATION
Management Consultant
Accenture
MBA - Strategy & Operations
2019 – 2019 Bengaluru, India
Indian School of Business
(cid:17)EBITDA Improvement using Advanc‰ed Analytics | Metals
Identified & sized opportunities to apply machine learning models to improve 2016 – 2017
•
throughput & reduce cost for one of India’s largest steel manufacturer (cid:17)B.E. in Chemical Engineering
Deployed analytical models addressing opportunities worth USD 23 Mn
• MS University of Baroda
across Iron making value chain
2005 – 2009
Senior Consultant (cid:17)ACHIEVEMENTS
KPMG
2017 – 2019 Mumbai, India KPMG Kudos Award
(cid:17)Route to Market Strategy Transform‰ation | Retail & Industrial Goods Going extra mile to achieve desired re-
3
Designed new organizational structure to align with GTM strategies sults and building strong client relation-
•
ships (2018)
Set up marketing vertical and created Pan-India ATL/BTL activation plan
•
Conceptualized new product promotion schemes, conducted portfolio ratio- KPMG Super Team Award
•
nalization, ideated partner loyalty program and conducted vendor tie-ups Outstanding client service and excep-
(cid:143)
Designed remuneration – commission and incentives for distributors and tional team work (2017)
•
Sales Force, basis ROI & competitive benchmarking, helping win market 3 Commendation by Iraqi minister of
pct share across retail segments in Western & Southern zones
Natural Resources

Growth Strategy & Market Assessment | Metals & Railways
Ensuring zero downtime & ontime de-
Downstream opportunity identification & sizing for an Indian MNC livery during volatile conditions(2014)
•
Prepared investment proposal for shortlisted high growth and high EBITDA
• Honeywell Bravo Award
downstream value added sectors
Delivering outstanding customer ser-
Market Entry & Location Assessment | Petrochemicals 3
vice in Taiwan (2012)
Strategy for Indian entry via greenfield expansion for a South Korean client
• LANGUAGES
Process Consultant
Honeywell English
2009 - 2016 USA, India, EMEA, LatAm, SEA Hindi ○ ○ ○ ○ ○
(cid:17)20+ Operations improvement engag‰ements with global O&G Majors Spanish ○ ○ ○ ○ ○
○ ○ ○ ○ ○
Volunteered to lead 1st project for refinery capacity debottlenecking in Iraq
•
Regularly managed teams of 30-40 contract workers & engineers during op-
•
erationalization phase of engagements
Led intra-SBU team to commercialize Honeywell’s IoT based suite, optimizing
•
Upstream & Downstream operations
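
A minimal sketch of applying this to the original script might look like the following. The helper name `lowercase_lines` is hypothetical (it is not part of pdfplumber); it only assumes a page object whose `extract_text` accepts `x_tolerance`, as described above.

```python
def lowercase_lines(page, x_tolerance=1):
    """Return the page text as a list of lowercased lines.

    `page` is assumed to be a pdfplumber page; any object with a compatible
    `extract_text(x_tolerance=...)` method works. `lowercase_lines` is a
    hypothetical helper name, not part of pdfplumber.
    """
    # A smaller x_tolerance makes narrower gaps count as word breaks.
    # Fall back to an empty string in case no text is returned.
    text = page.extract_text(x_tolerance=x_tolerance) or ""
    # Lowercase the string first, then split: a list has no .lower() method.
    return text.lower().split("\n")


# Intended usage (needs pdfplumber and the PDF from this thread):
#
#   import pdfplumber
#   with pdfplumber.open("Vishwa_Srivastava_CV_Sep15.pdf") as pdf:
#       lines = lowercase_lines(pdf.pages[0], x_tolerance=1)
```

The right `x_tolerance` depends on the font size and spacing of the PDF, so it may be worth trying a few values and comparing the results.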
charan7799 commented 1 year ago

Thanks for sharing the PDF, @sivakumar05. The `.extract_text(...)` method takes two optional arguments, `x_tolerance` and `y_tolerance`.

* `x_tolerance` - Adds a space where the difference between the `x1` of one character and the `x0` of the next is greater than `x_tolerance`. Defaults to 3.

* `y_tolerance` - Adds a newline character where the difference between the `doctop` of one character and the `doctop` of the next is greater than `y_tolerance`. Defaults to 3.
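
To illustrate the `x_tolerance` rule, here is a simplified sketch that works on hand-made character dictionaries. The function name `join_with_spaces` and the coordinates are invented for the example, and this is not pdfplumber's actual implementation; it only mirrors the rule as worded above.

```python
def join_with_spaces(chars, x_tolerance=3):
    """Join characters on one line, adding a space where the horizontal gap
    between one character's x1 and the next character's x0 exceeds
    x_tolerance. `chars` is assumed to be in left-to-right order.
    """
    pieces = []
    prev = None
    for ch in chars:
        if prev is not None and ch["x0"] - prev["x1"] > x_tolerance:
            pieces.append(" ")
        pieces.append(ch["text"])
        prev = ch
    return "".join(pieces)


# Two letters separated by a gap of 2 units (made-up coordinates).
chars = [
    {"text": "a", "x0": 0, "x1": 5},
    {"text": "b", "x0": 7, "x1": 12},
]
print(join_with_spaces(chars, x_tolerance=3))  # gap 2 is not > 3: "ab"
print(join_with_spaces(chars, x_tolerance=1))  # gap 2 is > 1: "a b"
```

This is why tightly set text can come out with its words run together at the default of 3, and why a lower value restores the spaces.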

In your case, you can use a value smaller than the default of 3, such as 1, for `x_tolerance`, as in `page.extract_text(x_tolerance=1)`.

Hi, how can I separate text that ends up on the same line? For example, in the output above, PROFESSIONAL EXPERIENCE and EXPOSURE & SKILLSETS are printed on the same line even though they are different headers in the PDF. How can I get them to look like this: PROFESSIONAL EXPERIENCE \n EXPOSURE & SKILLSETS? Thanks

samkit-jain commented 1 year ago

@charan7799 You can use `y_tolerance` to add new lines.
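
To make the effect of `y_tolerance` concrete, here is a simplified sketch of the newline rule as quoted earlier in the thread: a newline is added where the `doctop` of consecutive characters differs by more than `y_tolerance`. The function name `split_lines` and the sample coordinates are made up for illustration, and this is not pdfplumber's actual implementation. Under this rule, characters that share the same `doctop` stay on one line, so whether it separates two side-by-side headers depends on how they are positioned in the PDF.

```python
def split_lines(chars, y_tolerance=3):
    """Group characters into lines, starting a new line whenever the
    doctop of the next character differs from the previous one by more
    than y_tolerance. `chars` is assumed to be in reading order.
    """
    lines = []
    prev = None
    for ch in chars:
        if prev is None or abs(ch["doctop"] - prev["doctop"]) > y_tolerance:
            lines.append("")
        lines[-1] += ch["text"]
        prev = ch
    return lines


# Made-up characters: B sits 2 units below A, C sits 18 units below B.
chars = [
    {"text": "A", "doctop": 100.0},
    {"text": "B", "doctop": 102.0},
    {"text": "C", "doctop": 120.0},
]
print(split_lines(chars, y_tolerance=3))  # ['AB', 'C']
print(split_lines(chars, y_tolerance=1))  # ['A', 'B', 'C']
```

With pdfplumber itself, both tolerances can be passed together, for example `page.extract_text(x_tolerance=1, y_tolerance=1)`.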