ropensci / pdftools

Text Extraction, Rendering and Converting of PDF Documents
https://docs.ropensci.org/pdftools

pdf_text parse consistently skips to the next page after word "Paget" #118

Closed kenkoonwong closed 1 year ago

kenkoonwong commented 1 year ago

This is the PDF: [image]

This is the output:

.\n\nEndocrinology\nDiagnose pheochromocytoma in a patient with multiple endocrine neoplasia type 2A.\nDiagnose the cause of secondary hypertension.\nEvaluate a woman for infertility.\nManage euthyroid sick syndrome.\nTreat an asymptomatic patient with 

I suspect it's the word "Page" that is the problem.

I am currently using:

pdftools_3.3.2 
Using poppler version 22.04.0

jeroen commented 1 year ago

Can you include the pdf file so I can test it?

kenkoonwong commented 1 year ago

Ahh, I realized that it was my own regular expression code that cut the text off at "Paget", not pdf_text! I changed the code and it works just fine now. Sorry for the confusion!
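The regular expression that caused this is not shown in the thread, so the following is only a guess at the kind of bug involved, sketched in Python rather than R. It assumes the text was being split on a bare `Page` pattern, which also matches the first four letters of "Paget". The sample text and both patterns are made up for illustration.

```python
import re

# Made-up sample text; the real PDF content is not shown in this thread.
text = (
    "Treat an asymptomatic patient with Paget disease.\n"
    "Page 12\n"
    "Evaluate a woman for infertility."
)

# Hypothetical buggy pattern: a bare "Page" also matches inside "Paget",
# so the first chunk is cut off right before that word.
loose = re.split(r"Page", text)

# Hypothetical fix: require a whole word "Page" followed by a page number.
strict = re.split(r"\bPage\s+\d+\b", text)

print(repr(loose[0]))
print(repr(strict[0]))
```

With the loose pattern, the first chunk ends at "patient with ", which resembles the truncated output reported above. With the stricter pattern, "Paget" stays intact and only the "Page 12" marker acts as a separator.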