jsvine / pdfplumber

Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
MIT License

Page 1 text extracted in Page 2 too #395

Closed sreeni5493 closed 3 years ago

sreeni5493 commented 3 years ago

Buprenorphine.pdf

import pdfplumber

# Open the PDF and print the extracted text of the first two pages.
with pdfplumber.open('./Buprenorphine.pdf') as pdf:
    pages = pdf.pages
    print(pages[0].extract_text())
    print(pages[1].extract_text())

Hi,

With the code above, a lot of text from page 1 is extracted for both page 1 and page 2. For example, "MEDICATION GUIDE" appears in capitals on page 1 only, yet both pages[0].extract_text() and pages[1].extract_text() return it, even though it does not appear on page 2 at all. The same happens with "FULL PRESCRIBING INFORMATION": it is present only on page 1 but is returned for both page 1 and page 2.

samkit-jain commented 3 years ago

The PDF contains some objects that lie outside the page. For example, the word MEDICATION has a negative top coordinate, which places it above the top edge of the page:

{'text': 'MEDICATION', 'x0': Decimal('1006.437'), 'x1': Decimal('1056.577'), 'top': Decimal('-1435.744'), 'bottom': Decimal('-1425.744'), 'upright': True, 'direction': 1}

To filter out such objects, you can crop the page to its own dimensions:

page = page.crop((0, 0, page.width, page.height))
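
To make the effect of the crop concrete, here is a minimal sketch of the kind of bounding-box overlap test that cropping to the page rectangle implies. It is not pdfplumber's implementation: `overlaps_page` is a hypothetical helper, the second word is made up for contrast, and the page size (US Letter, 612 x 792 points) is an assumption because the PDF's real dimensions are not stated in this thread.

```python
from decimal import Decimal


def overlaps_page(word, page_width, page_height):
    """Return True if the word's bounding box overlaps the page rectangle.

    Hypothetical helper for illustration; coordinates follow the word
    dict shown above (x0/x1 horizontal, top/bottom measured from the top).
    """
    return (
        word["x0"] < page_width
        and word["x1"] > 0
        and word["top"] < page_height
        and word["bottom"] > 0
    )


# The stray word reported above, plus a made-up in-page word for contrast.
words = [
    {"text": "MEDICATION", "x0": Decimal("1006.437"), "x1": Decimal("1056.577"),
     "top": Decimal("-1435.744"), "bottom": Decimal("-1425.744")},
    {"text": "Example", "x0": Decimal("72"), "x1": Decimal("120"),
     "top": Decimal("100"), "bottom": Decimal("110")},
]

# Assumed page size (US Letter, in points); the real PDF may differ.
PAGE_WIDTH, PAGE_HEIGHT = Decimal("612"), Decimal("792")

visible = [w["text"] for w in words if overlaps_page(w, PAGE_WIDTH, PAGE_HEIGHT)]
print(visible)
```

Under these assumptions, MEDICATION is dropped because its box lies entirely outside the page rectangle, while the in-page word is kept.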

sreeni5493 commented 3 years ago

Thanks.

This worked. Closing the issue.