Closed sreeni5493 closed 3 years ago
The PDF contains some objects that lie outside the page area. For example, the word MEDICATION
has a negative vertical coordinate, and its x0 is far to the right.
{'text': 'MEDICATION', 'x0': Decimal('1006.437'), 'x1': Decimal('1056.577'), 'top': Decimal('-1435.744'), 'bottom': Decimal('-1425.744'), 'upright': True, 'direction': 1}
To filter out such objects, you can crop the page to its own dimensions, for example:
page = page.crop((0, 0, page.width, page.height))
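To see why the crop drops that word, here is a minimal sketch in plain Python that mimics the idea on word dicts shaped like the one above. It is not pdfplumber's actual implementation. The helper name `overlaps_page`, the second sample word, and the 612 x 792 page size are assumptions for illustration only.

```python
from decimal import Decimal


def overlaps_page(obj, width, height):
    """Return True if the object's bounding box intersects the page area.

    Hypothetical helper: assumes the page spans (0, 0) to (width, height)
    and that obj has x0, x1, top and bottom keys, as in the dict above.
    """
    return (
        obj["x1"] > 0 and obj["x0"] < width
        and obj["bottom"] > 0 and obj["top"] < height
    )


words = [
    # The out-of-page word reported in this issue.
    {"text": "MEDICATION", "x0": Decimal("1006.437"), "x1": Decimal("1056.577"),
     "top": Decimal("-1435.744"), "bottom": Decimal("-1425.744")},
    # A made-up word that sits inside the page, for contrast.
    {"text": "visible", "x0": Decimal("72"), "x1": Decimal("110"),
     "top": Decimal("100"), "bottom": Decimal("110")},
]

# Assumed page size (US Letter in points); with pdfplumber you would
# read page.width and page.height instead.
page_width, page_height = Decimal("612"), Decimal("792")

on_page = [w for w in words if overlaps_page(w, page_width, page_height)]
print([w["text"] for w in on_page])
```

With a real PDF, the same check could be run over `page.extract_words()` to list which words fall outside the page before deciding to crop.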
Thanks.
This worked. Closing the issue.
Buprenorphine.pdf
Hi,
With the following code, a lot of text from page 1 is extracted on both page 1 and page 2. For example, "MEDICATION GUIDE": this capitalized text is present on page 1 only, yet both pages[0].extract_text() and pages[1].extract_text() return it, even though it does not appear on page 2 at all. The same happens with "FULL PRESCRIBING INFORMATION", which is returned for both page 1 and page 2 but is present only on page 1.