jalan / pdftotext

Simple PDF text extraction
MIT License

poppler/error: Failed to parse XRef entry [11]. poppler/error: Top-level pages object is wrong type (null) #123

Closed juanfrilla closed 2 months ago

juanfrilla commented 3 months ago

I'm getting this error:

poppler/error: Failed to parse XRef entry [11].
poppler/error: Top-level pages object is wrong type (null)

for this URL: https://www.asamblea.gob.sv/sites/default/files/documents/decretos/6BD1CFE2-9948-4D32-A45D-92FF50D15C0A.pdf

This is the code I'm running:

import io
import requests
import pdftotext
url = "https://www.asamblea.gob.sv/sites/default/files/documents/decretos/6BD1CFE2-9948-4D32-A45D-92FF50D15C0A.pdf"
content = requests.get(url).content
pdf = pdftotext.PDF(io.BytesIO(content))

I'm using poppler-utils-0.26.5-43.el7.1.x86_64 (pdftotext version 0.26.5) on a CentOS server, and I don't know whether I need to upgrade poppler. Is there anything I can do without upgrading it? Or is there a way to catch this poppler error and skip the PDF that triggers it?

jalan commented 2 months ago

"Is there a way of catching this poppler error and skip the PDF that gives that error"

Sure, you can include exception handling:

import io
import requests
import pdftotext

url = "https://www.asamblea.gob.sv/sites/default/files/documents/decretos/6BD1CFE2-9948-4D32-A45D-92FF50D15C0A.pdf"
content = requests.get(url).content
try:
    pdf = pdftotext.PDF(io.BytesIO(content))
except pdftotext.Error as exception:
    # Do whatever you want here
    print(f"I couldn't open that PDF: {exception}")
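
If you're looping over many PDFs, the same try/except can be wrapped in a small helper that records the failures and keeps going. This is only a rough sketch: `parse_all` and its parameters are hypothetical names for illustration, not part of pdftotext, and the parser and exception types are passed in so the helper itself has no dependencies.

```python
from typing import Callable, Dict, Iterable, Tuple, Type


def parse_all(
    documents: Iterable[Tuple[str, bytes]],
    parse: Callable[[bytes], object],
    errors: Tuple[Type[Exception], ...],
) -> Tuple[Dict[str, object], Dict[str, str]]:
    """Parse each (name, raw bytes) pair, skipping the ones that fail.

    Hypothetical helper: returns the successfully parsed documents and
    a mapping of skipped names to their error messages.
    """
    parsed: Dict[str, object] = {}
    skipped: Dict[str, str] = {}
    for name, data in documents:
        try:
            parsed[name] = parse(data)
        except errors as exc:
            # Record the failure and move on to the next document.
            skipped[name] = str(exc)
    return parsed, skipped
```

With pdftotext you would pass something like `lambda data: pdftotext.PDF(io.BytesIO(data))` as `parse` and `(pdftotext.Error,)` as `errors`, then inspect `skipped` afterwards to see which URLs need another look.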