impira / docquery

An easy way to extract information from documents
MIT License
1.71k stars, 127 forks

Default Experience Should Not Require Poppler for PDFs #20

Open ankrgyl opened 2 years ago

ankrgyl commented 2 years ago

PDFs rely on Poppler to create image previews; however, for certain models (e.g. LayoutLMv1) these previews are unnecessary if the file has embedded text. We should make sure that the default scenario, in which Poppler is not available, still works.
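The behaviour requested here could be sketched roughly as follows. This is a minimal illustration, not docquery's actual implementation: `poppler_available`, `page_previews`, and the parameters `has_embedded_text`, `render`, and `poppler_ok` are hypothetical names introduced for the example. The only external fact it relies on is that pdf2image shells out to Poppler's `pdfinfo` and `pdftoppm` executables.

```python
import shutil
from typing import Callable, List, Optional


def poppler_available() -> bool:
    """Return True if Poppler's command-line tools can be found on PATH."""
    return all(shutil.which(tool) is not None for tool in ("pdfinfo", "pdftoppm"))


def page_previews(
    pdf_bytes: bytes,
    has_embedded_text: bool,
    render: Callable[[bytes], List[object]],
    poppler_ok: Optional[bool] = None,
) -> Optional[List[object]]:
    """Render page previews only when Poppler is present.

    `render` is a hypothetical callable standing in for something like
    pdf2image.convert_from_bytes. `poppler_ok` lets a caller (or a test)
    override the PATH lookup.
    """
    if poppler_ok is None:
        poppler_ok = poppler_available()
    if poppler_ok:
        return render(pdf_bytes)
    if has_embedded_text:
        # Text-only models can work from the embedded text, so skip previews.
        return None
    raise RuntimeError(
        "This PDF has no embedded text and Poppler was not found on PATH, "
        "so page images cannot be rendered."
    )
```

With a check like this, a missing Poppler install would only become an error for scanned PDFs or for models that genuinely need page images.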

RamesanPP commented 1 year ago

I am getting an error from the pdf2image library that tells me to install Poppler and add it to PATH. This is my code:

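# Imports assumed from docquery's documented usage
from docquery import document, pipeline

def doc_type(temp_path):
    # Build the document question-answering pipeline
    p = pipeline('document-question-answering')
    doc = document.load_document(temp_path)
    # doc.context renders page images for PDFs, which is where Poppler is needed
    response = p("What type of document is this?", **doc.context)
    return response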

The error I receive is:

    response = p("What type of document is this?", **doc.context)
                                                     ^^^^^^^^^^^^
  File "C:\Users\Cirruslabs\AppData\Local\Programs\Python\Python311\Lib\functools.py", line 1001, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Cirruslabs\Documents\GitHub\Document-Processing-BE\venv\Lib\site-packages\docquery\document.py", line 117, in context
    images = self._images
             ^^^^^^^^^^^^
  File "C:\Users\Cirruslabs\AppData\Local\Programs\Python\Python311\Lib\functools.py", line 1001, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Cirruslabs\Documents\GitHub\Document-Processing-BE\venv\Lib\site-packages\docquery\document.py", line 156, in _images
    return [x.convert("RGB") for x in pdf2image.convert_from_bytes(self.b)]
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Cirruslabs\Documents\GitHub\Document-Processing-BE\venv\Lib\site-packages\pdf2image\pdf2image.py", line 358, in convert_from_bytes
    return convert_from_path(
           ^^^^^^^^^^^^^^^^^^
  File "C:\Users\Cirruslabs\Documents\GitHub\Document-Processing-BE\venv\Lib\site-packages\pdf2image\pdf2image.py", line 127, in convert_from_path
    page_count = pdfinfo_from_path(
                 ^^^^^^^^^^^^^^^^^^
  File "C:\Users\Cirruslabs\Documents\GitHub\Document-Processing-BE\venv\Lib\site-packages\pdf2image\pdf2image.py", line 594, in pdfinfo_from_path
    raise PDFInfoNotInstalledError(
pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

Is there any workaround for this? I've tried installing poppler-utils and pdf2image, but it still doesn't work.
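Until the default path stops requiring Poppler, one possible workaround is sketched below. It rests on two assumptions: that a Poppler build for Windows has been downloaded separately (the pdf2image Python package does not include the Poppler binaries), and that the traceback above means `pdfinfo` is simply not visible on PATH. The helper `add_poppler_to_path` and the directory passed to it are hypothetical. The function prepends that directory to PATH for the current process only.

```python
import os
import shutil


def add_poppler_to_path(poppler_bin_dir: str) -> bool:
    """Prepend poppler_bin_dir to this process's PATH.

    Returns True if `pdfinfo` can be located afterwards, False otherwise.
    """
    # Drop empty entries so a blank PATH does not produce a stray separator.
    parts = [p for p in os.environ.get("PATH", "").split(os.pathsep) if p]
    if poppler_bin_dir not in parts:
        parts.insert(0, poppler_bin_dir)
        os.environ["PATH"] = os.pathsep.join(parts)
    return shutil.which("pdfinfo") is not None


# Hypothetical usage; replace the path with wherever Poppler's bin folder lives:
# if not add_poppler_to_path(r"C:\path\to\poppler\bin"):
#     print("pdfinfo still not found; check the folder")
```

Calling this before `p(..., **doc.context)` should let pdf2image find `pdfinfo` without changing system-wide settings, provided the folder really contains the Poppler executables.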