impira / docquery

An easy way to extract information from documents
MIT License

Quickstart CLI not working with PDFs #8

Open ThorTL67 opened 2 years ago

ThorTL67 commented 2 years ago

Steps to reproduce (Following the QuickStart (CLI) guide):

Observe error: pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

Fix: Install poppler-utils
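
To confirm the fix took effect, one option is to check that the poppler command-line tools are visible to the same Python environment that runs docquery. The sketch below is not part of docquery; the helper names `find_poppler_tools` and `missing_poppler_tools` are made up for illustration. It assumes the standard poppler-utils binaries `pdfinfo` and `pdftoppm`, which pdf2image invokes, and uses only the standard library.

```python
import shutil

# Standard poppler-utils binaries that pdf2image shells out to.
POPPLER_TOOLS = ("pdfinfo", "pdftoppm")


def find_poppler_tools(path=None):
    """Map each tool name to its resolved location, or None if it is not found.

    `path` is an optional search path string; by default the current PATH is used.
    """
    return {tool: shutil.which(tool, path=path) for tool in POPPLER_TOOLS}


def missing_poppler_tools(path=None):
    """Return the tool names that could not be located."""
    return [tool for tool, location in find_poppler_tools(path).items() if location is None]


if __name__ == "__main__":
    missing = missing_poppler_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("poppler tools found:", find_poppler_tools())
```

If this reports missing tools even after installing poppler-utils, the package was probably installed somewhere other than the environment (container, virtualenv, or shell) that runs Python.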

My environment: macOS on Apple Silicon, run via the python:3 Docker image.

It may be worth adding to the README to install poppler-utils. I'm happy to open a PR for this - also happy to open a PR for a basic Docker configuration if that's something you would like.

This is my first open-source contribution so apologies if I've missed some formalities - and nice project!

ankrgyl commented 2 years ago

Hi @ThorTL67. You are definitely right that this experience could be improved.

Technically, if you're running with LayoutLM (the default model), and the document contains embedded text, poppler isn't needed, since we'll use the PDF's text directly. I think an ideal change would be to fix that pathway (probably via refactoring the cached properties a bit in ) to not even place the images in the context object. On the flip side, the Document object does not know which model it's being used for, so we'd additionally need to communicate that (e.g. by tracking which models do not need images here).
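
A rough sketch of what that idea could look like, purely for illustration: this is not docquery's actual code, and every name in it (`MODELS_WITHOUT_IMAGES`, `PdfDocumentSketch`, `render_pages`, the model-name strings) is hypothetical. It assumes page rendering is the only step that needs poppler, and it defers that step behind a cached property so it runs only when a model actually needs images.

```python
from functools import cached_property

# Hypothetical registry of model names that can work from embedded text alone.
MODELS_WITHOUT_IMAGES = {"layoutlm"}


class PdfDocumentSketch:
    """Illustrative stand-in for a document object; not docquery's Document class."""

    def __init__(self, embedded_words, render_pages):
        self._embedded_words = embedded_words  # words from the PDF text layer (may be empty)
        self._render_pages = render_pages      # callable that rasterizes pages (needs poppler)

    @cached_property
    def images(self):
        # Rendering happens on first access only, then the result is cached.
        return self._render_pages()

    def context(self, model_name):
        """Build the model input, adding images only when they are required."""
        ctx = {"words": self._embedded_words}
        needs_images = (
            model_name not in MODELS_WITHOUT_IMAGES or not self._embedded_words
        )
        if needs_images:
            ctx["images"] = self.images
        return ctx
```

With this shape, a text-only model on a PDF with embedded text never triggers rendering, so a missing poppler install would only surface for scanned documents or image-based models.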

That said, I think improving the docs and having a pre-built docker container are no-brainer improvements. If you're up for it, I'd be very receptive to a PR! We do not have many formalities yet as this is a new project, so your request is perfect :)

deeptigoyal commented 1 year ago

Hi all,

I installed poppler-utils but am still getting this error: pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

Can you please help? What am I missing?

Thanks