Quickstart CLI not working with PDFs

ThorTL67 commented 2 years ago

Steps to reproduce (Following the QuickStart (CLI) guide):

Run pip install docquery
Run apt-get install tesseract-ocr
Run docquery scan "What is the invoice number?" https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf

Observe error: pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

Fix: Install `poppler-utils`
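
A quick way to confirm the diagnosis is to check whether the external binaries are visible to the Python process that runs the CLI. This is only a minimal sketch: it assumes the page count comes from poppler's `pdfinfo` binary (which is what the error message points at) and that OCR needs the `tesseract` binary. The helper name `missing_binaries` is made up for illustration and is not part of docquery.

```python
import shutil


def missing_binaries(names=("pdfinfo", "tesseract")):
    """Return the names from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_binaries()
    if missing:
        # Assumption: on Debian-based images, pdfinfo ships in poppler-utils.
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("pdfinfo and tesseract are both on PATH")
```

If `pdfinfo` is reported as missing even after installing `poppler-utils`, the package was probably installed in a different environment (for example on the host rather than inside the container) from the one running Python.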

My environment: macOS on Apple Silicon, run via the python:3 Docker image.

It may be worth adding to the README to install poppler-utils. I'm happy to open a PR for this - also happy to open a PR for a basic Docker configuration if that's something you would like.

This is my first open-source contribution so apologies if I've missed some formalities - and nice project!

ankrgyl commented 2 years ago

Hi @ThorTL67. You are definitely right that this experience could be improved.

Technically, if you're running with LayoutLM (the default model) and the document contains embedded text, poppler isn't needed, since we'll use the PDF's text directly. I think an ideal change would be to fix that pathway (probably by refactoring the cached properties a bit) so that it does not even place the images in the context object. On the flip side, the Document object does not know which model it's being used for, so we'd additionally need to communicate that (e.g. by tracking which models do not need images).
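
To make the idea concrete, here is a rough sketch of what lazily rendered images plus a "which models need images" lookup could look like. Everything in it is hypothetical: the `Document` class, the `TEXT_ONLY_MODELS` set, the `render_pages` callable, and the model names are illustrative stand-ins, not docquery's actual code.

```python
from functools import cached_property

# Hypothetical registry of models that can work from embedded text alone.
TEXT_ONLY_MODELS = {"layoutlm"}


class Document:
    """Toy stand-in for a document class; not docquery's real implementation."""

    def __init__(self, embedded_text, render_pages):
        self._embedded_text = embedded_text
        # Callable that would invoke poppler to rasterize the pages.
        self._render_pages = render_pages

    @cached_property
    def images(self):
        # Evaluated on first access only, so poppler is never touched
        # unless some caller actually asks for the images.
        return self._render_pages()

    def context(self, model_name):
        ctx = {"text": self._embedded_text}
        # Images are needed if the model is not text-only, or if there is
        # no embedded text to fall back on (OCR would be required).
        needs_images = model_name not in TEXT_ONLY_MODELS or not self._embedded_text
        if needs_images:
            ctx["images"] = self.images
        return ctx
```

With this shape, `Document("some text", render).context("layoutlm")` never calls `render`, so a missing poppler install would only surface for models or documents that really need page images.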

That said, I think improving the docs and having a pre-built docker container are no-brainer improvements. If you're up for it, I'd be very receptive to a PR! We do not have many formalities yet as this is a new project, so your request is perfect :)

deeptigoyal commented 1 year ago

Hi all,

I installed poppler-utils but am still getting this error: pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

Can you please help? What am I missing?

Thanks
