Open ThorTL67 opened 2 years ago
Hi @ThorTL67. You are definitely right that this experience could be improved.
Technically, if you're running with LayoutLM (the default model) and the document contains embedded text, poppler isn't needed, since we'll use the PDF's text directly. I think an ideal change would be to fix that pathway (probably by refactoring the cached properties a bit) so that the images are not even placed in the context object. On the flip side, the Document object does not know which model it's being used for, so we'd additionally need to communicate that (e.g. by tracking which models do not need images).
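As a rough illustration of that idea, here is a minimal sketch, not the project's actual code. The names `SketchDocument`, `MODELS_WITHOUT_IMAGES`, and `render_calls` are hypothetical and only stand in for the real classes; the point is that a lazily cached `images` property is never evaluated when the model can work from embedded text alone.

```python
from dataclasses import dataclass
from functools import cached_property

# Hypothetical registry of models that can work from text alone (assumption).
MODELS_WITHOUT_IMAGES = {"layoutlm"}


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a document class; not the real API."""

    embedded_text: str  # text read directly from the PDF, "" if there is none
    render_calls: int = 0  # counts how often page rendering was triggered

    @cached_property
    def images(self):
        # Stand-in for the poppler-backed page rendering step.
        self.render_calls += 1
        return ["<page image>"]

    def context(self, model_name):
        ctx = {"text": self.embedded_text}
        # Render images only if the model needs them or there is no embedded text.
        needs_images = (
            model_name not in MODELS_WITHOUT_IMAGES or not self.embedded_text
        )
        if needs_images:
            ctx["images"] = self.images
        return ctx
```

With this shape, calling `context("layoutlm")` on a document with embedded text never touches `images`, so the rendering dependency would not be exercised on that path.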
That said, I think improving the docs and having a pre-built docker container are no-brainer improvements. If you're up for it, I'd be very receptive to a PR! We do not have many formalities yet as this is a new project, so your request is perfect :)
Hi all,
I installed poppler-utils but am still getting the error "pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?"
Can you please help? What am I missing?
Thanks
Steps to reproduce (Following the QuickStart (CLI) guide):
pip install docquery
apt-get install tesseract-ocr
docquery scan "What is the invoice number?" https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf
Observe error:
pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?
Fix: Install poppler-utils
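If the error persists after installing, a quick check of whether the Poppler command-line tools are visible on PATH can help narrow things down. The following is a small diagnostic sketch; the helper name `missing_poppler_tools` is made up for illustration, and it only relies on the standard-library `shutil.which`. The tools it looks for, `pdfinfo` and `pdftoppm`, are the Poppler binaries that pdf2image invokes.

```python
import shutil

# Poppler command-line tools that pdf2image shells out to.
POPPLER_TOOLS = ("pdfinfo", "pdftoppm")


def missing_poppler_tools(tools=POPPLER_TOOLS):
    """Return the names of the given tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_poppler_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Poppler tools found on PATH.")
```

Run it with the same interpreter and in the same environment (for example, inside the same container) as docquery, since PATH can differ between shells and environments.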
My environment: macOS on Apple Silicon, run via the Python:3 Docker image.
It may be worth adding a note to the README about installing poppler-utils. I'm happy to open a PR for this, and also a PR for a basic Docker configuration if that's something you would like.
This is my first open-source contribution, so apologies if I've missed some formalities. Nice project!