shahrukhx01 / multilingual-pdf2text

A Python library for extracting text from PDFs without losing the formatting of the PDF content.
MIT License
72 stars, 11 forks

Poppler dependency? #3

Closed: zzj0402 closed this issue 2 years ago

zzj0402 commented 2 years ago
convert_to_text.py
INFO:multilingual_pdf2text.doc2img.parse_document:Parsing document from pdf to image
INFO:multilingual_pdf2text.doc2img.parse_document:Unable to get page count. Is poppler installed and in PATH?
INFO:multilingual_pdf2text.ocr.image_to_text:Extracting text from images via OCR
[]
shahrukhx01 commented 2 years ago

@zzj0402 Hey, which OS are you using? On a Linux-based distro that uses apt, you can resolve this by installing the following dependencies:

apt install tesseract-ocr
apt install libtesseract-dev
apt-get install poppler-utils
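
To confirm the binaries are actually visible after installing, a quick check along these lines may help. This is only a minimal sketch and not part of the library: the helper name `missing_binaries` is hypothetical, and it assumes that `pdfinfo` and `pdftoppm` (both shipped with poppler-utils) and `tesseract` (shipped with tesseract-ocr) are the executables that need to be on PATH.

```python
import shutil

# Executables assumed to be required (an assumption, not taken from the library):
# pdfinfo and pdftoppm come from poppler-utils, tesseract from tesseract-ocr.
REQUIRED = ("pdfinfo", "pdftoppm", "tesseract")


def missing_binaries(names=REQUIRED):
    """Return the names that cannot be found on PATH (hypothetical helper)."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_binaries()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required binaries found on PATH.")
```

If `pdfinfo` or `pdftoppm` is reported as missing, that would be consistent with the "Is poppler installed and in PATH?" message in the log above.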