shahrukhx01 / multilingual-pdf2text

A python library for extracting text from PDFs without losing the formatting of the PDF content.
MIT License
72 stars, 11 forks

Does not have support for windows? #2

Closed: ghost closed this issue 2 years ago

ghost commented 2 years ago

Hi, first of all the library is really good.

I tried to run this library on Windows 10 and it doesn't work. I believe I did everything right: I installed Tesseract and ran the following code:

from multilingual_pdf2text.pdf2text import PDF2Text
from multilingual_pdf2text.models.document_model.document import Document
import logging

from utils import write_txt

logging.basicConfig(level=logging.INFO)

def main():
    ## create document for extraction with configurations
    pdf_document = Document(document_path="./pdfs_samples/page1.pdf", language="por")
    pdf2text = PDF2Text(document=pdf_document)
    content = pdf2text.extract()
    for page in content:
        print(page["text"])
        write_txt(page["text"], filename="output_multilingual_pdf2text1.txt")

if __name__ == "__main__":
    main()

I ran this same code on Linux (Ubuntu 20.04) and it worked perfectly. So I was wondering: does the library not support Windows?

shahrukhx01 commented 2 years ago

@richecr As long as you are able to install Tesseract on Windows, this library should work fine. You can take a look at this article: "Installing and using Tesseract 4 on windows 10".
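
For anyone hitting the same problem, a quick check is whether the `tesseract` executable is visible from Python at all. OCR wrappers generally launch it as a subprocess, so an install that is not on `PATH` can look like missing Windows support. The snippet below is only a minimal sketch using the standard library: the helper name `find_tesseract` is hypothetical, and `C:\Program Files\Tesseract-OCR` is an assumed default install directory that may differ on your machine.

```python
import os
import shutil
from typing import Iterable, Optional

# Assumption: a common default install directory for Windows Tesseract
# installers. Adjust this if Tesseract was installed somewhere else.
DEFAULT_WINDOWS_DIRS = (r"C:\Program Files\Tesseract-OCR",)


def find_tesseract(extra_dirs: Iterable[str] = DEFAULT_WINDOWS_DIRS) -> Optional[str]:
    """Return the full path of the tesseract executable, or None if not found.

    PATH is searched first, then each directory in extra_dirs.
    """
    found = shutil.which("tesseract")
    if found:
        return found
    for directory in extra_dirs:
        # shutil.which accepts an explicit search path via `path=`.
        found = shutil.which("tesseract", path=directory)
        if found:
            return found
    return None


if __name__ == "__main__":
    location = find_tesseract()
    if location is None:
        print("tesseract was not found on PATH or in the assumed install directory.")
    elif shutil.which("tesseract") is None:
        # Found only in the fallback directory, so PATH likely needs updating.
        print(f"Found at {location}, but its folder is not on PATH:")
        print(f"  {os.path.dirname(location)}")
    else:
        print(f"tesseract is on PATH: {location}")
```

If the script reports that the folder is not on `PATH`, adding that folder to the `PATH` environment variable and opening a new terminal is the usual fix.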