shahrukhx01 / multilingual-pdf2text

A python library for extracting text from PDFs without losing the formatting of the PDF content.
MIT License
72 stars, 11 forks

content = pdf2text.extract() taking a lot of time before crashing colab #4

Closed: mobassir94 closed this issue 2 years ago

mobassir94 commented 2 years ago

Thank you for making this awesome library. I am trying to make a Bengali tafsir reader using your repository. Here is the code that I tried in Colab:

!pip install gTTS
#!pip install PyPDF2
!pip install playsound
!pip install multilingual-pdf2text==1.1.0
!apt install tesseract-ocr
!apt install libtesseract-dev
!apt-get install poppler-utils 

!apt-get install tesseract-ocr-ara
!apt-get install tesseract-ocr-ben

from multilingual_pdf2text.pdf2text import PDF2Text
from multilingual_pdf2text.models.document_model.document import Document
import logging
logging.basicConfig(level=logging.INFO)

def main():
    ## create document for extraction with configurations
    pdf_document = Document(
        document_path='/content/tafsir.pdf',
        language='ben'
        )
    pdf2text = PDF2Text(document=pdf_document)
    content = pdf2text.extract()
    for page in content:
        print(page['text'])

if __name__ == "__main__":
    main()

It takes a long time and appears to get stuck after printing this:

INFO:multilingual_pdf2text.doc2img.parse_document:Parsing document from pdf to image
INFO:multilingual_pdf2text.ocr.image_to_text:Extracting text from images via OCR

After a few minutes Colab crashes; it seems the notebook exhausts all of Colab's available RAM. The PDF book I am trying to read with this library is written in Bangla and Arabic. Here is the link to the PDF: https://i-onlinemedia.net/downloads/books/quran-tafsir/tafsir_ibn_kasir/Tafsir_Ibn_Kasir_Part-1-2-3.pdf

mobassir94 commented 2 years ago

It works just fine on Kaggle: https://www.kaggle.com/mobassir/bengali-tafsir-ibn-kathir-pdf2text?scriptVersionId=86186646
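
For anyone hitting the same crash on a large PDF: the log order above suggests the whole document is rendered to images before OCR starts, which may be what exhausts RAM. That is a guess, not something confirmed in this thread. One possible workaround is to skip the library and OCR the file a few pages at a time with pdf2image and pytesseract directly. The sketch below is not this library's API: the helper names `page_batches` and `ocr_pdf_in_batches`, the batch size of 10, and the 200 dpi setting are all assumptions chosen for illustration. It passes `ben+ara` so Tesseract uses both installed language packs.

```python
from typing import Iterator, Tuple


def page_batches(total_pages: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield 1-based, inclusive (first_page, last_page) ranges covering the document."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def ocr_pdf_in_batches(pdf_path: str, lang: str = "ben+ara",
                       batch_size: int = 10, dpi: int = 200) -> Iterator[dict]:
    """OCR a PDF a few pages at a time (hypothetical helper, not part of the library)."""
    # Imported lazily so page_batches() works without the OCR stack installed.
    from pdf2image import convert_from_path, pdfinfo_from_path
    import pytesseract

    total_pages = pdfinfo_from_path(pdf_path)["Pages"]
    for first, last in page_batches(total_pages, batch_size):
        # Render only this page range instead of the whole document.
        images = convert_from_path(pdf_path, dpi=dpi,
                                   first_page=first, last_page=last)
        for offset, image in enumerate(images):
            yield {"page_number": first + offset,
                   "text": pytesseract.image_to_string(image, lang=lang)}
        del images  # drop this batch's images before rendering the next one


# Usage (assumes poppler-utils and the Tesseract language packs are installed):
# for page in ocr_pdf_in_batches('/content/tafsir.pdf'):
#     print(page['text'])
```

A smaller `batch_size` or lower `dpi` should reduce peak memory further, at the cost of speed or OCR quality.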