Thank you for making this awesome library. I am trying to make a Bengali tafsir reader using your repository.
Here is the code that I tried in Colab:
!pip install gTTS
#!pip install PyPDF2
!pip install playsound
!pip install multilingual-pdf2text==1.1.0
!apt install tesseract-ocr
!apt install libtesseract-dev
!apt-get install poppler-utils
!apt-get install tesseract-ocr-ara
!apt-get install tesseract-ocr-ben
from multilingual_pdf2text.pdf2text import PDF2Text
from multilingual_pdf2text.models.document_model.document import Document
import logging

logging.basicConfig(level=logging.INFO)


def main():
    # Create document for extraction with configurations
    pdf_document = Document(
        document_path='/content/tafsir.pdf',
        language='ben'
    )
    pdf2text = PDF2Text(document=pdf_document)
    content = pdf2text.extract()
    for page in content:
        print(page['text'])


if __name__ == "__main__":
    main()
It takes a long time and appears to be stuck after printing this:
INFO:multilingual_pdf2text.doc2img.parse_document:Parsing document from pdf to image
INFO:multilingual_pdf2text.ocr.image_to_text:Extracting text from images via OCR
After a few minutes Colab crashes. It seems the notebook exhausts all of the RAM available in Colab and is then killed. The PDF book that I am trying to read with this library is written in Bangla and Arabic. Here is the link to the PDF: https://i-onlinemedia.net/downloads/books/quran-tafsir/tafsir_ibn_kasir/Tafsir_Ibn_Kasir_Part-1-2-3.pdf
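
A minimal sketch of one possible workaround, assuming the crash comes from all page images being held in memory at the same time (this is a guess, not something confirmed from the library's source). It does not use multilingual_pdf2text at all. Instead it calls pdf2image and pytesseract directly and converts only a few pages per step. The helper names `page_batches` and `ocr_in_batches`, the batch size, the DPI, and the combined `ben+ara` language string are all illustrative choices, not part of the library.

```python
from typing import Iterator, Tuple


def page_batches(total_pages: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield 1-based, inclusive (first_page, last_page) ranges covering the document."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def ocr_in_batches(pdf_path: str, lang: str = "ben+ara",
                   batch_size: int = 5, dpi: int = 200) -> Iterator[str]:
    """Yield the OCR text of each page, rendering only batch_size pages at a time."""
    # Imported lazily so page_batches works even without the OCR stack installed.
    from pdf2image import convert_from_path, pdfinfo_from_path
    import pytesseract

    total_pages = pdfinfo_from_path(pdf_path)["Pages"]
    for first, last in page_batches(total_pages, batch_size):
        # Only the pages in this range are rendered to images.
        images = convert_from_path(pdf_path, dpi=dpi,
                                   first_page=first, last_page=last)
        for image in images:
            yield pytesseract.image_to_string(image, lang=lang)
        # Drop the rendered pages before the next batch is converted.
        del images


# Example use in Colab (path taken from the code above):
# for text in ocr_in_batches('/content/tafsir.pdf'):
#     print(text)
```

Lowering `dpi` or `batch_size` should reduce peak memory further, at the cost of OCR quality or speed. The `ben+ara` setting relies on the tesseract-ocr-ben and tesseract-ocr-ara packages installed above.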