shahrukhx01 / multilingual-pdf2text

A Python library for extracting text from PDFs without losing the formatting of the PDF content.
MIT License
72 stars, 11 forks

import error #1

Closed: dstoekl closed this issue 3 years ago

dstoekl commented 3 years ago

Hi there. On Colab, running: from multilingual_pdf2text.pdf2text import PDF2Text gives:

TypeError                                 Traceback (most recent call last)

in ()
----> 1 from multilingual_pdf2text.pdf2text import PDF2Text
      2 from multilingual_pdf2text.models.document_model.document import Document
      3 import logging
      4 logging.basicConfig(level=logging.INFO)
      5

2 frames

/usr/local/lib/python3.7/dist-packages/multilingual_pdf2text/doc2img/parse_document.py in PDF2Images()
     12         self.logger = logging.getLogger(__name__)
     13
---> 14     def convert_document_to_images(self, document: Document) -> list[PpmImageFile]:
     15         """
     16         Converts the Document object to

TypeError: 'type' object is not subscriptable

shahrukhx01 commented 3 years ago

@dstoekl Thank you for pointing out the issue. I have fixed the bug and released a newer version of the package. I have also created a sample notebook on Colab where you can test out the library. To use other languages, you will have to install the corresponding language pack for Tesseract.

Notebook: https://colab.research.google.com/drive/1a4lCsxedHGIFgpoyHWYF5v4fLVxKmZpQ?usp=sharing

Installing language packs: https://ocrmypdf.readthedocs.io/en/latest/languages.html
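
For context on the error itself: the traceback points at the return annotation list[PpmImageFile] being evaluated on Python 3.7, where the builtin list type cannot be subscripted (that syntax is only supported from Python 3.9). The thread does not show how the released fix was implemented, so the following is only a minimal sketch of one common way to write such an annotation compatibly. The Page class and the simplified function signature are hypothetical stand-ins, not the library's actual code.

```python
from typing import List


class Page:
    """Hypothetical stand-in for an image type such as PIL's PpmImageFile."""


def convert_document_to_images(paths: List[str]) -> List[Page]:
    # typing.List[...] can be subscripted on Python 3.7;
    # the builtin form list[...] only works from Python 3.9 onward.
    # Illustrative body only: one placeholder Page per input path.
    return [Page() for _ in paths]


pages = convert_document_to_images(["a.pdf", "b.pdf"])
print(len(pages))  # prints 2
```

An alternative on Python 3.7 and later is to put from __future__ import annotations at the top of the module, which defers evaluation of annotations so that list[...] in a signature is not evaluated at definition time.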

dstoekl commented 3 years ago

Great! Very helpful! Many thanks!