jlsutherland / doc2text

Detect text blocks and OCR poorly scanned PDFs in bulk. Python module available via pip.
MIT License
1.27k stars, 98 forks

Maybe a stupid question about the API: can't find it in the source code #31

Closed: LongxingTan closed this issue 4 years ago

LongxingTan commented 4 years ago

Thanks for the nice code.

I have a question about the code. The examples show the following usage:

import doc2text

doc = doc2text.Document()

# You can pass the lang (as 3 letters code) to the class to improve accuracy
# On ubuntu it requires the package tesseract-ocr-$lang$
# On other OS, see https://github.com/tesseract-ocr/langdata
doc = doc2text.Document(lang="eng")

# Read the file in. Currently accepts pdf, png, jpg, bmp, tiff.
# If reading a PDF, doc2text will split the PDF into its component pages.
doc.read('./path/to/my/file')

# Crop the pages down to estimated text regions, deskew, and optimize for OCR.
doc.process()

# Extract text from the pages.
doc.extract_text()
text = doc.get_text()

But when I look for API methods such as .process and .read, I can't find them in the source. Any suggestions? Thanks.
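
One general way to track down where methods like these are defined is Python's standard `inspect` module. The following is a minimal sketch, not something taken from the doc2text repository: the helper names `where_defined` and `public_methods` are hypothetical, and the demonstration uses a standard-library class so that it runs without doc2text installed.

```python
import inspect
import json


def where_defined(obj):
    """Return (source file, first line number) for a class, function, or method."""
    source_file = inspect.getsourcefile(obj)
    _, first_line = inspect.getsourcelines(obj)
    return source_file, first_line


def public_methods(cls):
    """List the names of public functions defined on a class or its bases."""
    return sorted(
        name
        for name, member in inspect.getmembers(cls, inspect.isfunction)
        if not name.startswith("_")
    )


# Demonstrated on a standard-library class so the sketch runs without doc2text.
path, line = where_defined(json.JSONDecoder.decode)
print(path, line)
print(public_methods(json.JSONDecoder))

# With doc2text installed, the same helpers could be pointed at the class used
# in the example above (assumption: its methods are plain Python functions):
#   import doc2text
#   print(where_defined(doc2text.Document.read))
#   print(public_methods(doc2text.Document))
```

The printed file path and line number show which installed file to open, which helps when the class lives in a package's `__init__.py` or another file that is easy to overlook while browsing a repository.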