jlsutherland / doc2text

Detect text blocks and OCR poorly scanned PDFs in bulk. Python module available via pip.
MIT License
1.27k stars, 97 forks

Add support for lang parameter #18

Closed rcatajar closed 8 years ago

rcatajar commented 8 years ago

This allows initializing the Document class with a lang that will be passed to tesseract. (Giving tesseract a language sometimes greatly improves text extraction quality.)
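As a rough illustration of the idea (not the code from this PR), the sketch below shows one way a language option given at construction time could be forwarded to the tesseract command line through its `-l` flag. The class name `OcrDocument` and the method `tesseract_command` are hypothetical and are not part of doc2text.

```python
from typing import List, Optional


class OcrDocument:
    """Hypothetical stand-in for a document class that accepts a language."""

    def __init__(self, lang: Optional[str] = None) -> None:
        # Language code to hand to tesseract, e.g. "eng" or "fra".
        # None means "let tesseract use its default language".
        self.lang = lang

    def tesseract_command(self, image_path: str) -> List[str]:
        """Build the tesseract CLI invocation for a single image.

        "stdout" as the output base asks tesseract to print the
        recognized text instead of writing it to a file.
        """
        cmd = ["tesseract", image_path, "stdout"]
        if self.lang:
            # -l selects the language data tesseract should use.
            cmd += ["-l", self.lang]
        return cmd
```

The command list could then be passed to `subprocess.run`; when no language is given, the `-l` flag is left out so tesseract falls back to its default.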

On Ubuntu this requires installing the package tesseract-ocr-$lang$, where $lang$ is the 3-letter code for the language. On other OSes, language data for tesseract can be found at https://github.com/tesseract-ocr/langdata
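A small sketch of the naming rule described above, assuming a plain 3-letter lowercase code as stated in the comment. The helper names `ubuntu_package_for` and `apt_install_command` are made up for illustration, and codes with other shapes are deliberately rejected here.

```python
import re
from typing import List


def ubuntu_package_for(lang: str) -> str:
    """Return the Ubuntu package name for a 3-letter tesseract language code."""
    # Assumption: only plain 3-letter lowercase codes are handled.
    if not re.fullmatch(r"[a-z]{3}", lang):
        raise ValueError(f"expected a 3-letter language code, got {lang!r}")
    return f"tesseract-ocr-{lang}"


def apt_install_command(lang: str) -> List[str]:
    """Build (but do not run) the apt-get command that installs the language data."""
    return ["sudo", "apt-get", "install", ubuntu_package_for(lang)]
```

For French, `ubuntu_package_for("fra")` gives `tesseract-ocr-fra`.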

rcatajar commented 8 years ago

@jlsutherland Sorry, my branch also includes my commit for #17 ... The only relevant commit here is 5d9b0a8

jlsutherland commented 8 years ago

This is a nice feature add. Thanks @rcatajar!