uhh-lt / amharicprocessor

Amharic Segmenter and tokenizer
MIT License

Add support for Amharic language OCR #3

Closed lewiEyasu closed 1 year ago

lewiEyasu commented 1 year ago

The AmharicOCR class is a Python class that performs Optical Character Recognition (OCR) on Amharic PDF files using Tesseract OCR. For each page of the input PDF, it takes the page image, converts it to grayscale, and runs Tesseract OCR on the grayscale image to extract the text.
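For readers who want a feel for the flow described above, here is a minimal sketch of the per-page pipeline. It is not the code from this pull request: the class name `AmharicOCRSketch`, its method names, and the choice of `pdf2image` for rendering pages are assumptions made for illustration. It also assumes the Tesseract binary, its Amharic language data (`amh`), and Poppler are installed. The renderer and recognizer are injectable so the flow can be exercised without those dependencies.

```python
from typing import Any, Callable, Iterable, List, Optional


def _render_with_pdf2image(pdf_path: str) -> Iterable[Any]:
    # Assumption: pdf2image (backed by Poppler) renders each page to a PIL image.
    # Imported lazily so this module loads even where pdf2image is missing.
    from pdf2image import convert_from_path
    return convert_from_path(pdf_path)


def _recognize_with_tesseract(image: Any) -> str:
    # Assumption: pytesseract plus the Tesseract 'amh' (Amharic) language data.
    import pytesseract
    return pytesseract.image_to_string(image, lang="amh")


class AmharicOCRSketch:
    """Hypothetical illustration of the render -> grayscale -> OCR flow."""

    def __init__(
        self,
        render_pages: Optional[Callable[[str], Iterable[Any]]] = None,
        recognize: Optional[Callable[[Any], str]] = None,
    ) -> None:
        self.render_pages = render_pages or _render_with_pdf2image
        self.recognize = recognize or _recognize_with_tesseract

    def extract_pages(self, pdf_path: str) -> List[str]:
        """Return one recognized text string per PDF page."""
        texts = []
        for page in self.render_pages(pdf_path):
            gray = page.convert("L")  # PIL mode "L" is 8-bit grayscale
            texts.append(self.recognize(gray))
        return texts

    def extract_text(self, pdf_path: str) -> str:
        """Return the text of all pages joined by newlines."""
        return "\n".join(self.extract_pages(pdf_path))
```

With the dependencies installed, `AmharicOCRSketch().extract_text("document.pdf")` would return the recognized text of a file at that (placeholder) path. Recognition quality will depend on the scan resolution and on the Tesseract language data in use.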