subha-v / StorybookImages

Translates storybooks from English to low-resource languages

Pytesseract Issues #8

Open subha-v opened 2 years ago

subha-v commented 2 years ago

One issue we're having is that Python reports the pytesseract module as not installed even after running pip install pytesseract. This breaks the OCR step. Please try to reproduce it by running TextExtraction.py and checking where it reports that pytesseract is not installed.
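One possible cause of this symptom is that pip installed the package into a different Python environment than the one that runs the script. That is an assumption, not something confirmed in this thread. A minimal sketch for checking it, where describe_module is a hypothetical helper name:

```python
import importlib.util
import sys


def describe_module(name):
    """Return (interpreter_path, module_location_or_None) for a top-level module."""
    spec = importlib.util.find_spec(name)
    location = spec.origin if spec is not None else None
    return sys.executable, location


if __name__ == "__main__":
    interpreter, location = describe_module("pytesseract")
    print("Interpreter:", interpreter)
    if location is None:
        # Install with the same interpreter that runs the script.
        print("pytesseract is not importable here; try:")
        print(f"  {interpreter} -m pip install pytesseract")
    else:
        print("pytesseract found at:", location)
```

If the interpreter printed here differs from the one that pip belongs to, installing with the printed interpreter's own `-m pip` may resolve the mismatch.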

leonardomarcao commented 2 years ago

@subha-v to fix this, you only need to do the following:

For Linux only:

sudo apt-get install python3-pil tesseract-ocr libtesseract-dev tesseract-ocr-eng tesseract-ocr-script-latn

For Windows only:

  1. Download Tesseract from https://github.com/UB-Mannheim/tesseract/wiki and install it.
  2. Add the Tesseract install path to your system PATH environment variable.
  3. Add this line to every Python script that uses pytesseract: pytesseract.pytesseract.tesseract_cmd = 'C:/OCR/Tesseract-OCR/tesseract.exe' # your path may be different
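Steps 2 and 3 can be combined so the script does not depend on a single hard-coded path. The following is only a sketch: find_tesseract and configure_pytesseract are hypothetical helper names, and the fallback path is the example path from step 3, which may not match your install.

```python
import shutil
from pathlib import Path


def find_tesseract(fallback_paths=()):
    """Return a path to the tesseract executable, or None if not found."""
    # Prefer whatever is on PATH (step 2).
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise try explicitly listed install locations (step 3).
    for candidate in fallback_paths:
        if Path(candidate).is_file():
            return str(candidate)
    return None


def configure_pytesseract(fallback_paths=()):
    """Point pytesseract at the located executable and return its path."""
    path = find_tesseract(fallback_paths)
    if path is None:
        raise FileNotFoundError("tesseract executable not found")
    import pytesseract  # requires: pip install pytesseract

    pytesseract.pytesseract.tesseract_cmd = path
    return path
```

Calling configure_pytesseract(["C:/OCR/Tesseract-OCR/tesseract.exe"]) once at the top of the script would then replace the hard-coded assignment.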

After the steps above, I get the following result:

image

PS: I edited line 56 of my local copy of the code because I was getting a ValueError: too many values to unpack exception.

Let me know if this works for you.

subha-v commented 2 years ago

Hi Leonardo, thank you. I'm currently using a Mac, so I'm not sure whether these exact instructions will help. Do you have any advice for that?

leonardomarcao commented 2 years ago

Have you installed Tesseract on your Mac? You can do so with the following command:

brew install tesseract

If you have and it still doesn't work, I suggest we set up a Docker image to resolve this problem.