Pytesseract stuff should not be inside imagetranslation.py file

Currently we process the image using pytesseract right inside the imagetranslation.py file.

    pytesseract_config = r"--oem 3 --psm 5 -l jpn_vert"
    ...
        current_text = ""
        if is_lang_vertical_lang(src_lang):
            current_text = pytesseract.image_to_string(
                cropped_section, config=pytesseract_config
            )
        current_text = remove_trailing_whitespace(current_text).replace(
            " ", ""
        )

        if current_text == "":
            continue
    ...

However, this has the following issues that need to be fixed:

If the source language is not Japanese, pytesseract_config is not valid, because it hard-codes -l jpn_vert.
If the source language is not a vertical language, we never extract any text from the section: current_text stays empty and the section is skipped.

Furthermore, the imagetranslation.py file is too big, since it contains logic that could be split out into other modules. Therefore, the pytesseract handling should be moved into a separate file called textextraction.py inside ./src/modules. The file should implement the main method:

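    import numpy as np
    import pytesseract

    def extract_text(image: np.ndarray, src_lang: str) -> str:
        """Extract the text in image with a Tesseract config built for src_lang."""
        # generate_config still has to be written in this module;
        # remove_trailing_whitespace is the helper already used in imagetranslation.py.
        pytesseract_config, success = generate_config(src_lang)

        content = ""
        if success:
            content = pytesseract.image_to_string(image, config=pytesseract_config)
        return remove_trailing_whitespace(content).replace(" ", "")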

The imagetranslation.py file would then call it like this:

    ...
    from src.modules.textextraction import extract_text
    ...

    def translate(...):
        ...
            current_text = extract_text(cropped_section, src_lang)
            if current_text == "":
                continue
        ...
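
The proposal above leaves generate_config unspecified. Below is a minimal sketch of one way it could look, not a definitive implementation. It assumes src_lang is a short code such as "ja" or "en"; that convention, the _LANG_TABLE name, and the choice of which languages to treat as vertical are all hypothetical and should be adapted to whatever is_lang_vertical_lang already expects. The Tesseract language names (jpn_vert, chi_sim_vert, kor, eng) are standard traineddata names but must be installed locally, and --psm 5 / --psm 6 are Tesseract's modes for a single uniform block of vertical and horizontal text respectively.

```python
from typing import Dict, Tuple

# Hypothetical table: source-language code -> (Tesseract language, page segmentation mode).
# Which languages are read vertically is an assumption made for illustration only.
_LANG_TABLE: Dict[str, Tuple[str, int]] = {
    "ja": ("jpn_vert", 5),      # vertical block of text
    "zh": ("chi_sim_vert", 5),  # vertical block of text
    "ko": ("kor", 6),           # horizontal block of text
    "en": ("eng", 6),           # horizontal block of text
}


def generate_config(src_lang: str) -> Tuple[str, bool]:
    """Return (pytesseract config string, success flag) for src_lang.

    success is False when the language is not in the table, in which case
    the caller should skip OCR instead of passing an invalid config.
    """
    entry = _LANG_TABLE.get(src_lang.lower())
    if entry is None:
        return "", False
    tess_lang, psm = entry
    return f"--oem 3 --psm {psm} -l {tess_lang}", True
```

With a table like this, a non-Japanese source language gets its own valid config, and horizontal languages are no longer skipped; an unsupported language makes extract_text return an empty string, so the existing `if current_text == "": continue` check keeps working unchanged.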

maeriil / Aoriil

Pytesseract stuff should not be inside imagetranslation.py file #18