maeriil / Aoriil

Image translator designed for manga but can be extended to any websites in general.
MIT License

Pytesseract stuff should not be inside imagetranslation.py file #18

Open maeriil opened 1 year ago

maeriil commented 1 year ago

Currently we process the image using pytesseract right inside the imagetranslation.py file.

    pytesseract_config = r"--oem 3 --psm 5 -l jpn_vert"
    ...
        current_text = ""
        if is_lang_vertical_lang(src_lang):
            current_text = pytesseract.image_to_string(
                cropped_section, config=pytesseract_config
            )
        current_text = remove_trailing_whitespace(current_text).replace(
            " ", ""
        )

        if current_text == "":
            continue
    ...

However, this has the following issues that need to be fixed:

Furthermore, the imagetranslation.py file is too big because it contains logic that could be split into other modules. Therefore, the pytesseract handling should be moved into a separate file called textextraction.py inside ./src/modules. The file should implement the main function:

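    import numpy as np
    import pytesseract

    def extract_text(image: np.ndarray, src_lang: str) -> str:
        """Run OCR on the image and return the text with all spaces removed.

        Returns an empty string if no config can be generated for src_lang.
        """
        # generate_config is a new helper to be added to this module
        pytesseract_config, success = generate_config(src_lang)

        content = ""
        if success:
            content = pytesseract.image_to_string(image, config=pytesseract_config)
        return remove_trailing_whitespace(content).replace(" ", "")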

so that the imagetranslation.py file can then call it like this:

    ...
    from src.modules.textextraction import extract_text
    ...

    def translate(...):
        ...
            current_text = extract_text(cropped_section, src_lang)
            if current_text == "":
                continue
        ...
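
The `generate_config` helper that `extract_text` relies on is not defined anywhere in this issue. As a rough sketch of what it might look like: it could map `src_lang` to a Tesseract language pack and page segmentation mode, and report whether the language is supported. The `src_lang` keys ("ja", "en", and so on) and the two lookup tables are assumptions made for illustration, since the issue does not say which values `src_lang` takes. The Tesseract options themselves (`--oem 3`, `--psm 5` for a vertical text block, `--psm 6` for a uniform text block) are standard ones.

```python
from typing import Tuple

# Hypothetical lookup tables: the keys are assumed src_lang values,
# the values are Tesseract traineddata names.
VERTICAL_LANGS = {"ja": "jpn_vert", "zh": "chi_sim_vert"}
HORIZONTAL_LANGS = {"en": "eng", "ko": "kor"}


def generate_config(src_lang: str) -> Tuple[str, bool]:
    """Return (pytesseract config string, success flag) for src_lang.

    Sketch only: success is False when src_lang is in neither table,
    in which case the config string is empty.
    """
    if src_lang in VERTICAL_LANGS:
        # --psm 5: treat the image as a single block of vertical text
        return f"--oem 3 --psm 5 -l {VERTICAL_LANGS[src_lang]}", True
    if src_lang in HORIZONTAL_LANGS:
        # --psm 6: treat the image as a single uniform block of text
        return f"--oem 3 --psm 6 -l {HORIZONTAL_LANGS[src_lang]}", True
    return "", False
```

With this shape, the hardcoded `jpn_vert` config disappears from imagetranslation.py, and adding a language would only mean adding an entry to one of the tables.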