tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Feature Request: ALTO output - add support for LANG attribute in TextBlock/TextLine elements #4046

Open filak opened 1 year ago

filak commented 1 year ago

Your Feature Request

It might be relatively simple to do this by following what hocrrenderer.cpp already does: https://github.com/tesseract-ocr/tesseract/blob/5f297dc0b8b500d57b7c073f4457e74ee537819f/src/api/hocrrenderer.cpp#L243

paragraph_lang = res_it->WordRecognitionLanguage();
if (paragraph_lang) {
  hocr_str << " lang='" << paragraph_lang << "'";
}

https://github.com/tesseract-ocr/tesseract/blob/5f297dc0b8b500d57b7c073f4457e74ee537819f/src/api/hocrrenderer.cpp#L302

const char *lang = res_it->WordRecognitionLanguage();
if (lang && (!paragraph_lang || strcmp(lang, paragraph_lang))) {
  hocr_str << " lang='" << lang << "'";
}

The same logic could be adapted for altorenderer.cpp:

https://github.com/tesseract-ocr/tesseract/blob/424b17f997363670d187f42c43408c472fe55053/src/api/altorenderer.cpp#L215

i.e.:

    if (res_it->IsAtBeginningOf(RIL_PARA)) {
      alto_str << "\t\t\t\t\t<TextBlock ID=\"block_" << tcnt << "\"";
      AddBoxToAlto(res_it, RIL_PARA, alto_str);
      paragraph_lang = res_it->WordRecognitionLanguage();
      if (paragraph_lang) {
        alto_str << " LANG='" << paragraph_lang << "'";
      }
      alto_str << "\n";
    }

    if (res_it->IsAtBeginningOf(RIL_TEXTLINE)) {
      alto_str << "\t\t\t\t\t\t<TextLine ID=\"line_" << lcnt << "\"";
      AddBoxToAlto(res_it, RIL_TEXTLINE, alto_str);
      const char *lang = res_it->WordRecognitionLanguage();
      if (lang && (!paragraph_lang || strcmp(lang, paragraph_lang))) {
        alto_str << " LANG='" << lang << "'";
      }
      alto_str << "\n";
    }

The language codes should be converted from Tesseract model names to standard 2-letter codes.

A mapping structure needs to be created (I have done such a mapping before, codes_lookup.xml, but it would definitely need updating), which could then be used in a function, e.g.:

 alto_str << " LANG='" << GetLangCodeForAlto(lang) << "'";

I can create the mapping file, but I do not feel competent to do the coding.
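
A minimal sketch of what such a lookup could look like, assuming a small hard-coded table. `GetLangCodeForAlto` is the hypothetical name used above, not an existing Tesseract function, and the table is only an illustrative subset rather than the contents of codes_lookup.xml. Returning an empty string for unknown names lets the caller skip the LANG attribute instead of writing a wrong value.

```cpp
#include <string>
#include <string_view>
#include <unordered_map>

// Hypothetical helper (not part of Tesseract): maps a Tesseract model name
// to a two-letter ISO 639-1 code. Returns an empty string when no mapping
// exists, so the caller can omit the LANG attribute entirely.
// The caller must pass a non-null name.
std::string GetLangCodeForAlto(std::string_view tesseract_lang) {
  // Illustrative subset only; a real table would have to cover every model.
  static const std::unordered_map<std::string_view, std::string_view> kCodes = {
      {"eng", "en"}, {"deu", "de"}, {"fra", "fr"}, {"ces", "cs"}, {"spa", "es"},
  };
  const auto it = kCodes.find(tesseract_lang);
  return it == kCodes.end() ? std::string() : std::string(it->second);
}
```

With this shape, a script model such as Latin or a self-trained model simply has no entry, and no LANG would be written for it.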

stweil commented 1 year ago

The ALTO specification says "Attribute to record language of the string. The language should be recorded at the highest level possible." So an implementation must not set LANG for textlines when all lines in a textblock have the same language.
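
One way to read that rule, sketched under the assumption that the languages of all lines in a block have been collected before the TextBlock start tag is written (the current renderer streams its output, so this would need a look-ahead pass). `CommonBlockLang` is a hypothetical helper, not existing Tesseract code.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns the language shared by every line of a block,
// or an empty string when the lines disagree or carry no language.
// A non-empty result would go on the TextBlock only; an empty result means
// LANG has to be written per TextLine (or not at all).
std::string CommonBlockLang(const std::vector<std::string> &line_langs) {
  if (line_langs.empty()) {
    return {};
  }
  const std::string &first = line_langs.front();
  for (const std::string &lang : line_langs) {
    if (lang != first) {
      return {};
    }
  }
  return first;
}
```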

And there is another problem. Strictly speaking, Tesseract does not detect the language of a text. It uses models for the recognition. Some of those models include a dictionary for a certain language and are named using 3-letter ISO codes. But even if the text was recognized with eng.traineddata, that does not always mean that the recognized text is English.

How would we handle a typical case where a self-trained model without a dictionary or a script model like Latin.traineddata was used?

Would LANG be set as expected when Tesseract was called with more than one language model?

filak commented 1 year ago

My point is that Tesseract outputs language info in hOCR, but in ALTO there is none.

There is some conditional logic: if the language of a TextLine matches paragraph_lang, no LANG is output for the TextLine. Is that sufficient to satisfy the "highest level possible" requirement?

Automatic mapping seems like overkill. What if it were left to the user to decide which value goes into the LANG attribute(s), via some optional parameter?

e.g.:

 tesseract input.tiff output -l eng --altolang en
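
For calls with more than one model, one conceivable extension is to pair the `+`-separated model list from `-l` with an equally long `+`-separated list of user-supplied values. This is only a sketch: `--altolang` is the parameter proposed here and does not exist in Tesseract, the `en+de` list form is an additional assumption, and `PairAltoLangs` is a hypothetical helper.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Splits "eng+deu" into {"eng", "deu"}.
static std::vector<std::string> SplitPlus(const std::string &s) {
  std::vector<std::string> parts;
  std::stringstream ss(s);
  std::string item;
  while (std::getline(ss, item, '+')) {
    parts.push_back(item);
  }
  return parts;
}

// Hypothetical helper: pairs each model name with the user-supplied LANG
// value at the same position. Returns an empty map when the two lists have
// different lengths, so a mismatch never produces a wrong attribute.
std::map<std::string, std::string> PairAltoLangs(const std::string &models,
                                                 const std::string &alto_langs) {
  const std::vector<std::string> m = SplitPlus(models);
  const std::vector<std::string> a = SplitPlus(alto_langs);
  std::map<std::string, std::string> out;
  if (m.size() != a.size()) {
    return out;
  }
  for (std::size_t i = 0; i < m.size(); ++i) {
    out[m[i]] = a[i];
  }
  return out;
}
```

The renderer could then look up the name returned by WordRecognitionLanguage() in this map and write LANG only when an entry exists.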