jwilk-archive / ocrodjvu

OCR for DjVu
GNU General Public License v2.0

please add 'tesseract: ' prefix to Tesseract's stderr #10

Closed jwilk closed 9 years ago

jwilk commented 10 years ago

Issue reported by @jsbien:

ocrodjvu -e tesseract -l deu-frak+pol --save-raw-ocr=hocr -o 0184ocr.djvu iLinde1FR11p0184.djvu

Processing 'iLinde1FR11p0184.djvu':

...

Tesseract Open Source OCR Engine v3.03 with Leptonica

Using default language params

The message suggests that the language parameters deu-frak and pol have been ignored, but the hOCR output contains the lang tags, applied more or less correctly.
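One way to check which languages actually ended up in the saved hOCR is to count the lang attributes it contains. The following is only a rough sketch: it assumes the language shows up as a plain `lang` attribute on hOCR elements, the `LangCounter` and `count_langs` names are made up for this example, and the sample fragment is invented rather than taken from the file above.

```python
from collections import Counter
from html.parser import HTMLParser


class LangCounter(HTMLParser):
    """Count the values of `lang` attributes seen on any element."""

    def __init__(self):
        super().__init__()
        self.langs = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names are lower-cased.
        for name, value in attrs:
            if name == "lang" and value:
                self.langs[value] += 1


def count_langs(hocr_text):
    """Return a Counter mapping each lang value to its number of occurrences."""
    parser = LangCounter()
    parser.feed(hocr_text)
    parser.close()
    return parser.langs


# Invented hOCR fragment, for illustration only (not real Tesseract output).
SAMPLE = """
<div class='ocr_page'>
 <p class='ocr_par' lang='deu-frak'><span class='ocrx_word'>Haus</span></p>
 <p class='ocr_par' lang='pol'><span class='ocrx_word'>dom</span></p>
 <p class='ocr_par' lang='pol'><span class='ocrx_word'>kot</span></p>
</div>
"""

print(dict(count_langs(SAMPLE)))
```

If both deu-frak and pol appear in the counts for the real file, the languages were not ignored, whatever the "Using default language params" message seems to imply.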

The file used for the test is saved from

http://eprints.wbl.klf.uw.edu.pl/44/1/iLinde1FR11.djvu?djvuopts=&page=184&zoom=width&showposition=0.5,0.2

This is ocrodjvu 0.7.17 on Debian jessie.

Regards

JSB

jwilk commented 10 years ago

For reference, this is the code that emits this message:

#!c++
  if (!sub_langs_.empty()) {
    // In multilingual mode word ratings have to be directly comparable,
    // so use the same language model weights for all languages:
    // use the primary language's params model if
    // tessedit_use_primary_params_model is set,
    // otherwise use default language model weights.
    if (tessedit_use_primary_params_model) {
      for (int s = 0; s < sub_langs_.size(); ++s) {
        sub_langs_[s]->language_model_->getParamsModel().Copy(
            this->language_model_->getParamsModel());
      }
      tprintf("Using params model of the primary language\n");
      if (tessdata_manager_debug_level)  {
        this->language_model_->getParamsModel().Print();
      }
    } else {
      this->language_model_->getParamsModel().Clear();
      for (int s = 0; s < sub_langs_.size(); ++s) {
        sub_langs_[s]->language_model_->getParamsModel().Clear();
      }
      tprintf("Using default language params\n");
    }
  }

(copied from https://sources.debian.net/src/tesseract/3.03.03-1/ccmain/tessedit.cpp?hl=358#L338)

I'm not sure what exactly the message means.

jwilk commented 10 years ago

Comment submitted by @jsbien:

I can ask on the tesseract list.

jwilk commented 10 years ago

Comment submitted by @jsbien:

https://groups.google.com/forum/#!topic/tesseract-ocr/Fmz9cPPqb6k

jwilk commented 9 years ago

I'm not sure there's much ocrodjvu can do here. I'm leaning towards closing this bug without any action.

jwilk commented 9 years ago

Comment submitted by @jsbien:

I understand there is no easy way to mark this message as coming from tesseract, so perhaps just mention it in the documentation?

jwilk commented 9 years ago

Actually I could easily add the tesseract: prefix to everything Tesseract prints on stderr.

Would that satisfy you?
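For illustration, prefixing a child process's stderr could look roughly like the sketch below. This is not the code from ocrodjvu; it is a minimal Python 3 example, and the function names `prefix_lines` and `run_with_prefixed_stderr` are made up here. It assumes only stderr is captured, so the child's stdout is inherited unchanged.

```python
import subprocess
import sys


def prefix_lines(stream, prefix, out):
    """Copy text lines from `stream` to `out`, prepending `prefix` to each."""
    for line in stream:
        if not line.endswith("\n"):
            line += "\n"  # terminate a final partial line
        out.write(f"{prefix}{line}")
    out.flush()


def run_with_prefixed_stderr(argv, prefix, out=None):
    """Run `argv`, relaying its stderr line by line with `prefix` added.

    Returns the child's exit code. Only stderr is piped, so there is
    no second pipe that could fill up and block the child.
    """
    if out is None:
        out = sys.stderr
    proc = subprocess.Popen(
        argv,
        stderr=subprocess.PIPE,
        text=True,          # decode stderr to str
        errors="replace",   # do not fail on undecodable bytes
    )
    try:
        prefix_lines(proc.stderr, prefix, out)
    finally:
        proc.stderr.close()
    return proc.wait()
```

With something like this, a call such as `run_with_prefixed_stderr(argv, "tesseract: ")`, where `argv` is the Tesseract command line, would turn the line in question into "tesseract: Using default language params", making its origin obvious. Capturing stdout through a pipe as well would need extra care (for example a reader thread) to avoid blocking.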

jwilk commented 9 years ago

Comment submitted by @jsbien:

Yes.

jwilk commented 9 years ago

Implemented in 71daffaa2fab018d327fa9aaefbf9a1a4149a9f9.

jwilk commented 9 years ago

Fixed in ocrodjvu 0.8.