tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Specifying tessedit_char_whitelist causes the whole word to disappear. #3485

Open TrueWodzu opened 3 years ago

TrueWodzu commented 3 years ago

Current Behavior:

Normally, when I run Tesseract on my image without specifying tessedit_char_whitelist, I get the result "389." (without the double quotes). I wanted to remove the dot, because I am only interested in digits and there is no dot in my image. So I specified a whitelist as follows:

    if (!tesseract.SetVariable("tessedit_char_whitelist", "0123456789"))
      throw "Unable to set tessedit_char_whitelist.";

After this change, Tesseract returns an empty string.

Expected Behavior:

I would expect to get a string that contains only whitelisted characters; in my case that would be "389".
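For comparison, the expected result can be approximated by filtering the unrestricted output after recognition. The following is a minimal sketch, not a Tesseract feature: KeepWhitelisted is a hypothetical helper name, and it assumes the whitelist contains only single-byte (ASCII) characters. Unlike a real whitelist, it does not steer the recognizer toward allowed characters; it only drops the rest.

```cpp
#include <string>

// Hypothetical helper (not part of the Tesseract API).
// Returns the bytes of `text` that also occur in `whitelist`, in order.
// It works per byte, so it is only suitable for ASCII whitelists.
std::string KeepWhitelisted(const std::string& text,
                            const std::string& whitelist) {
  std::string kept;
  for (char c : text) {
    if (whitelist.find(c) != std::string::npos) kept.push_back(c);
  }
  return kept;
}
```

Applied to the unrestricted result from this report, KeepWhitelisted("389.", "0123456789") yields "389".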

BTW: I think the libtesseract.so name is wrong for version 4.1.1: it is currently libtesseract.so.4.0.1, but it should be libtesseract.so.4.1.1.

amitdo commented 2 years ago

@bertsky, maybe you can help here?

bertsky commented 2 years ago

I can try. This comes up again and again. Unfortunately, whitelisting (and also pattern matching) was not given much thought in the LSTM implementation. (In fact, it did not work at all in 4.0 – only for legacy models.) The CTC decoder beam is too narrow, so usually not enough alternative hypotheses survive. You should be able to get something useful by setting lstm_choice_mode=2 and lstm_choice_iterations=5 (or larger) – but IIRC this will only work on traineddata with dictionaries (like the stock models, but not on tesstrain models).

amitdo commented 2 years ago

> This comes up again and again

True.

If it does not work well, maybe we should disable this feature for LSTM?

bertsky commented 2 years ago

> If it does not work well, maybe we should disable this feature for LSTM?

I'd recommend against that, though. There might still be workable setups, like mixing LSTMs and non-LSTMs...