tesseract-ocr / tesstrain

Train Tesseract LSTM with make

use norm_mode 1 as default #254

Open bertsky opened 3 years ago

bertsky commented 3 years ago

Not sure if this is related to #53: why does the current default set NORM_MODE to 2 for non-Indic, non-RTL languages? Shouldn't it be 1?

Also, the decision tree looks quite different from the corresponding one in tesseract/src/training/language-specific.sh. Does anyone know how that came about?

bertsky commented 3 years ago

Plus (just to be sure): am I correct in assuming that under mode 2, combining characters get recoded as extra symbols, whereas under mode 1 they are merged with the base character?
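
If that assumption holds, the distinction would be analogous to Unicode NFC vs. NFD composition. A minimal Python illustration of merged vs. split combining characters (this is only an analogy, not Tesseract's actual recoder code):

```python
import unicodedata

s = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

merged = unicodedata.normalize("NFC", s)  # one code point: 'é' (U+00E9)
split = unicodedata.normalize("NFD", s)   # two code points: 'e' + U+0301

print(len(merged), [hex(ord(c)) for c in merged])  # 1 ['0xe9']
print(len(split), [hex(ord(c)) for c in split])    # 2 ['0x65', '0x301']
```

The question is whether the trainer should see one symbol (as in the NFC line) or two (as in the NFD line) for such sequences.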

bertsky commented 3 years ago

The decision seems to derive from here: c90cd3f27acbacc8d30db1b44d1c017aecc7bf20

@wrznr could you please elaborate on the kind of feedback you gave (or link to it)?

bertsky commented 3 years ago

> @wrznr could you please elaborate on the kind of feedback you gave (or link to it)?

Answer (received on another channel): here – a simple question.

IMHO the response should have been to rethink the old default in tesstrain, not immediately revert to it.

As stated above, Tesseract's own default used to be 1 in src/training/language-specific.sh. But that file (along with all other shell scripts) has very recently been removed from tesseract by @stweil. It now resides here:

https://github.com/tesseract-ocr/tesstrain/blob/1d8238684fe81e600431e5bdfe7dd24fbeaaf9f9/src/training/language_specific.py#L1373

So, again, why not 1 by default, and is my interpretation regarding combining characters correct?
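
For readers who haven't opened the linked file: the decision in question has roughly the following shape. This is a hypothetical Python sketch with made-up language lists and values, not the actual language_specific.py code; the point of contention is what the final fallback returns.

```python
# Illustrative sketch only -- language sets and the RTL value are
# assumptions, not copied from language_specific.py.
INDIC_LANGS = {"hin", "ben", "tam", "tel"}  # illustrative subset
RTL_LANGS = {"ara", "heb", "fas"}           # illustrative subset

def default_norm_mode(lang: str) -> int:
    if lang in INDIC_LANGS:
        return 2  # keep combining marks as separate symbols
    if lang in RTL_LANGS:
        return 3  # assumed value for RTL handling
    return 2      # current tesstrain fallback; this issue argues for 1
```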

bertsky commented 3 years ago

@Shreeshrii it seems the original deviation regarding the --norm_mode default came from changes you proposed (introducing finetuning here). Could you please elaborate on your choice?

wrznr commented 3 years ago

@bertsky It looks like the initial PR by @kba and myself was not prepared carefully enough. We took norm_mode for granted where we should have inspected its semantics in greater detail. When @Shreeshrii later tried to correct this with an option for setting norm_mode in a sensible way, some misunderstanding occurred, leading to the current suboptimal setting. It would be great if we could fix this now that you have had the chance to dive deeper into the consequences of this parameter setting.

bertsky commented 3 years ago

Sure, I'll prepare a PR, but will first do some (training and) testing.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

giri-kum commented 2 years ago

> Sure, I'll prepare a PR, but will first do some (training and) testing.

@bertsky Did you find anything different in testing? I see that NORM_MODE is still 2 for non-Indic, non-RTL languages a year after this conversation.

bertsky commented 2 years ago

@giri-kum sorry, I don't remember anymore. I think I had some results, but they were inconclusive due to other problems. An isolated experiment should not be difficult to set up (run 10-20 trainings each with mode 1 and mode 2, evaluate the validation set with an external, true CER measurement, compare the averages, and perhaps repeat with different GT sets or languages), but I don't have the time right now.
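
For anyone picking this up, here is a minimal sketch of the evaluation half of that experiment. The file layout and the `.gt.txt` / `.pred.txt` naming are assumptions for illustration; the training runs themselves would be launched separately per mode.

```python
# Sketch under assumed layout: eval/mode1/run01/, eval/mode2/run01/, ...
# each containing paired *.gt.txt (ground truth) and *.pred.txt files.
from pathlib import Path
from statistics import mean

def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(gt: str, pred: str) -> float:
    """Character error rate of a prediction against its ground truth."""
    return levenshtein(gt, pred) / max(len(gt), 1)

def mean_cer(run_dir: Path) -> float:
    """Average CER over the *.gt.txt / *.pred.txt pairs in one run."""
    scores = [cer(gt.read_text().strip(),
                  gt.with_name(gt.name.replace(".gt.txt", ".pred.txt"))
                    .read_text().strip())
              for gt in run_dir.glob("*.gt.txt")]
    return mean(scores)

# Compare averages over repeated runs per norm mode.
for mode in (1, 2):
    runs = sorted(Path(f"eval/mode{mode}").glob("run*"))
    print(f"norm_mode {mode}: mean CER over {len(runs)} runs =",
          f"{mean(mean_cer(r) for r in runs):.4f}")
```

Computing CER externally like this, rather than relying on the error rates reported during training, is what "external, true CER measurement" refers to above.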