manisandro / gImageReader

A Gtk/Qt front-end to tesseract-ocr.
GNU General Public License v3.0

Support for complex names of traineddata (e.g. ita_old.traineddata) #537

AvtechScientific closed this issue 3 years ago

AvtechScientific commented 3 years ago

Some of the traineddata files have complex names, e.g.:

Right now gImageReader does not recognize, e.g., srp_latn.traineddata as Serbian in the Language/traineddata selection drop-down menu. Instead of being displayed as [srp] srpski -> (followed by the list of Serbian spell-checking dictionaries), it is displayed undetected as srp_latn. Since this means you can't choose an appropriate spelling dictionary, it is not merely an aesthetic problem...

My suggestion is to parse the first part (up to the dot) of the traineddata file name as follows:

  1. Cut the string in two: (a) everything to the left of the last "_" and (b) the rest. So for srp_latn it would be (a) srp and (b) latn; for chi_tra_vert it would be (a) chi_tra and (b) vert.
  2. Then treat/display (a) as the language code, i.e. ita - Italian, srp - Serbian, chi_tra - Traditional Chinese.
  3. Treat/display (b) as additional info alongside the language code.
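
The three steps above can be sketched roughly as follows. This is only an illustration of the proposed rule, not code from gImageReader; the helper name `splitAtLastUnderscore` is hypothetical.

```cpp
#include <string>
#include <utility>

// Hypothetical helper (not part of gImageReader): splits the base name of a
// traineddata file at the last '_' into (a) language code and (b) extra info.
std::pair<std::string, std::string> splitAtLastUnderscore(const std::string& filename) {
    // Keep only the part before the first dot,
    // e.g. "srp_latn.traineddata" -> "srp_latn".
    const std::string base = filename.substr(0, filename.find('.'));

    const std::string::size_type pos = base.rfind('_');
    if (pos == std::string::npos) {
        // No underscore: the whole name is the language code.
        return {base, ""};
    }
    return {base.substr(0, pos), base.substr(pos + 1)};
}
```

With this rule, `srp_latn.traineddata` yields (`srp`, `latn`) and `chi_tra_vert.traineddata` yields (`chi_tra`, `vert`). Note that a plain `chi_tra.traineddata` would be split into (`chi`, `tra`), so the rule alone cannot tell a two-part language code from a code plus extra info.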

So display the language selection menu as follows:

Or, if the above mapping is too complex, then don't use native language names and just do:

The most important part here is that we can get access to the relevant spell dictionaries...

Thank you!

manisandro commented 3 years ago

Should be pretty easy to fix, mind giving it a go? I'm still pretty short on time.

AvtechScientific commented 3 years ago
  1. Actually, it will probably be better to cut the string after the first 3 letters and put the rest in brackets, e.g. [chi] (tra_vert)...
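
A minimal sketch of this revised rule, assuming every language code is exactly three letters and is separated from the rest by "_"; the function name `formatLabel` is made up for illustration and does not exist in gImageReader:

```cpp
#include <string>

// Hypothetical helper (not part of gImageReader): builds a menu label of the
// form "[chi] (tra_vert)" from a traineddata file name.
// Assumption: the language code is always the first three characters.
std::string formatLabel(const std::string& filename) {
    // Keep only the part before the first dot.
    const std::string base = filename.substr(0, filename.find('.'));

    std::string label = "[" + base.substr(0, 3) + "]";

    // Anything after "xxx_" is shown in parentheses as additional info.
    if (base.size() > 4 && base[3] == '_') {
        label += " (" + base.substr(4) + ")";
    }
    return label;
}
```

Here `chi_tra_vert.traineddata` becomes `[chi] (tra_vert)` and `ita.traineddata` becomes just `[ita]`.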

I think I have a bit of spare time right now, so I can give it a try. To this end, could you please:

  1. Merge the outstanding PR so I don't have to rebase afterwards.
  2. Point me to the relevant place in the code to start from.

Thank you!

AvtechScientific commented 3 years ago

After taking a look at common/LangTables.hh, I understood that the approach is simply to list all known traineddata files rather than to try to analyze their names. So ita_old.traineddata should actually work. I therefore solved the issue by listing the missing filenames.
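
For readers unfamiliar with that style, the list-everything approach can be pictured roughly like this. The sketch does not reproduce the real contents or layout of common/LangTables.hh; the table name, the function name, and the entries are all illustrative assumptions.

```cpp
#include <map>
#include <string>

// Hypothetical lookup table (NOT the actual LangTables.hh): every known
// traineddata prefix is listed explicitly; no name parsing is attempted.
// The entries and display names below are illustrative only.
static const std::map<std::string, std::string> kKnownLanguages = {
    {"ita",      "Italian"},
    {"ita_old",  "Italian (old)"},
    {"srp",      "srpski"},
    {"srp_latn", "srpski (latn)"},
};

// Returns the display name for a known prefix, or the raw prefix when it is
// not listed (which is how an unlisted file ends up shown "undetected").
std::string displayName(const std::string& prefix) {
    const auto it = kKnownLanguages.find(prefix);
    return it != kKnownLanguages.end() ? it->second : prefix;
}
```

Under this scheme, supporting a new file such as `srp_latn.traineddata` only requires adding one more entry to the table.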