Support for complex names of traineddata (e.g. ita_old.traineddata)

AvtechScientific commented 3 years ago

Some of the traineddata files have complex names, e.g.:

aze_cyrl.traineddata
chi_sim_vert.traineddata
chi_tra_vert.traineddata
ita_old.traineddata
etc.

Right now gImageReader does not recognize, e.g., srp_latn.traineddata as Serbian in the Language/traineddata selection drop-down menu. That is, it is not displayed as [srp] srpski -> (followed by the list of Serbian spell checking dictionaries), but is instead shown undetected as: srp_latn. If you can't choose the appropriate spelling dictionary, it is not merely an aesthetic problem...

My suggestion is to parse the first part of the traineddata file name (up to the dot) as follows:

Cut the string in two: (a) everything to the left of the last "_" and (b) the rest. So for srp_latn it would be (a) srp and (b) latn; for chi_tra_vert it would be (a) chi_tra and (b) vert.
Then treat/display (a) as the language code, i.e. ita - Italian, srp - Serbian, chi_tra - Traditional Chinese.
Treat/display (b) as additional info alongside the language code.
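
The splitting rule above could be sketched roughly as follows. This is only an illustration in Python, not gImageReader code; the function name split_traineddata_name is made up for the example.

```python
def split_traineddata_name(filename):
    """Split a traineddata file name into (a) language code and (b) extra info.

    Hypothetical helper illustrating the "cut at the last underscore" rule;
    it is not part of gImageReader.
    """
    stem = filename.split(".", 1)[0]  # the part before the first dot
    if "_" not in stem:
        return stem, ""  # simple name: no extra info
    code, extra = stem.rsplit("_", 1)  # cut at the last underscore
    return code, extra
```

Note that under this rule a name such as chi_sim.traineddata would split into chi and sim, so (a) is not guaranteed to be a complete language code in every case.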

So display the language selection menu as follows:

srp_latn.traineddata [srp] (latn) српски -> ...
ita_old.traineddata [ita] (old) italiano -> ...
chi_tra_vert.traineddata [chi_tra] (vert) 漢語 -> ...

Or, if the above mapping is too complex, then don't use native language names and just do:

srp_latn.traineddata [srp] (latn) Serbian -> ...
ita_old.traineddata [ita] (old) Italian -> ...
chi_tra_vert.traineddata [chi_tra] (vert) Chinese -> ...

The most important part here is that we can get access to the relevant spell dictionaries...

Thank you!

manisandro commented 3 years ago

Should be pretty easy to fix, mind giving it a go? I'm still pretty short on time.

AvtechScientific commented 3 years ago

Actually, it would probably be better to cut the string after the first 3 letters and put the rest in brackets, e.g. [chi] (tra_vert)...
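
A rough sketch of this variant, again in Python purely for illustration. The names split_after_prefix and menu_label are hypothetical, and the sketch assumes every language code is exactly three letters long.

```python
def split_after_prefix(filename, prefix_len=3):
    """Cut the name after the first prefix_len letters (assumed 3-letter code)."""
    stem = filename.split(".", 1)[0]  # the part before the first dot
    code = stem[:prefix_len]
    extra = stem[prefix_len:].lstrip("_")  # drop the separating underscore
    return code, extra


def menu_label(filename):
    """Build a label such as '[chi] (tra_vert)'; hypothetical display format."""
    code, extra = split_after_prefix(filename)
    return f"[{code}] ({extra})" if extra else f"[{code}]"
```

With this rule chi_tra_vert.traineddata yields [chi] (tra_vert) and ita_old.traineddata yields [ita] (old), while a plain eng.traineddata stays [eng].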

I think I have a bit of spare time right now, so I can give it a try. To this end, could you please:

Merge the outstanding PR so I don't have to rebase afterwards.
Point me to the relevant place in the code to start from.

Thank you!

AvtechScientific commented 3 years ago

After taking a look at common/LangTables.hh, I understood that the approach is simply to list all known traineddata files rather than to analyze their names. So ita_old.traineddata should actually work. I solved the issue by listing the missing filenames.
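
For readers unfamiliar with the table-based approach, here is a minimal Python sketch of the idea. It does not reproduce the actual contents or structure of common/LangTables.hh; the table name, the entries, and the display names are all invented for the example.

```python
# Hypothetical table: file name stem -> (language code, display name).
# The real table in gImageReader may look quite different.
KNOWN_LANGUAGES = {
    "ita": ("ita", "Italian"),
    "ita_old": ("ita", "Italian (old)"),
    "srp": ("srp", "Serbian"),
    "srp_latn": ("srp", "Serbian (latn)"),
}


def describe(filename):
    """Look the file up in the table; unknown names are returned unchanged."""
    stem = filename.split(".", 1)[0]  # the part before the first dot
    entry = KNOWN_LANGUAGES.get(stem)
    if entry is None:
        return stem  # not listed: shown undetected, as the raw name
    code, name = entry
    return f"[{code}] {name}"
```

In this sketch a file that is not listed, such as aze_cyrl.traineddata, falls through as the raw name aze_cyrl, and adding an entry for it to the table is all that is needed to have it recognized.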

manisandro / gImageReader

Support for complex names of traineddata (e.g. ita_old.traineddata) #537