tesseract-ocr / langdata_lstm

Data used for LSTM model training
Apache License 2.0

Tesseract fails to detect letters Å and å in Finnish language. #31

Open jmokoistinen opened 4 years ago

jmokoistinen commented 4 years ago

I am testing Tesseract on Finnish texts containing the "Swedish o" (å). It does not seem to detect Å and å correctly. I have also tried the combined fin+swe model, but the fin model's version of the text is usually the one selected.

Are the previous training files available somewhere? Probably the training data does not contain enough Å/å cases, or the letter is not included at all, even though it is an official letter of the Finnish alphabet.

stweil commented 4 years ago

See the list of known characters (unicharset). The data for fin in langdata_lstm needs to be fixed. Do you want to send a fix (pull request)?

I am moving the issue to langdata_lstm.

jmokoistinen commented 4 years ago

Yes, what should I do to make it happen? Collect some data and box it with some tool? Where can I get the current data? I cannot see any images here: https://github.com/tesseract-ocr/langdata_lstm/tree/master/fin

I guess training is done with synthetic text generated from those files? How many examples of å and Å should there be? Does anything else need to be modified, or just the training_text, singles_text and desired_characters files? (Are there any rules on how exactly?)

jmokoistinen commented 4 years ago

Also, the letters Q and q seem to be missing from the data. At the very least, all of these should be covered: abcdefghijklmnopqrstuvwxyzåäö ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ 1234567890. How can this be fixed?

I went through all the characters; only Åå and Qq are missing. Is it enough to modify fin.training_text so that it contains some number N of each missing letter? Or do I need to modify something else?
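
The manual check described above can be sketched in a few lines of Python. This is only an illustration, not part of the Tesseract tooling: the function name `missing_characters` and the `min_count` threshold are hypothetical, and the expected alphabet is copied from the comment above. The text is normalized to NFC first, on the assumption that a decomposed "a" plus combining ring should count as å.

```python
import unicodedata
from collections import Counter

# Expected Finnish letters and digits, as listed in the comment above.
EXPECTED = (
    "abcdefghijklmnopqrstuvwxyzåäö"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ"
    "1234567890"
)


def missing_characters(text, expected=EXPECTED, min_count=1):
    """Return the expected characters seen fewer than min_count times.

    Hypothetical helper: the name and the threshold are illustrative,
    not something defined by Tesseract or langdata_lstm.
    """
    # NFC turns "a" + U+030A (combining ring above) into the single "å".
    counts = Counter(unicodedata.normalize("NFC", text))
    # Preserve the order of `expected` so the report is easy to read.
    return [ch for ch in expected if counts[ch] < min_count]


if __name__ == "__main__":
    # Made-up sample; in practice the contents of a training text file
    # would be read and passed in instead.
    sample = "Qatar-opas sisältää kartan Åbon keskustasta."
    print("".join(missing_characters(sample)))
```

Raising `min_count` gives a rough view of letters that are present but rare, which is closer to the "enough Åå cases" question than a plain presence check.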

stweil commented 4 years ago

I'd add all desired characters to desired_characters, ideally sorted with LANG=C.UTF-8 sort. Then we at least have a list of those characters and can try to find training texts which include them sufficiently often.

To fix the problem, we still have to run new training ...
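
The desired_characters suggestion above could be sketched as follows. The sketch assumes the file holds one entry per line, which the mention of sort implies but the thread does not confirm, and that sorting under LANG=C.UTF-8 amounts to plain code point order, which is also how Python compares strings. The helper name `merge_desired_characters` is hypothetical.

```python
def merge_desired_characters(existing_lines, new_chars):
    """Merge new characters into existing entries, sorted by code point.

    Assumption: one entry per line. Duplicates are dropped, and the
    result is ordered by Unicode code point, which is meant to mirror
    `LANG=C.UTF-8 sort`.
    """
    # Strip only line endings so entries keep any other whitespace.
    entries = {line.rstrip("\r\n") for line in existing_lines}
    entries.discard("")
    entries.update(new_chars)
    return sorted(entries)


if __name__ == "__main__":
    # Made-up existing entries; a real run would read the lines of the
    # desired_characters file instead.
    current = ["a\n", "b\n", "ä\n"]
    for entry in merge_desired_characters(current, "ÅåQq"):
        print(entry)
```

As noted in the comment above, this only records which characters are wanted; the model does not change until training is run again.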