Open aslamy opened 5 years ago
From https://github.com/tesseract-ocr/tesseract/issues/2075:
It's also possible to use script/Latin for Swedish. That should contain all characters.
Only symbols included in swe.unicharset will be detected during OCR. If a symbol is missing, it can be added by fine-tuning the model.
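A quick way to see whether a model can output a given symbol at all is to look it up in the unicharset. The sketch below is a minimal illustration, not part of the Tesseract tooling. It assumes the unicharset has already been extracted to a text file (for example with `combine_tessdata -u swe.traineddata swe.`, which for LSTM models should produce a `swe.lstm-unicharset` component). It also assumes the usual layout: the first line holds the entry count, and each following line starts with the symbol followed by a space and its properties. The helper names `parse_unicharset` and `missing_symbols` and the sample data are hypothetical.

```python
def parse_unicharset(text):
    """Return the set of symbols listed in unicharset text.

    Assumption: line 1 is the entry count; every later line begins with
    the symbol itself, followed by a space and its properties.
    """
    symbols = set()
    for line in text.splitlines()[1:]:
        if line.strip():
            # The symbol is everything before the first space.
            symbols.add(line.split(" ", 1)[0])
    return symbols


def missing_symbols(required, symbols):
    """List the required characters that the unicharset does not contain."""
    return [ch for ch in required if ch not in symbols]


# Made-up sample standing in for a real extracted unicharset file.
SAMPLE = (
    "4\n"
    "NULL 0 Common 0\n"
    "a 3 0,255,0,255,0,255,0,255,0,255 Latin 1 0 1 a\n"
    "ö 3 0,255,0,255,0,255,0,255,0,255 Latin 2 0 2 ö\n"
    ". 10 0,255,0,255,0,255,0,255,0,255 Common 3 6 3 .\n"
)

print(missing_symbols("aö@§", parse_unicharset(SAMPLE)))
```

With a real file, read its contents with `open(path, encoding="utf-8").read()` and pass the text to `parse_unicharset`. Any character reported as missing cannot be recognized by that model until it is retrained or fine-tuned.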
Adding symbols to the desired_characters files helps for future trainings, so symbols won't be missed then, but does not change existing models.
The desired_characters file is used for the training done by Google. The tesseract training tools which are available in https://github.com/tesseract-ocr/tesseract do not use it.
@amitdo should I then use https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters ? Is there any easier way? A training GUI for tesseract 4?
> should I then use https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters ?
That's supposed to be the way... but it's not so easy.
> Is there any easier way? A training GUI for tesseract 4?
I don't know.
The current Danish traineddata has the same issue. Danish should really be exactly the same as Swedish except for ö->ø and ä->æ (I see that '@' specifically was added to desired_characters recently, but no new training data has been generated).
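The substitution described above can be sketched in a few lines. This is only an illustration of the idea, not how the langdata files are actually produced. The function name `danish_from_swedish` and the sample character set are hypothetical, and extending the mapping to the uppercase letters (Ö->Ø, Ä->Æ) is an assumption beyond what the comment states.

```python
# Map the Swedish-specific letters to their Danish counterparts.
# The uppercase pairs are an assumption added for completeness.
SWE_TO_DAN = str.maketrans({"ö": "ø", "Ö": "Ø", "ä": "æ", "Ä": "Æ"})


def danish_from_swedish(swedish_chars):
    """Derive a Danish character set from a Swedish one by substitution."""
    return {ch.translate(SWE_TO_DAN) for ch in swedish_chars}


# Hypothetical sample; a real list would come from the desired_characters file.
swedish = set("abcåäöÅÄÖ@§")
danish = danish_from_swedish(swedish)
print(sorted(danish))
```

Characters shared by both languages, such as å, @ and §, pass through unchanged, so any symbol added for Swedish carries over to Danish automatically.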
@poizan42, I suggest creating a pull request that adds the missing characters to the list of desired characters.
You can try the script/Latin model which should already support all Danish characters, or you could enhance the existing dan.traineddata, either by fine-tuning (see link above) or by using tesstrain. I prefer tesstrain because I found it easier to use.
@stweil, I have created a PR in #34
I merged that PR now, thanks. Please note that we cannot expect new training done by Google, so it is up to the Open Source community (= you, me, ...) to use the fixed information and train new models.
The file desired_characters does not contain many important special characters, such as "@". All special characters used in English are also important for Swedish. Legal documents contain the section sign (§). Please add it as well.