tesseract-ocr / langdata_lstm

Data used for LSTM model training
Apache License 2.0

Missing many special characters in desired_characters file (Swedish) #4

Open aslamy opened 5 years ago

aslamy commented 5 years ago

The desired_characters file does not contain many important special characters, such as "@". All special characters used in English are also important for Swedish. Legal documents contain the section sign (§). Please add it as well.

stweil commented 5 years ago

From https://github.com/tesseract-ocr/tesseract/issues/2075:

It's also possible to use script/Latin for Swedish. That should contain all characters.

stweil commented 5 years ago

Only symbols included in swe.unicharset will be detected during OCR. If a symbol is missing, it can be added through fine-tuning.

Adding symbols to the desired_characters files helps future training runs, so the symbols won't be missed then, but it does not change existing models.
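
To see which symbols an existing model cannot output, you can compare the characters you need against the model's unicharset. The following is a minimal sketch, not an official tool. It assumes the usual text layout of a unicharset file (an entry count on the first line, then one entry per line with the symbol as the first space-separated token, and `NULL` standing for the space character). The function names and the sample data are hypothetical, and the property fields in the sample are placeholders that the code never parses.

```python
def read_unicharset(text):
    """Return the set of symbols listed in unicharset file contents.

    Assumption: line 1 holds the entry count, and each following line
    starts with the symbol, followed by a space and its properties.
    """
    lines = text.splitlines()
    count = int(lines[0].strip())
    symbols = set()
    for line in lines[1:1 + count]:
        if not line.strip():
            continue
        token = line.split(" ", 1)[0]
        # Assumption: the NULL entry stands for the space character.
        symbols.add(" " if token == "NULL" else token)
    return symbols


def missing_characters(desired, unicharset_text):
    """Return the desired characters that the unicharset does not list."""
    return sorted(set(desired) - read_unicharset(unicharset_text))


# Made-up sample with placeholder property fields, for illustration only.
SAMPLE_UNICHARSET = (
    "4\n"
    "NULL 0 placeholder\n"
    "a 3 placeholder\n"
    "ö 3 placeholder\n"
    ". 10 placeholder\n"
)

if __name__ == "__main__":
    print(missing_characters("aö@§", SAMPLE_UNICHARSET))
```

For a real check, pass in the text of the unicharset extracted from swe.traineddata in place of the sample. Any character reported as missing would have to be added by fine-tuning, as described above.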

amitdo commented 5 years ago

The desired_characters file is used for the training done by Google. The tesseract training tools which are available in https://github.com/tesseract-ocr/tesseract do not use it.

Kalle12345 commented 5 years ago

@amitdo should I then use https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters ? Is there any easier way? A training GUI for tesseract 4?

amitdo commented 5 years ago

should I then use https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters ?

That is supposed to be the way, but it's not so easy.

Is there any easier way? A training GUI for tesseract 4?

I don't know.

poizan42 commented 4 years ago

The current Danish traineddata has the same issue. Danish should really be exactly the same as Swedish except for ö->ø and ä->æ. (I see that '@' specifically was added to desired_characters recently, but no new training data has been generated.)

stweil commented 4 years ago

@poizan42, I suggest creating a pull request that adds the missing characters to the list of desired characters.

You can try the script/Latin model, which should already support all Danish characters, or you could enhance the existing dan.traineddata, either by fine-tuning (see link above) or by using tesstrain. I prefer tesstrain because I found it easier to use.
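
As a rough sketch of the first option, the script/Latin model can be selected with the `-l` option of the `tesseract` command line program. The helper names and the image file name below are hypothetical. The sketch assumes that `tesseract` is on the PATH and that the tessdata directory contains `script/Latin.traineddata`.

```python
import shutil
import subprocess


def build_ocr_command(image_path, lang="script/Latin", tessdata_dir=None):
    """Build the tesseract command line; "stdout" sends the text to stdout."""
    cmd = ["tesseract", image_path, "stdout", "-l", lang]
    if tessdata_dir is not None:
        # Optional: point tesseract at a specific tessdata directory.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def ocr(image_path, **kwargs):
    """Run tesseract on an image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_ocr_command(image_path, **kwargs),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `ocr("scan.png")` would then return the recognized text, which you can compare with the output of `ocr("scan.png", lang="dan")` to see whether the missing symbols are recognized.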

poizan42 commented 4 years ago

@stweil, I have created a PR in #34

stweil commented 4 years ago

I merged that PR now, thanks. Please note that we cannot expect new training done by Google, so it is up to the Open Source community (= you, me, ...) to use the fixed information and train new models.