tesseract-ocr / langdata_lstm

Data used for LSTM model training
Apache License 2.0

Danish traineddata file doesn't include the "@" character #29

Open Furtifk opened 4 years ago

Furtifk commented 4 years ago

Environment

Current Behavior: Danish traineddata file doesn't include the "@" character

Expected Behavior: Danish traineddata file should include the "@" character

Suggested Fix: Danish traineddata file should include the "@" character

File to run OCR on: Screenshot_572

To reproduce the problem, I have a zip file I can send. It runs a very basic test that displays the results of the eng and dan traineddata files side by side. Whoever looks into this issue, please contact me to receive it.

This is quite a pressing issue, so any response is appreciated.

stweil commented 4 years ago

That's a problem of the model (traineddata), not of Tesseract. See dan.unicharset for a list of supported characters.

If you want, you can send a pull request which fixes the list of desired characters.
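
To check which characters a model supports, the unicharset can be read as plain text. The following is a minimal sketch, assuming the usual unicharset layout: the first line holds the number of entries, each following line starts with the character, and the token `NULL` stands for the space character. The sample data and the function names are made up for illustration; they are not the real contents of dan.unicharset.

```python
# Illustrative sample only, NOT the real dan.unicharset.
SAMPLE_UNICHARSET = """4
NULL 0 Common 0
a 3 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 a
b 3 0,255,0,255,0,0,0,0,0,0 Latin 2 0 2 b
ø 3 0,255,0,255,0,0,0,0,0,0 Latin 3 0 3 ø
"""


def read_unichars(text):
    """Return the set of characters listed in a unicharset file."""
    lines = text.splitlines()
    count = int(lines[0].strip())  # first line: number of entries
    chars = set()
    for line in lines[1:1 + count]:
        if not line.strip():
            continue
        token = line.split(" ", 1)[0]  # first field is the character
        chars.add(" " if token == "NULL" else token)
    return chars


def missing_chars(text, wanted):
    """Return the characters from `wanted` that the unicharset lacks."""
    have = read_unichars(text)
    return [c for c in wanted if c not in have]


if __name__ == "__main__":
    print(missing_chars(SAMPLE_UNICHARSET, "a@§"))
```

For a real check, pass the contents of the unicharset file instead of the sample string, read with `encoding="utf-8"`.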

stweil commented 4 years ago

There won't be a fixed dan.traineddata soon. I suggest trying Latin.traineddata for your case.
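
A quick way to compare models is to run the same image through the `tesseract` command line with different `-l` values. This is only a sketch: the image name `page.png` and the helper names are hypothetical, and it assumes `Latin.traineddata` has been placed where Tesseract looks for models (or that `--tessdata-dir` points to it).

```python
import shutil
import subprocess


def build_tesseract_cmd(image_path, languages, tessdata_dir=None):
    """Build a tesseract command that prints the recognized text to stdout."""
    # Several models can be combined with "+", for example "dan+eng".
    cmd = ["tesseract", image_path, "stdout", "-l", "+".join(languages)]
    if tessdata_dir:
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def run_ocr(image_path, languages, tessdata_dir=None):
    """Run tesseract and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    cmd = build_tesseract_cmd(image_path, languages, tessdata_dir)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # "page.png" is a placeholder image name.
    for langs in (["dan"], ["dan", "eng"], ["Latin"]):
        print(" ".join(build_tesseract_cmd("page.png", langs)))
```

Calling `run_ocr` once per model and checking each result for "@" shows which model recognizes the character on a given page.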

Furtifk commented 4 years ago

@stweil Thanks for the response. I will try the Latin traineddata, although the document I need to read does not yield correct results when I use a combination of the eng + dan traineddata files, so I'm not confident this will work. Getting good OCR results for Danish documents seems to be a hassle when not using the Danish dictionary file.

stweil commented 4 years ago

It is possible to enhance the existing dan.traineddata with missing characters by additional training, so you could try to fix it yourself. Here is a description of how this was done for Fraktur. You'll need pairs of line images and text files with a transcription.
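
Before training, it helps to confirm that the pairs are complete and that the missing characters actually occur in the transcriptions. The sketch below assumes the naming convention used by the tesstrain tooling (`name.png` or `name.tif` next to `name.gt.txt`); the directory name `ground-truth` and the function names are hypothetical.

```python
from pathlib import Path

IMAGE_SUFFIXES = (".png", ".tif")
GT_SUFFIX = ".gt.txt"


def find_pairs(gt_dir):
    """Return (image, transcription) pairs found in the directory."""
    pairs = []
    for txt in sorted(Path(gt_dir).glob("*" + GT_SUFFIX)):
        stem = txt.name[:-len(GT_SUFFIX)]
        for suffix in IMAGE_SUFFIXES:
            image = txt.with_name(stem + suffix)
            if image.exists():
                pairs.append((image, txt))
                break  # one image per transcription is enough
    return pairs


def count_lines_with(pairs, char):
    """Count the transcriptions that contain the given character."""
    return sum(char in txt.read_text(encoding="utf-8") for _, txt in pairs)


if __name__ == "__main__":
    pairs = find_pairs("ground-truth")  # placeholder directory name
    for char in "@§":
        print(char, count_lines_with(pairs, char), "of", len(pairs), "lines")
```

If a character appears in none of the transcriptions, additional training cannot teach the model to recognize it.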

Furtifk commented 4 years ago

> It is possible to enhance the existing dan.traineddata with missing characters by additional training, so you could try to fix it yourself. Here is a description of how this was done for Fraktur. You'll need pairs of line images and text files with a transcription.

Thank you for your response. I do not think this is a viable option for me but thanks for your reply and for the information!

poizan42 commented 4 years ago

It lacks '§' as well which is used in every single legal document in existence...

stweil commented 4 years ago

@Furtifk, @poizan42, especially for older Danish texts you could also try one of the models which I trained recently, for example Fraktur_50000000.502_198857.traineddata.

It was trained based on script/Fraktur with lots of historic documents, and in my experience it works well even though I did not add a dictionary. You will therefore get a warning at runtime, but you could add a Danish dictionary if needed.

Furtifk commented 4 years ago

Have there been any improvements recently with the Danish dictionary?

stweil commented 4 years ago

No, and I am afraid there won't be an improvement unless someone works on it.