tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

deu_latf wordfile #383

Open Stond0cyborg opened 6 months ago

Stond0cyborg commented 6 months ago

I tried to recognize some old Fraktur texts with deu_latf, but many words are not recognized correctly, so I extracted the word list from deu_latf. The extracted file does not seem to contain plain words. Example: A-{d}-{cd°s}%- A-{d}-{cd°a}% A-{d}-{c-% A-{d}s§gi

I then extracted the readable version and realized that many more words could be added. I would also like to try to improve the recognition of "ich", "schon", "noch", etc., because there is not much you can do with output like "bat ned)" (for "hat noch") or "bod)" (for "doch").

Is there a README file for this file or another explanation to extend it?

stefan6419846 commented 6 months ago

The corresponding training data is available at https://github.com/tesseract-ocr/langdata_lstm/tree/main/deu_latf

For the basic meaning of the files, see for example https://groups.google.com/g/tesseract-ocr/c/U9mysQuhRpU/m/7aNrZACXBQAJ
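For anyone who wants to experiment with the word list anyway, here is a rough sketch of the round trip using the Tesseract training tools `combine_tessdata`, `dawg2wordlist` and `wordlist2dawg`. The Python helper names and the file names (`deu_latf.lstm-unicharset`, `deu_latf.lstm-word-dawg`, `words.txt`) are assumptions for illustration; check which components your unpacked model actually contains. The functions only build the command lines and merge word lists; they do not run anything. Whether a larger word list actually helps with misreadings like "bat ned)" is not certain.

```python
def unpack_command(traineddata: str, prefix: str) -> list[str]:
    """Command that unpacks all components of a .traineddata file to <prefix>*."""
    return ["combine_tessdata", "-u", traineddata, prefix]


def dawg_to_wordlist_command(unicharset: str, dawg: str, wordlist: str) -> list[str]:
    """Command that converts a DAWG into a plain-text word list."""
    return ["dawg2wordlist", unicharset, dawg, wordlist]


def wordlist_to_dawg_command(wordlist: str, dawg: str, unicharset: str) -> list[str]:
    """Command that converts a plain-text word list back into a DAWG."""
    return ["wordlist2dawg", wordlist, dawg, unicharset]


def repack_command(traineddata: str, component: str) -> list[str]:
    """Command that overwrites one component inside an existing .traineddata file."""
    return ["combine_tessdata", "-o", traineddata, component]


def merge_words(existing: list[str], additions: list[str]) -> list[str]:
    """Merge two word lists: strip whitespace, drop blanks and duplicates, sort."""
    words = {w.strip() for w in existing} | {w.strip() for w in additions}
    words.discard("")
    return sorted(words)


# Hypothetical file names, assuming the model was unpacked with prefix "deu_latf."
steps = [
    unpack_command("deu_latf.traineddata", "deu_latf."),
    dawg_to_wordlist_command(
        "deu_latf.lstm-unicharset", "deu_latf.lstm-word-dawg", "words.txt"
    ),
    # ... edit words.txt here, e.g. with merge_words(), then rebuild and repack:
    wordlist_to_dawg_command(
        "words.txt", "deu_latf.lstm-word-dawg", "deu_latf.lstm-unicharset"
    ),
    repack_command("deu_latf.traineddata", "deu_latf.lstm-word-dawg"),
]
```

Each entry in `steps` can be passed to `subprocess.run(cmd, check=True)` once the training tools are installed. Work on a copy of the `.traineddata` file, since the repack step overwrites it. Passing a unicharset that does not match the DAWG to `dawg2wordlist` may be one reason for unreadable output like the example above, but that is only a guess.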

stweil commented 6 months ago

Don't use deu_latf for Fraktur. Try https://zenodo.org/records/10125246 instead.

More models here: https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/. My latest models for historic texts are called "german_print*".

stweil commented 6 months ago

See also https://ocr-bw.bib.uni-mannheim.de/faq/ (German).

Stond0cyborg commented 6 months ago

Then I thank you very much, Mr. Weil, and wish you continued success with your program! ;)