mittagessen / kraken

OCR engine for all the languages
http://kraken.re
Apache License 2.0

font recognition? #378

Closed jsbien closed 2 years ago

jsbien commented 2 years ago

Do I guess correctly that kraken, like recent versions of tesseract, is unable to distinguish e.g. bold, roman and italics?

mittagessen commented 2 years ago

Out of the box, that is indeed correct. Some people have trained models that do this kind of text style analysis in conjunction with text recognition, and it works quite well, but you'd have to prepare a dataset yourself for this purpose. The basic idea is to add an additional token for each style to be recognized and use these tokens as markers in the ground truth. From others' experiments, per-word style tokens seem to work best, e.g. "sample text for demonstration" -> "sample $text for #demonstration".
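A minimal sketch of how such ground truth lines could be produced from words that already carry a style label. The function name `mark_styles` and the style-to-token mapping are hypothetical: the comment above does not say which style "$" and "#" stand for, so the assignment below is only an assumption for illustration.

```python
# Hypothetical mapping from style name to marker token. Which style "$" and
# "#" denote is an assumption; the thread only shows the resulting string.
STYLE_TOKENS = {"roman": "", "italic": "$", "bold": "#"}


def mark_styles(words):
    """Turn (word, style) pairs into one ground-truth line with per-word style tokens."""
    return " ".join(STYLE_TOKENS[style] + word for word, style in words)


line = mark_styles([
    ("sample", "roman"),
    ("text", "italic"),
    ("for", "roman"),
    ("demonstration", "bold"),
])
print(line)
```

Whatever tokens are chosen, they should be characters that never occur in the transcribed text itself, otherwise a marker cannot be told apart from ordinary content.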

jsbien commented 2 years ago

Thanks for your quick answer! Can you be more specific about "Some people"? :-)

mittagessen commented 2 years ago

@alix-tz was one of them, if I remember correctly. There was also a Catalan dictionary project whose results and approach I've only heard of indirectly (I can't find their name right now). I had a poster a couple of years ago evaluating a basic duplication approach (one separate label for each character and its style variants), but for word-segmenting languages style-indicating markers on the word level seem to work better.
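For contrast, the duplication approach can be pictured roughly as follows. This is only an illustrative sketch, not the method from the poster: the function name `duplication_labels`, the representation of a label as a `(character, style)` pair, and the choice to treat spaces as unstyled are all assumptions.

```python
def duplication_labels(words):
    """Return one (character, style) label per character of the line.

    Spaces between words are treated as unstyled ("roman"); this is an
    assumption made for the sketch.
    """
    labels = []
    for index, (word, style) in enumerate(words):
        if index > 0:
            labels.append((" ", "roman"))
        labels.extend((char, style) for char in word)
    return labels


words = [("sample", "roman"), ("text", "italic")]
labels = duplication_labels(words)
alphabet = set(labels)  # distinct labels the model would have to learn
print(len(labels), len(alphabet))
```

Here the roman "e" and the italic "e" become two different labels, so the label set can grow up to the number of characters times the number of styles, whereas word-level markers add only one extra token per style.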

jsbien commented 2 years ago

Thanks again.

PonteIneptique commented 2 years ago

Hey, @gabays co-created a few datasets around this question. See https://hal.archives-ouvertes.fr/hal-03355683

On Tue, 9 Aug 2022 at 6:20 AM, Janusz S. Bień wrote:

Thanks again.


jsbien commented 2 years ago

Thanks for the link!

alix-tz commented 2 years ago

Hi!

Indeed, I used special characters to:

I don't have datasets handy because it was an experimentation and it was not implemented in the end. But it did work!

image

would be transcribed so that each underlined word is preceded by a "_":

The quick _brown _fox _jumps _over the lazy dog

We also used "^" (although we will probably switch to another character) to mark up superscript text, which is in some way an adaptation of this mechanism.
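A rough sketch of how a line transcribed with the "_" convention could be turned back into words with style information after recognition. The function name `split_styles`, the `MARKERS` table and the style names are hypothetical. Only the "_" prefix is taken from the example above; how exactly "^" is placed around superscript text is not stated, so it is left out.

```python
# Marker prefix -> style name. Only "_" comes from the example in this
# thread; the style names themselves are made up for the sketch.
MARKERS = {"_": "underlined"}


def split_styles(line, markers=MARKERS):
    """Split a recognized line into (word, style) pairs based on marker prefixes."""
    pairs = []
    for token in line.split():
        style = markers.get(token[0])
        if style is None:
            pairs.append((token, "plain"))
        else:
            # Drop the marker character and keep the style it stands for.
            pairs.append((token[1:], style))
    return pairs


pairs = split_styles("The quick _brown _fox _jumps _over the lazy dog")
print(pairs)
```

Further markers can be added to the table as long as they are used as word prefixes in the same way.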

jsbien commented 2 years ago

Thanks for all the answers! Let me briefly present my motivation and limitations. I would like to run some experiments with Słownik Geograficzny Królestwa Polskiego. Some small fragments have already been transcribed and can serve as ground truth. Actually, my dream is to produce high-quality OCR of Linde's dictionary, but this task is much more difficult because it is multilingual. In both dictionaries the information about the font is very important. I retired several years ago, so I have to work single-handedly or try to organize crowdsourcing. I'm not a programmer, so I have to use the existing tools,