Closed jsbien closed 2 years ago
Out of the box that is indeed correct. Some people have trained models that do this kind of text style analysis in conjunction with text recognition and it works quite well, but you'd have to prepare a dataset yourself for this purpose. The basic idea is to add an additional token for each style to be recognized and use them as markers in the ground truth. From others' experiments, per-word style tokens seem to work best, e.g. "sample text for demonstration" -> "sample $text for #demonstration".
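To make the marker idea concrete, here is a minimal sketch of how ground truth lines could be prepared. It is plain Python and does not use any kraken API; the `STYLE_MARKERS` mapping and the `encode_line` helper are hypothetical names, and assigning "$" to italic and "#" to bold is only an assumption for illustration (the example above does not say which style each marker stands for).

```python
# Hypothetical mapping from style name to marker character.
# Which marker denotes which style is an assumption made for this sketch.
STYLE_MARKERS = {"italic": "$", "bold": "#"}


def encode_line(words):
    """Build a ground truth line from (text, style) pairs.

    `style` is None for unstyled words, otherwise a key of STYLE_MARKERS.
    Each styled word gets its marker prepended (per-word style tokens).
    """
    encoded = []
    for text, style in words:
        marker = STYLE_MARKERS[style] if style is not None else ""
        encoded.append(marker + text)
    return " ".join(encoded)


line = [
    ("sample", None),
    ("text", "italic"),
    ("for", None),
    ("demonstration", "bold"),
]
print(encode_line(line))  # sample $text for #demonstration
```

The marker characters should be ones that never occur in the source text itself, otherwise the model cannot tell a marker from a genuine character.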
Thanks for your quick answer! Can you be more specific about "Some people"? :-)
@alix-tz was one of them, if I remember correctly. There was also a Catalan dictionary project whose results and approach I've only heard of indirectly (I can't find their name right now). I had a poster a couple of years ago evaluating a basic duplication approach (one separate label for each character and its style variants), but for word-segmenting languages, style-indicating markers at the word level seem to work better.
Thanks again.
Hey, @gabays co-created a few datasets around this question. See https://hal.archives-ouvertes.fr/hal-03355683
Thanks for the link!
Hi!
Indeed, I used special characters to:
I don't have datasets handy because it was an experiment and it was not implemented in the end. But it did work!
would be transcribed so that each underlined word is preceded by a "_".
The quick _brown _fox _jumps _over the lazy dog
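For what it's worth, turning such a marked-up line back into words plus style information is straightforward. The following is only a sketch in plain Python, not part of kraken or of the experiment described here; the `decode_line` helper and the `markers` mapping are hypothetical names, and it assumes that markers only ever appear as the first character of a word.

```python
def decode_line(line, markers=None):
    """Split a recognized line into (word, style) pairs.

    `markers` maps a leading marker character to a style name.
    Words without a known leading marker get style None.
    """
    if markers is None:
        # Assumption for this sketch: "_" marks an underlined word.
        markers = {"_": "underlined"}
    decoded = []
    for token in line.split():
        style = markers.get(token[0])
        if style is not None:
            decoded.append((token[1:], style))  # drop the marker
        else:
            decoded.append((token, None))
    return decoded


pairs = decode_line("The quick _brown _fox _jumps _over the lazy dog")
underlined = [word for word, style in pairs if style == "underlined"]
print(underlined)  # ['brown', 'fox', 'jumps', 'over']
```

Other markers could be handled the same way by adding entries to the `markers` mapping.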
We also used "^" (although we will probably switch to another character) to mark up superscripted text, which is in some way an adaptation of this mechanism.
Thanks for all the answers! Let me briefly present my motivation and limitations. I would like to run some experiments with Słownik Geograficzny Królestwa Polskiego. Some small fragments have already been transcribed and can serve as ground truth. Actually, my dream is to produce a high-quality OCR of Linde's dictionary, but this task is much more difficult because it is multilingual. In both dictionaries the information about the font is very important. I retired several years ago, so I have to work single-handedly or try to organize crowdsourcing. I'm not a programmer, so I have to use the existing tools.
Am I right in guessing that kraken, like recent versions of tesseract, is unable to distinguish e.g. bold, roman and italics?