tberg12 / ocular

Ocular is a state-of-the-art historical OCR system.
GNU General Public License v3.0

Training for Finnish (which also contains the letters ä ö å Ä Ö Å) #4

Closed: jmokoistinen closed this issue 8 years ago

jmokoistinen commented 8 years ago

I created a model for Finnish by:

  1. initializing a language model for one language (Finnish), and
  2. initializing a font (537939 words in the training data).

Training the font with only 3 training images produces this kind of text; the results from those same 3 images look weird.

What could be wrong?

eval_diplomatic.txt

| Document | CER, keep punc | CER, keep punc, allow f->s | CER, remove punc | CER, remove punc, allow f->s | WER, keep punc | WER, keep punc, allow f->s | WER, remove punc | WER, remove punc, allow f->s |
|---|---|---|---|---|---|---|---|---|
| sample_images/finnish/1.jpg | 0.974025974025974 | 0.974025974025974 | 0.9970326409495549 | 0.9970326409495549 | 1.0 | 1.0 | 1.0 | 1.0 |
| sample_images/finnish/2.jpg | 0.98094688221709 | 0.98094688221709 | 0.9940369707811568 | 0.9940369707811568 | 1.0 | 1.0 | 1.0 | 1.0 |
| sample_images/finnish/3.jpg | 0.9778783308195073 | 0.9778783308195073 | 0.9947451392538098 | 0.9947451392538098 | 1.0 | 1.0 | 0.996309963099631 | 0.996309963099631 |
| Macro-avg total eval | 0.9776170623541904 | 0.9776170623541904 | 0.9952715836615071 | 0.9952715836615071 | 1.0 | 1.0 | 0.998769987699877 | 0.998769987699877 |

1_transcription.txt

uºгp

(

П

(
(
(

(

(


m
A
(
(
(
(
(
(

tºг⅙

1_comparisons.txt (MD: model diplomatic transcription, GD: gold diplomatic transcription)

MD: GD: my\"oten erinnyt, yksi osa seurannut Kamajokea pohjaseen p\"ain,

MD: .őд1 GD: toinen Wolgajokea l\"ansiluoteesen ja kolmas wasta nimitetty\"a

MD: ú GD: jokea Kaspiamereen p\"ain. Ensimm\"ainen osa, Karjalai- ....

Thanks, Mika Koistinen

dhgarrette commented 8 years ago

The use of diacritics shouldn't make a difference; we use them without issues.

When the model outputs total junk, it usually means that the image is too hard to read. If you pass an -extractedLinesPath, it will write out the binarized images that it's actually using. Check those to see whether they're coming out very light or very dark, and adjust -binarizeThreshold if necessary.
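The light/dark check above can be pictured with a toy sketch. This is plain Python for illustration, not Ocular's actual code, and the threshold semantics here are an assumption: pixels darker than the threshold count as ink.

```python
# Toy binarization sketch (NOT Ocular's implementation). `threshold`
# plays the role of Ocular's -binarizeThreshold option; the exact
# semantics inside Ocular may differ.

def binarize(gray_pixels, threshold=0.5):
    """Map grayscale intensities in [0, 1] to ink (1) or background (0)."""
    return [1 if p < threshold else 0 for p in gray_pixels]

def ink_fraction(binary_pixels):
    """Fraction of pixels classified as ink: near 0.0 means the binarized
    image came out very light, near 1.0 means very dark."""
    return sum(binary_pixels) / len(binary_pixels)

# A faint scan: most intensities sit just above a mid threshold, so almost
# nothing is classified as ink until the threshold is raised.
faint_line = [0.60, 0.62, 0.58, 0.61, 0.35, 0.63]
print(ink_fraction(binarize(faint_line, threshold=0.5)))  # only the 0.35 pixel is ink
print(ink_fraction(binarize(faint_line, threshold=0.7)))  # every pixel becomes ink
```

If the extracted line images are nearly blank, the effective threshold is too low for a faint scan; if they're nearly solid black, it's too high for a dark one.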

Otherwise, maybe send me the image files and LM training data and I'll have a look?

dhgarrette commented 8 years ago

Closing this issue since I haven't heard back.

I was able to get it working by initializing a new LM with a more compact character set (trained from some Project Gutenberg books). If the model has access to too many very-low-frequency characters, EM gets confused: it thinks it needs to give probability mass to them, when in reality it would do better to ignore them completely.

My model used:

[ , !, ", &, ', (, ), *, ,, -, ., /, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, :, ;, ?, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, [, \"A, \"O, \"a, \"e, \"o, \"u, \'a, \'e, \'o, \^a, \`a, \`e, \cc, ], _, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, §, ¶, Å, å]

finnish3.lmser contains:

[ , !, ", &, ', (, ), *, +, ,, -, ., /, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, :, ;, =, ?, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, R, S, T, U, V, W, X, Y, Z, [, \"A, \"O, \"a, \"o, \"u, \'a, \'e, \`a, ], a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, §, «, ¶, », ¼, ½, ¾, Å, å, æ, ˮ, ˶, а, в, е, и, к, л, м, н, о, р, у, ъ, —, ’, “, ”, „, †, ⅓]

Try initializing a new LM, or increasing the character count threshold using the -minCharCount option on InitializeLanguageModel.
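The effect of such a count threshold can be sketched in a few lines of plain Python. This illustrates the idea behind -minCharCount, not Ocular's InitializeLanguageModel code, and the function name is invented:

```python
from collections import Counter

# Toy illustration (NOT Ocular's implementation) of a minimum
# character-count threshold when building an LM character set:
# characters appearing fewer than `min_char_count` times in the
# training text are dropped, so EM never has to assign them mass.

def charset(training_text, min_char_count=1):
    counts = Counter(training_text)
    return sorted(c for c, n in counts.items() if n >= min_char_count)

text = "kissa istuu ääneti — ½ sivua luettu"
print(charset(text, min_char_count=1))  # includes the stray '—' and '½'
print(charset(text, min_char_count=2))  # one-off symbols fall out
```

The trade-off is visible even in this toy: raising the threshold also drops genuine but rare letters, which is why curating the training text (e.g. the Project Gutenberg books above) can beat a purely frequency-based cut.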

MilosStanic commented 6 years ago

Hello,

I don't want to open a new issue since I'm having a similar problem. First of all, I must say I'm not a Java guy. Never used it before. I'm more of a PHP/JavaScript guy. Second, I'm a total newbie at machine learning. Just started an online course at Udemy.

However, I needed to use Ocular to OCR an old dictionary of the Serbian language. It has 6 volumes and about 5500 pages. The source is a batch of PDFs of the scanned dictionary, which I then split into .png pages. The problem is that this was a phototypic reprint of the original dictionary: the quality of the scan is good, but the original was very bad because of ink smudges. Here's a sample page: rms1-0830

The other problem is the nature of the text itself. The dictionary uses accented vowels to denote the pronunciation of words, and these accented vowels are not present in the ordinary Serbian texts that I can use to initialize a language model and train a font. One more problem is that the italic letters are quite different from the regular letters in the Cyrillic alphabet.

Seven years ago I used ABBYY FineReader, and it produced pretty decent results; however, the main thing, the dictionary headwords (lemmas), were not recognized correctly because of the accent marks.

Now, what I'm getting from Ocular is this:

 _________________________________________________________________________________į”,
џ
(
*
И
*
И
[… the pair `*` / `И` repeats for dozens of lines …]
Ш_—Yв
*
И
[… more of the same …]
 Ш—YJ
*

Which is nowhere near perfect. My language model included the following symbols:

Loading initial LM from lm/serbian.lmser
Loaded CodeSwitchLanguageModel from lm/serbian.lmser
    NoLanguageNameGiven: [ , !, ", &, ', (, ), *, +, ,, -, ., /, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, :, ;, =, ?, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, R, S, T, U, V, W, X, Y, Z, [, ], _, a, á, ä, b, c, d, e, é, ë, f, g, h, i, í, j, k, l, m, n, o, ó, p, q, r, s, t, u, ü, v, w, x, y, z, §, °, ¶, ć, č, ř, š, ž, Ђ, Ѓ, Ј, Љ, Њ, Ћ, Џ, А, Б, В, Г, Д, Е, Ж, З, И, К, Л, М, Н, О, П, Р, С, Т, У, Ф, Х, Ц, Ч, Ш, Я, а, б, в, г, д, е, ж, з, и, й, к, л, м, н, о, п, р, с, т, у, ф, х, ц, ч, ш, щ, ъ, ы, ь, ю, я, ё, ђ, ј, љ, њ, ћ, ќ, џ, –, —, ’, “, ”, „, …]
Characters: [ , !, ", &, ', (, ), *, +, ,, -, ., /, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, :, ;, <, =, >, ?, A, B, C, Ç, D, E, É, Ë, F, G, H, I, Í, J, K, L, M, N, O, Ó, P, R, S, T, U, Ú, Ü, V, W, X, Y, Z, [, ], _, `, a, à, á, â, ã, ă, ä, b, b̃, c, c̃, ç, d, d̃, e, è, é, ê, ẽ, ë, f, f̃, g, g̃, h, h̃, i, í, î, ĩ, j, j̃, k, k̃, l, l̃, m, m̃, n, ñ, o, ó, õ, ö, p, p̃, q, q̃, r, r̃, s, s̃, t, t̃, u, ú, ũ, ü, v, ṽ, w, w̃, x, x̃, y, ỹ, z, z̃, ¤, §, °, ¶, ¸, Æ, æ, ý, ć, Č, č, Đ, ě, į, ł, ň, Œ, œ, ř, ş, Š, š, ţ, ž, ſ, ζ, ψ, ω, Ђ, Ѓ, І, Ј, Љ, Њ, Ћ, Џ, А, Б, В, Г, Д, Е, Ж, З, И, Й, К, Л, М, Н, О, П, Р, С, Т, У, Ф, Х, Ц, Ч, Ш, Ъ, Ю, Я, а, б, в, г, д, е, ж, з, и, й, к, л, м, н, о, п, р, с, т, у, ф, х, ц, ч, ш, щ, ъ, ы, ь, э, ю, я, ё, ђ, ѓ, і, ј, љ, њ, ћ, ќ, џ, –, —, ‘, ’, “, ”, „, …, ]
Num characters: 248

This model includes some symbols from the Russian and Greek alphabets, and some Latin letters with diacritics, which are totally unnecessary for my purpose. I understand I can eliminate those extras by raising the character-frequency threshold while training the language model. But what I don't know is how to insert the needed accented vowels into the model, because I don't have a text that uses them; I trained my model using PDFs of books converted to txt files.

Another question: how do I preserve the text formatting after OCR? Is it a problem for Ocular that the text is in two columns? Will it recognize and preserve the columns, or will it produce one line of text by concatenating the left and right columns? My final goal is to parse the OCRed text and produce a searchable database of words and definitions, i.e. an electronic dictionary.

To summarize: is Ocular suitable for this task?

Thank you very much.