Closed jmokoistinen closed 8 years ago
The use of diacritics shouldn't make a difference; we use them without issues.
When the model outputs total junk, it usually means that the image is too hard to read. If you set -extractedLinesPath, it will write out the binarized line images that it's actually using. Check those to see whether they're coming out very light or very dark, and adjust -binarizeThreshold if necessary.
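For a rough numeric check of "very light or very dark", one option is to measure the share of dark pixels in each extracted line image. The sketch below is only an illustration and is not part of Ocular: it assumes the image has already been loaded into a plain 2D list of grayscale values (0 = black, 255 = white), and the names `dark_fraction` and `describe`, the cutoff of 128, and the 5% / 60% bounds are all hypothetical choices made for the example.

```python
def dark_fraction(pixels, cutoff=128):
    """Return the share of pixels darker than `cutoff` (0 = black, 255 = white)."""
    total = sum(len(row) for row in pixels)
    if total == 0:
        raise ValueError("empty image")
    dark = sum(1 for row in pixels for value in row if value < cutoff)
    return dark / total


def describe(pixels, low=0.05, high=0.60):
    """Label a line image as too light, too dark, or plausible.

    The `low` and `high` bounds are illustrative assumptions, not Ocular defaults.
    """
    frac = dark_fraction(pixels)
    if frac < low:
        return "too light"
    if frac > high:
        return "too dark"
    return "plausible"
```

If most extracted lines come back "too light" or "too dark" under bounds that suit your scans, that suggests the binarization threshold is worth adjusting before anything else.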
Otherwise, maybe send me the image files and LM training data and I'll have a look?
Closing this issue since I haven't heard back.
I was able to get it working by initializing a new LM with a more succinct character set (trained from some Project Gutenberg books). If the model has access to too many very-low-frequency characters, EM gets confused because it thinks it needs to give probability mass to them, when in reality it would do better just to ignore them completely.
My model used:
[ , !, ", &, ', (, ), *, ,, -, ., /, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, :, ;, ?, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, [, \"A, \"O, \"a, \"e, \"o, \"u, \'a, \'e, \'o, \^a, \`a, \`e, \cc, ], _, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, §, ¶, Å, å]
finnish3.lmser contains:
[ , !, ", &, ', (, ), *, +, ,, -, ., /, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, :, ;, =, ?, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, R, S, T, U, V, W, X, Y, Z, [, \"A, \"O, \"a, \"o, \"u, \'a, \'e, \`a, ], a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, §, «, ¶, », ¼, ½, ¾, Å, å, æ, ˮ, ˶, а, в, е, и, к, л, м, н, о, р, у, ъ, —, ’, “, ”, „, †, ⅓]
Try initializing a new LM, or increasing the character count threshold using the -minCharCount option on InitializeLanguageModel.
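Before retraining, it can help to see which characters in the training text are rare enough to be dropped by a given count threshold. The following is a minimal sketch under stated assumptions: it counts Unicode characters in a plain string, whereas the character lists above use escaped forms such as `\"a`, so the counts will not correspond one-to-one with what InitializeLanguageModel sees. The function names are hypothetical.

```python
from collections import Counter


def charset_above_threshold(text, min_count):
    """Return the sorted characters that occur at least `min_count` times."""
    counts = Counter(ch for ch in text if not ch.isspace())
    return sorted(ch for ch, n in counts.items() if n >= min_count)


def rare_characters(text, min_count):
    """Return (character, count) pairs below `min_count`, rarest first."""
    counts = Counter(ch for ch in text if not ch.isspace())
    return sorted(
        ((ch, n) for ch, n in counts.items() if n < min_count),
        key=lambda pair: (pair[1], pair[0]),
    )
```

Running `rare_characters` over the training corpus with a few candidate thresholds shows which stray symbols (fractions, foreign letters, unusual quotation marks) would disappear at each setting.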
Hello,
I don't want to open a new issue since I'm having a similar problem. First of all, I must say I'm not a Java guy. Never used it before. I'm more of a PHP/JavaScript guy. Second, I'm a total newbie at machine learning. Just started an online course at Udemy.
However, I needed to use ocular to OCR an old dictionary of the Serbian language. It has 6 volumes and about 5500 pages. The source is a batch of PDFs of the scanned dictionary, which I then split into .png pages. The problem is that this was a phototypic reprint of the original dictionary. The quality of the scan is good, but the original print was very bad because of ink smudges. Here's a sample page:
The other problem is the type of text itself. The dictionary uses accented vowels to denote the pronunciation of words. These accented vowels do not appear in the ordinary Serbian texts that I can use to initialize a language model and train a font. One more problem is that italic letters are quite different from the upright letters in the Cyrillic alphabet.
Seven years ago I used ABBYY FineReader, and it produced pretty decent results. However, the main thing, the dictionary headwords (lemmas), were not recognized correctly because of the accent marks.
Now, what I'm getting from ocular is this:
_________________________________________________________________________________į”,
џ
(
*
И
*
И
*
И
*
И
*
И
*
И
*
И
*
И
*
И
*
И
*
И
*
И
*
И
*
И
*
Ш_—Yв
*
И
*
И
*
И
*
И
*
И
*
И
*
И
*
И
*
И
*
И
*
И
*
И
*
И
*
И
*
И
*
Ш—YJ
*
Which is nowhere near perfect. My language model included the following symbols:
Loading initial LM from lm/serbian.lmser
Loaded CodeSwitchLanguageModel from lm/serbian.lmser
NoLanguageNameGiven: [ , !, ", &, ', (, ), *, +, ,, -, ., /, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, :, ;, =, ?, A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, R, S, T, U, V, W, X, Y, Z, [, ], _, a, á, ä, b, c, d, e, é, ë, f, g, h, i, í, j, k, l, m, n, o, ó, p, q, r, s, t, u, ü, v, w, x, y, z, §, °, ¶, ć, č, ř, š, ž, Ђ, Ѓ, Ј, Љ, Њ, Ћ, Џ, А, Б, В, Г, Д, Е, Ж, З, И, К, Л, М, Н, О, П, Р, С, Т, У, Ф, Х, Ц, Ч, Ш, Я, а, б, в, г, д, е, ж, з, и, й, к, л, м, н, о, п, р, с, т, у, ф, х, ц, ч, ш, щ, ъ, ы, ь, ю, я, ё, ђ, ј, љ, њ, ћ, ќ, џ, –, —, ’, “, ”, „, …]
Characters: [ , !, ", &, ', (, ), *, +, ,, -, ., /, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, :, ;, <, =, >, ?, A, B, C, Ç, D, E, É, Ë, F, G, H, I, Í, J, K, L, M, N, O, Ó, P, R, S, T, U, Ú, Ü, V, W, X, Y, Z, [, ], _, `, a, à, á, â, ã, ă, ä, b, b̃, c, c̃, ç, d, d̃, e, è, é, ê, ẽ, ë, f, f̃, g, g̃, h, h̃, i, í, î, ĩ, j, j̃, k, k̃, l, l̃, m, m̃, n, ñ, o, ó, õ, ö, p, p̃, q, q̃, r, r̃, s, s̃, t, t̃, u, ú, ũ, ü, v, ṽ, w, w̃, x, x̃, y, ỹ, z, z̃, ¤, §, °, ¶, ¸, Æ, æ, ý, ć, Č, č, Đ, ě, į, ł, ň, Œ, œ, ř, ş, Š, š, ţ, ž, ſ, ζ, ψ, ω, Ђ, Ѓ, І, Ј, Љ, Њ, Ћ, Џ, А, Б, В, Г, Д, Е, Ж, З, И, Й, К, Л, М, Н, О, П, Р, С, Т, У, Ф, Х, Ц, Ч, Ш, Ъ, Ю, Я, а, б, в, г, д, е, ж, з, и, й, к, л, м, н, о, п, р, с, т, у, ф, х, ц, ч, ш, щ, ъ, ы, ь, э, ю, я, ё, ђ, ѓ, і, ј, љ, њ, ћ, ќ, џ, –, —, ‘, ’, “, ”, „, …, ]
Num characters: 248
This model includes some symbols from the Russian and Greek alphabets, as well as Latin letters with diacritics, all of which are unnecessary for my purpose. I understand that I can eliminate those extras by raising the minimum character count threshold when training the language model. What I don't know is how to get the accented vowels I need into the model, because I don't have any text that uses them. I trained my model on PDFs of books converted to .txt files.

Another question is how to preserve the text formatting after OCR. Is it a problem for ocular that the text is in two columns? Will it recognize and preserve the columns, or will it concatenate the left and right columns into a single line of text?

My final goal is to parse the OCR output and produce a searchable database of words and definitions, i.e. an electronic dictionary.
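To make the "unnecessary symbols" concrete, the character list printed by the model can be compared against the 30 letters of the Serbian Cyrillic alphabet. This is just a sketch for auditing the list, not an Ocular feature: the function name `unexpected_characters` and the `extra_allowed` parameter are hypothetical, and it does nothing to solve the missing accented vowels.

```python
# The 30 lowercase letters of the Serbian Cyrillic alphabet.
SERBIAN_CYRILLIC_LOWER = "абвгдђежзијклљмнњопрстћуфхцчџш"


def unexpected_characters(model_chars, extra_allowed=""):
    """Return the entries of `model_chars` that are not Serbian Cyrillic letters.

    `extra_allowed` is a string of additional characters to accept
    (digits, punctuation, accented vowels, and so on).
    Input order is preserved.
    """
    allowed = (
        set(SERBIAN_CYRILLIC_LOWER)
        | set(SERBIAN_CYRILLIC_LOWER.upper())
        | set(extra_allowed)
    )
    return [ch for ch in model_chars if ch not in allowed]
```

For example, passing the list `["а", "ы", "ζ", "ђ"]` returns the Russian `ы` and the Greek `ζ`, which are the kind of entries a higher count threshold should remove.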
To summarize: is ocular suitable for this task?
Thank you very much.
I created a model for Finnish, but the output is junk.
What could be wrong?
eval_diplomatic.txt
Document: sample_images/finnish/1.jpg
CER, keep punc: 0.974025974025974
CER, keep punc, allow f->s: 0.974025974025974
CER, remove punc: 0.9970326409495549
CER, remove punc, allow f->s: 0.9970326409495549
WER, keep punc: 1.0
WER, keep punc, allow f->s: 1.0
WER, remove punc: 1.0
WER, remove punc, allow f->s: 1.0

Document: sample_images/finnish/2.jpg
CER, keep punc: 0.98094688221709
CER, keep punc, allow f->s: 0.98094688221709
CER, remove punc: 0.9940369707811568
CER, remove punc, allow f->s: 0.9940369707811568
WER, keep punc: 1.0
WER, keep punc, allow f->s: 1.0
WER, remove punc: 1.0
WER, remove punc, allow f->s: 1.0

Document: sample_images/finnish/3.jpg
CER, keep punc: 0.9778783308195073
CER, keep punc, allow f->s: 0.9778783308195073
CER, remove punc: 0.9947451392538098
CER, remove punc, allow f->s: 0.9947451392538098
WER, keep punc: 1.0
WER, keep punc, allow f->s: 1.0
WER, remove punc: 0.996309963099631
WER, remove punc, allow f->s: 0.996309963099631

Macro-avg total eval:
CER, keep punc: 0.9776170623541904
CER, keep punc, allow f->s: 0.9776170623541904
CER, remove punc: 0.9952715836615071
CER, remove punc, allow f->s: 0.9952715836615071
WER, keep punc: 1.0
WER, keep punc, allow f->s: 1.0
WER, remove punc: 0.998769987699877
WER, remove punc, allow f->s: 0.998769987699877
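For readers unfamiliar with the metric: character error rate (CER) is commonly defined as the edit distance between the model output and the gold transcription, divided by the length of the gold transcription, so values near 1.0 mean that almost every gold character had to be inserted or corrected. The sketch below shows that common definition only; how Ocular normalizes text or handles punctuation before scoring is not covered here, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings `a` and `b` (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(model, gold):
    """Character error rate: edit distance divided by gold length."""
    if not gold:
        raise ValueError("gold transcription must not be empty")
    return edit_distance(model, gold) / len(gold)
```

Under this definition, an empty or unrelated model line scored against a full gold line gives a CER of about 1.0, which matches the pattern in the numbers above.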
1_transcription.txt
uºгp
(
П
(
(
(
(
(
⅗
m
A
(
(
(
(
(
(
tºг⅙
˶
=
⅗
v
1_comparisons.txt
MD: Model diplomatic transcription
GD: Gold diplomatic transcription

MD:
GD: my\"oten erinnyt, yksi osa seurannut Kamajokea pohjaseen p\"ain,
MD: .őд1
GD: toinen Wolgajokea l\"ansiluoteesen ja kolmas wasta nimitetty\"a
MD: ú
GD: jokea Kaspiamereen p\"ain. Ensimm\"ainen osa, Karjalai- ....
Thanks, Mika Koistinen