ryanfb / latinocr-lat

The 'lat' repository, forked from https://github.com/ryanfb/ancientgreekocr-grc: the final training process for lat.traineddata.
https://ryanfb.github.io/latinocr/
Apache License 2.0

Improve for scientific Latin #3

Open wollmers opened 8 years ago

wollmers commented 8 years ago

I did my first tests with scientific Latin texts. While for German Fraktur the Tesseract results are usually significantly better than the ABBYY results on archive.org, with Latin it seems to be the opposite.

Example files: https://github.com/wollmers/ocr-lat-bio-testfiles/tree/master/Scopoli_1763_vindobona

Comparison script: https://github.com/wollmers/ocr-measures/blob/master/ocr_compare.pl

Tesseract: https://github.com/wollmers/ocr-lat-bio-testfiles/blob/master/Scopoli_1763_vindobona/ioannisantoiisc03scop_08.tess_3.04lat.compare.txt

             lines  words  chars
items ocr:      20     99    726 
items grt:      20    106    708 
matches:         1     43    558 
edits:          19     67    199 
 subss:         19     52    119 
 inserts:        0      4     49 
 deletions:      0     11     31 
precision:  0.0500 0.4343 0.7686 
recall:     0.0500 0.4057 0.7881 
accuracy:   0.0500 0.3909 0.7371 
f-score:    0.0500 0.4195 0.7782
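For reference, the score rows follow directly from the count rows. A quick awk check of the char column (the formulas are inferred from the numbers in the table, not taken from the comparison script, so treat them as an assumption: precision = matches/items_ocr, recall = matches/items_grt, accuracy = matches/(matches + edits), f-score = 2*matches/(items_ocr + items_grt)):

```shell
# Recompute the char-column scores from the raw counts in the table above.
# Formulas inferred from the numbers, not taken from ocr_compare.pl itself.
awk 'BEGIN {
  ocr = 726; grt = 708; matches = 558; edits = 199
  printf "precision: %.4f\n", matches / ocr
  printf "recall:    %.4f\n", matches / grt
  printf "accuracy:  %.4f\n", matches / (matches + edits)
  printf "f-score:   %.4f\n", 2 * matches / (ocr + grt)
}'
# precision: 0.7686
# recall:    0.7881
# accuracy:  0.7371
# f-score:   0.7782
```

All four reproduce the table, which suggests the metric definitions above are the ones the script uses.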

ABBYY: https://github.com/wollmers/ocr-lat-bio-testfiles/blob/master/Scopoli_1763_vindobona/ioannisantoniisc03scop_08.abby.compare.txt

             lines  words  chars
items ocr:      20     86    687 
items grt:      20    106    707 
matches:         1     48    628 
edits:          19     58     86 
 subss:         19     38     52 
 inserts:        0      0      7 
 deletions:      0     20     27 
precision:  0.0500 0.5581 0.9141 
recall:     0.0500 0.4528 0.8883 
accuracy:   0.0500 0.4528 0.8796 
f-score:    0.0500 0.5000 0.9010

With this clean test image (no distortions, no shadows, no mud) I would expect scores around 0.9 for words and around 0.95 for chars, even without special training.

What could the reasons be?

In the next few days I can try to get better results with modified training and an expanded wordlist (I will keep the new words separate, as I do with orthographic variants).

Maybe my lat.traineddata is outdated, but it is the official one fetched from tesseract-ocr.

ryanfb commented 8 years ago

Not sure who's been working on the official tesseract-ocr lat.traineddata (I don't think it was there before 3.04?), but it looks to be built from the langdata repo with the new tesstrain.sh script: https://github.com/tesseract-ocr/langdata/tree/master/lat
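For reference, a langdata-based run with the new script looks roughly like the following. This is only a sketch: the directory layout and the font list are assumptions, so check the script's usage on your checkout before running it.

```shell
# Rough sketch of a tesstrain.sh invocation for lat against the langdata
# repo. All paths and the font list are placeholders, not tested values.
./training/tesstrain.sh \
    --lang lat \
    --langdata_dir ../langdata \
    --tessdata_dir ./tessdata \
    --fonts_dir /usr/share/fonts \
    --fontlist "EB Garamond" "EB Garamond Italic" \
    --output_dir ./lat-tessdata
```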

I haven't yet taken the time to look into moving my training process/sources into the new tesstrain system, and testing how that affects things. The md5sum for the official built lat.traineddata is 36e6efb824f9b28a4f928945a1907e81, which doesn't match any of my releases. If I run my v0.2.2 lat.traineddata release against your example in Tesseract 3.04 I get: https://gist.github.com/ryanfb/df93833292c2c1695e01
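(To see which build you actually have installed, check its hash; the tessdata path below is a guess for a Debian-style install of 3.04, so adjust it for your system:)

```shell
# Compare the installed traineddata against the official build's hash.
# The tessdata path is a guess; adjust for your install.
md5sum /usr/share/tesseract-ocr/tessdata/lat.traineddata
# official langdata build: 36e6efb824f9b28a4f928945a1907e81
```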

And with your comparison script: https://gist.github.com/ryanfb/3618b87da056c9f71496

             lines  words  chars
items ocr:      20    100    729 
items grt:      20    106    707 
matches:         1     53    600 
edits:          19     53    145 
 subss:         19     47     91 
 inserts:        0      0     38 
 deletions:      0      6     16 
precision:  0.0500 0.5300 0.8230 
recall:     0.0500 0.5000 0.8487 
accuracy:   0.0500 0.5000 0.8054 
f-score:    0.0500 0.5146 0.8357 

So, a bit better, but not great (though some char inaccuracy may be due to my long-s normalization). I suspect some may be due to the font, which appears pretty different from most of the synthetic fonts I train against (to my eye/recollection). If there's a free synthetic font (e.g. OTF/TTF) that you know of which is similar, I can experiment with adding it to the training.

wollmers commented 8 years ago

Your trial looks much better (± long s). You may be right about the font issue, judging from the detailed alignment. I will see if I can find a better-fitting font and try to train with it. That will be hard, because as far as I have seen you already included all the good-quality fonts for Latin.

Thanks so far.

wollmers commented 8 years ago

No luck so far. I tried to build on Debian (on OS X it segfaults), but it ran for 12 hours, consuming 16 GB of memory and 20 GB of disk space, with no traineddata file as a result. I then reduced the fonts to the most similar one, EB Garamond, and the wordlist to 10 K words: same long runtime, still no result. Does it need something special? Does it need the newest Tesseract 3.04 (is 3.03.03-1 too old)? I will dedicate a machine running not-yet-stable Debian to it and give it another try.

A font which might be an improvement is Junicode, available on SourceForge and CTAN.
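(After installing it, you can check that text2image/Pango sees the font under the exact name that font_properties and the training tools must use; the fonts_dir below is an assumption:)

```shell
# List the fonts Pango can see and look for Junicode; the name printed
# here is the one to use in font_properties (fonts_dir is a placeholder).
text2image --fonts_dir /usr/share/fonts --list_available_fonts | grep -i junicode
```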

ryanfb commented 8 years ago

What is the failing build result/output on Debian (i.e. does it fail on cntraining, mftraining, text2image…)? Unfortunately mftraining takes a long time (several days in some instances, as I recall), partly because it isn't multi-threaded or easily split into parallelizable chunks. Previous traineddata files were built against 3.03-rc1 in Docker (with Ubuntu): https://github.com/ryanfb/tesseract_latinocr_docker

I'll try to look into Junicode, as well as see what the new tesstrain.sh system is like.

wollmers commented 8 years ago

I don't know where or why exactly it fails.

The *.tr files are all present.

What I copied from the console is, as far as I remember, the last output before it sat there for ~10 hours, consuming 100% CPU on one core and up to 16 GB of memory:

Extracting unicharset from lat.liga.EBGaramondItalic.exp3.box
Wrote unicharset file ./unicharset.
set_unicharset_properties -U unicharset -O lat.earlyunicharset --script_dir .
Loaded unicharset of size 140 from file unicharset
Setting unichar properties
Other case JOINED of Joined is not in unicharset
Other case |BROKEN|0|1 of |Broken|0|1 is not in unicharset
Other case ST of ſt is not in unicharset
Other case US of us is not in unicharset
Other case IS of is is not in unicharset
Other case AS of as is not in unicharset
Other case FF of ff is not in unicharset
Other case ES of es is not in unicharset
Other case FI of fi is not in unicharset
Other case FFL of ffl is not in unicharset
Other case FL of fl is not in unicharset
Other case GG of gg is not in unicharset
Other case SB of ſb is not in unicharset
Other case FK of fk is not in unicharset
Other case GY of gy is not in unicharset
Other case FFI of ffi is not in unicharset
Other case FT of ft is not in unicharset
Other case SH of ſh is not in unicharset
Other case SI of ſi is not in unicharset
Other case SS of ſs is not in unicharset
Other case SJ of ſj is not in unicharset
Writing unicharset to file lat.earlyunicharset
rm unicharset
mftraining -F font_properties -U lat.earlyunicharset -O lat.unicharset lat*tr
Warning: No shape table file present: shapetable
Reading lat.EBGaramond.exp-1.tr ...
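For what it's worth, once mftraining finishes, the rest of the classic 3.0x pipeline is roughly the following. This is a sketch after the Tesseract 3.x training documentation, not the exact commands used here; the lat.* file names follow the convention in the log above.

```shell
# Remaining 3.0x training steps after a successful mftraining run
# (sketch; assumes the lat.* naming from the log above).
shapeclustering -F font_properties -U lat.unicharset lat.*.tr
mftraining -F font_properties -U lat.unicharset -O lat.unicharset lat.*.tr
cntraining lat.*.tr

# The tools write unprefixed files; rename and pack them:
mv inttemp lat.inttemp
mv pffmtable lat.pffmtable
mv shapetable lat.shapetable
mv normproto lat.normproto
combine_tessdata lat.
```

If the run died before any of these wrote their outputs (inttemp, pffmtable, shapetable, normproto), that would explain why no traineddata file was produced.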

Unfortunately I could not see an error message explaining why lat.traineddata was not written, if there was one, because I ran it in an SSH session on my MacBook Air connected to a remote server. I should capture the output more reliably next time.

NB: Normalizing the long s to round s does not improve the measures. It seems Tesseract has problems with the segmentation and recognition of italic fonts with such fine strokes. This m-n-u confusion illustrates it:

$ perl -e 'use String::Similarity;print similarity("eonnnnnicaio","communicato"),"\n";'
0.521739130434783
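The same ratio can be cross-checked with Python's difflib, which uses the same 2*matches/(len(a)+len(b)) definition (assumes python3 is on the PATH):

```shell
# Cross-check the Perl String::Similarity score with Python's difflib.
python3 -c '
from difflib import SequenceMatcher
print(round(SequenceMatcher(None, "eonnnnnicaio", "communicato").ratio(), 4))
'
# 0.5217
```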

This is nearly impossible to spell-correct automagically.