tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0
620 stars · 181 forks

Decreased recognition when training from the existing Tesseract model ara #213

Closed M3ssman closed 3 years ago

M3ssman commented 3 years ago

Hello,

we're trying to improve the existing Model for arabic (https://github.com/tesseract-ocr/tessdata_best/blob/master/ara.traineddata) with some additional training data.

Right from the start, the existing ara.traineddata performs rather badly, with an error rate of about 20% against a ground-truth set of ca. 100 pages (txt format). We therefore created training data of about 3,500 lines. This improved recognition by about 2-3% at best, which is surely not sufficient, so we went looking for more training data.

This way we came across https://github.com/OpenITI/TrainingData/tree/master/JSTORArabic from OpenITI, a repository containing about 8,000 text lines. If we train on this data (again starting from ara.traineddata), the resulting model after 10,000 iterations achieves just 6.575% on our test set. That is the recognition rate, yes! The CER is more than 94%.
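For reference, CER figures like the ones quoted here can be reproduced with a plain Levenshtein distance over the ground-truth length. This is a minimal sketch of the standard metric, not the exact tool we used:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(gt: str, ocr: str) -> float:
    """Character error rate: edits needed, relative to ground-truth length."""
    return levenshtein(gt, ocr) / max(len(gt), 1)

print(cer("hello world", "hella w0rld"))  # 2 edits / 11 chars ≈ 0.1818
```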

I guess increasing the number of iterations wouldn't do any good, since the reported error rates are rather depressing:

At iteration 100/100/106, Mean rms=3.551%, delta=26.757%, char train=52.921%, word train=97.572%, skip ratio=6%,  New best char error = 52.921 wrote best model:/home/hartwig/Projekte/work/mlu/ulb/ulb-dd-ocr-training-fid/tesstrain/data/gt4ara/checkpoints/gt4ara52.921_100.checkpoint wrote checkpoint.
At iteration 200/200/212, Mean rms=3.884%, delta=34.202%, char train=75.792%, word train=98.75%, skip ratio=6%,  New worst char error = 75.792 wrote checkpoint.
At iteration 300/300/322, Mean rms=3.994%, delta=36.895%, char train=83.813%, word train=99.167%, skip ratio=7.333%,  New worst char error = 83.813 wrote checkpoint.
At iteration 400/400/427, Mean rms=4.044%, delta=38.596%, char train=87.985%, word train=99.375%, skip ratio=6.75%,  New worst char error = 87.985 wrote checkpoint.
At iteration 500/500/531, Mean rms=4.079%, delta=39.889%, char train=90.388%, word train=99.5%, skip ratio=6.2%,  New worst char error = 90.388 wrote checkpoint.
At iteration 600/600/636, Mean rms=4.101%, delta=40.687%, char train=91.99%, word train=99.583%, skip ratio=6%,  New worst char error = 91.99 wrote checkpoint.
At iteration 700/700/745, Mean rms=4.117%, delta=41.43%, char train=93.134%, word train=99.643%, skip ratio=6.429%,  New worst char error = 93.134 wrote checkpoint.
At iteration 800/800/849, Mean rms=4.126%, delta=41.889%, char train=94.117%, word train=99.688%, skip ratio=6.125%,  New worst char error = 94.117 wrote checkpoint.
At iteration 900/900/955, Mean rms=4.129%, delta=42.14%, char train=94.827%, word train=99.722%, skip ratio=6.111%,  New worst char error = 94.827 wrote checkpoint.
At iteration 1000/1000/1060, Mean rms=4.134%, delta=42.306%, char train=95.344%, word train=99.75%, skip ratio=6%,  New worst char error = 95.344 wrote checkpoint.
At iteration 1100/1100/1170, Mean rms=4.191%, delta=43.938%, char train=100.052%, word train=99.993%, skip ratio=6.4%,  New worst char error = 100.052 wrote checkpoint.
At iteration 1200/1200/1274, Mean rms=4.183%, delta=44.11%, char train=100.186%, word train=100%, skip ratio=6.2%,  New worst char error = 100.186 wrote checkpoint.
At iteration 1300/1300/1381, Mean rms=4.17%, delta=44.068%, char train=100.267%, word train=100%, skip ratio=5.9%,  New worst char error = 100.267Previous test incomplete, skipping test at iteration1200 wrote checkpoint.
At iteration 1400/1400/1486, Mean rms=4.173%, delta=44.281%, char train=100.217%, word train=100%, skip ratio=5.9%,  wrote checkpoint.
At iteration 1500/1500/1589, Mean rms=4.167%, delta=44.203%, char train=100.283%, word train=100%, skip ratio=5.8%,  New worst char error = 100.283Previous test incomplete, skipping test at iteration1300 wrote checkpoint.
At iteration 1600/1600/1696, Mean rms=4.158%, delta=44.037%, char train=100.35%, word train=100%, skip ratio=6%,  New worst char error = 100.35At iteration 1100, stage 0, Eval Char error rate=100.12453, Word error rate=100 wrote checkpoint.
At iteration 1700/1700/1804, Mean rms=4.153%, delta=43.845%, char train=100.4%, word train=100%, skip ratio=5.9%,  New worst char error = 100.4Previous test incomplete, skipping test at iteration1600 wrote checkpoint.
At iteration 1800/1800/1907, Mean rms=4.15%, delta=43.673%, char train=100.323%, word train=100%, skip ratio=5.8%,  wrote checkpoint.
At iteration 1900/1900/2013, Mean rms=4.149%, delta=43.717%, char train=100.298%, word train=100%, skip ratio=5.8%,  wrote checkpoint.
At iteration 2000/2000/2118, Mean rms=4.151%, delta=43.872%, char train=100.437%, word train=100%, skip ratio=5.8%,  New worst char error = 100.437At iteration 1500, stage 0, Eval Char error rate=100.47584, Word error rate=100 wrote checkpoint.
At iteration 2100/2100/2220, Mean rms=4.16%, delta=44.108%, char train=100.594%, word train=100%, skip ratio=5%,  New worst char error = 100.594Previous test incomplete, skipping test at iteration2000 wrote checkpoint.
At iteration 2200/2200/2328, Mean rms=4.157%, delta=44.005%, char train=100.71%, word train=100%, skip ratio=5.4%,  New worst char error = 100.71Previous test incomplete, skipping test at iteration2100 wrote checkpoint.
At iteration 2300/2300/2430, Mean rms=4.154%, delta=43.994%, char train=100.661%, word train=99.961%, skip ratio=4.9%,  wrote checkpoint.
At iteration 2400/2400/2533, Mean rms=4.147%, delta=43.746%, char train=100.775%, word train=99.961%, skip ratio=4.7%,  New worst char error = 100.775At iteration 1700, stage 0, Eval Char error rate=100.5045, Word error rate=100 wrote checkpoint.
At iteration 2500/2500/2638, Mean rms=4.141%, delta=43.554%, char train=100.783%, word train=99.877%, skip ratio=4.9%,  New worst char error = 100.783Previous test incomplete, skipping test at iteration2400 wrote checkpoint.
At iteration 2600/2600/2741, Mean rms=4.144%, delta=43.655%, char train=100.89%, word train=99.867%, skip ratio=4.5%,  New worst char error = 100.89Previous test incomplete, skipping test at iteration2500 wrote checkpoint.
At iteration 2700/2700/2852, Mean rms=4.145%, delta=43.796%, char train=100.882%, word train=99.803%, skip ratio=4.8%,  wrote checkpoint.
At iteration 2800/2800/2960, Mean rms=4.139%, delta=43.766%, char train=101.002%, word train=99.73%, skip ratio=5.3%,  New worst char error = 101.002At iteration 2200, stage 0, Eval Char error rate=100.30872, Word error rate=99.875467 wrote checkpoint.
At iteration 2900/2900/3064, Mean rms=4.136%, delta=43.651%, char train=101.058%, word train=99.73%, skip ratio=5.1%,  New worst char error = 101.058Previous test incomplete, skipping test at iteration2800 wrote checkpoint.
At iteration 3000/3000/3169, Mean rms=4.133%, delta=43.606%, char train=100.927%, word train=99.707%, skip ratio=5.1%,  wrote checkpoint.
At iteration 3100/3100/3271, Mean rms=4.126%, delta=43.381%, char train=100.99%, word train=99.668%, skip ratio=5.1%,  wrote checkpoint.
At iteration 3200/3200/3373, Mean rms=4.131%, delta=43.581%, char train=100.857%, word train=99.668%, skip ratio=4.5%,  wrote checkpoint.
At iteration 3300/3300/3479, Mean rms=4.137%, delta=43.828%, char train=100.833%, word train=99.707%, skip ratio=4.9%,  wrote checkpoint.
At iteration 3400/3400/3590, Mean rms=4.139%, delta=43.944%, char train=100.639%, word train=99.687%, skip ratio=5.7%,  wrote checkpoint.
At iteration 3500/3500/3694, Mean rms=4.148%, delta=44.289%, char train=100.486%, word train=99.75%, skip ratio=5.6%,  wrote checkpoint.
At iteration 3600/3600/3800, Mean rms=4.145%, delta=44.249%, char train=100.335%, word train=99.552%, skip ratio=5.9%,  wrote checkpoint.
At iteration 3700/3700/3902, Mean rms=4.139%, delta=44.184%, char train=100.231%, word train=99.446%, skip ratio=5%,  wrote checkpoint.
At iteration 3800/3800/4009, Mean rms=4.142%, delta=44.218%, char train=100.01%, word train=99.331%, skip ratio=4.9%,  wrote checkpoint.
At iteration 3900/3900/4115, Mean rms=4.144%, delta=44.237%, char train=99.734%, word train=99.172%, skip ratio=5.1%,  wrote checkpoint.
At iteration 4000/4000/4218, Mean rms=4.149%, delta=44.362%, char train=99.694%, word train=99.183%, skip ratio=4.9%,  wrote checkpoint.
At iteration 4100/4100/4323, Mean rms=4.148%, delta=44.448%, char train=99.468%, word train=99.07%, skip ratio=5.2%,  wrote checkpoint.
At iteration 4200/4200/4429, Mean rms=4.148%, delta=44.465%, char train=99.479%, word train=98.953%, skip ratio=5.6%,  wrote checkpoint.
At iteration 4300/4300/4532, Mean rms=4.143%, delta=44.299%, char train=99.545%, word train=98.916%, skip ratio=5.3%,  wrote checkpoint.
At iteration 4400/4400/4638, Mean rms=4.135%, delta=44.073%, char train=99.652%, word train=98.761%, skip ratio=4.8%,  wrote checkpoint.
At iteration 4500/4500/4742, Mean rms=4.133%, delta=43.927%, char train=99.721%, word train=98.706%, skip ratio=4.8%,  wrote checkpoint.
At iteration 4600/4600/4848, Mean rms=4.135%, delta=43.959%, char train=99.595%, word train=98.887%, skip ratio=4.8%,  wrote checkpoint.
At iteration 4700/4700/4954, Mean rms=4.138%, delta=43.839%, char train=99.536%, word train=98.98%, skip ratio=5.2%,  wrote checkpoint.
At iteration 4800/4800/5057, Mean rms=4.142%, delta=44.035%, char train=99.575%, word train=99.147%, skip ratio=4.8%,  wrote checkpoint.
At iteration 4900/4900/5163, Mean rms=4.145%, delta=44.12%, char train=99.596%, word train=99.201%, skip ratio=4.8%,  wrote checkpoint.
At iteration 5000/5000/5270, Mean rms=4.138%, delta=43.965%, char train=99.494%, word train=99.143%, skip ratio=5.2%,  wrote checkpoint.
At iteration 5100/5100/5370, Mean rms=4.134%, delta=43.829%, char train=99.382%, word train=99.177%, skip ratio=4.7%,  wrote checkpoint.
At iteration 5200/5200/5481, Mean rms=4.136%, delta=43.899%, char train=99.49%, word train=99.168%, skip ratio=5.2%,  wrote checkpoint.
At iteration 5300/5300/5586, Mean rms=4.134%, delta=43.705%, char train=99.306%, word train=99.164%, skip ratio=5.4%,  wrote checkpoint.
At iteration 5400/5400/5692, Mean rms=4.141%, delta=43.94%, char train=99.312%, word train=99.294%, skip ratio=5.4%,  wrote checkpoint.
At iteration 5500/5500/5800, Mean rms=4.133%, delta=43.764%, char train=99.425%, word train=99.305%, skip ratio=5.8%,  wrote checkpoint.
At iteration 5600/5600/5908, Mean rms=4.121%, delta=43.498%, char train=99.3%, word train=99.321%, skip ratio=6%,  wrote checkpoint.
At iteration 5700/5700/6015, Mean rms=4.116%, delta=43.338%, char train=99.397%, word train=99.399%, skip ratio=6.1%,  wrote checkpoint.
At iteration 5800/5800/6120, Mean rms=4.102%, delta=42.913%, char train=99.6%, word train=99.385%, skip ratio=6.3%,  wrote checkpoint.
At iteration 5900/5900/6226, Mean rms=4.091%, delta=42.585%, char train=99.651%, word train=99.443%, skip ratio=6.3%,  wrote checkpoint.
At iteration 6000/6000/6330, Mean rms=4.089%, delta=42.484%, char train=99.813%, word train=99.499%, skip ratio=6%,  wrote checkpoint.
At iteration 6100/6100/6436, Mean rms=4.089%, delta=42.464%, char train=100.107%, word train=99.586%, skip ratio=6.6%,  wrote checkpoint.
At iteration 6200/6200/6542, Mean rms=4.077%, delta=42.096%, char train=100.168%, word train=99.548%, skip ratio=6.1%,  wrote checkpoint.
At iteration 6300/6300/6647, Mean rms=4.08%, delta=42.266%, char train=100.419%, word train=99.51%, skip ratio=6.1%,  wrote checkpoint.
At iteration 6400/6400/6752, Mean rms=4.077%, delta=42.187%, char train=100.048%, word train=99.41%, skip ratio=6%,  wrote checkpoint.
At iteration 6500/6500/6859, Mean rms=4.076%, delta=42.108%, char train=99.679%, word train=99.418%, skip ratio=5.9%,  wrote checkpoint.
At iteration 6600/6600/6970, Mean rms=4.081%, delta=42.207%, char train=99.785%, word train=99.346%, skip ratio=6.2%,  wrote checkpoint.
At iteration 6700/6700/7079, Mean rms=4.071%, delta=42.103%, char train=99.803%, word train=99.263%, skip ratio=6.4%,  wrote checkpoint.
At iteration 6800/6800/7183, Mean rms=4.085%, delta=42.341%, char train=99.475%, word train=99.291%, skip ratio=6.3%,  wrote checkpoint.
At iteration 6900/6900/7289, Mean rms=4.084%, delta=42.431%, char train=99.601%, word train=99.328%, skip ratio=6.3%,  wrote checkpoint.
At iteration 7000/7000/7393, Mean rms=4.087%, delta=42.49%, char train=99.416%, word train=99.317%, skip ratio=6.3%,  wrote checkpoint.
At iteration 7100/7100/7503, Mean rms=4.082%, delta=42.324%, char train=98.974%, word train=99.349%, skip ratio=6.7%,  wrote checkpoint.
At iteration 7200/7200/7608, Mean rms=4.084%, delta=42.438%, char train=98.744%, word train=99.493%, skip ratio=6.6%,  wrote checkpoint.
At iteration 7300/7300/7712, Mean rms=4.09%, delta=42.561%, char train=98.512%, word train=99.571%, skip ratio=6.5%,  wrote checkpoint.
At iteration 7400/7400/7819, Mean rms=4.094%, delta=42.664%, char train=98.767%, word train=99.578%, skip ratio=6.7%,  wrote checkpoint.
At iteration 7500/7500/7926, Mean rms=4.097%, delta=42.695%, char train=98.876%, word train=99.634%, skip ratio=6.7%,  wrote checkpoint.
At iteration 7600/7600/8031, Mean rms=4.103%, delta=42.815%, char train=98.873%, word train=99.674%, skip ratio=6.1%,  wrote checkpoint.
At iteration 7700/7700/8136, Mean rms=4.104%, delta=42.788%, char train=99.083%, word train=99.701%, skip ratio=5.7%,  wrote checkpoint.
At iteration 7800/7800/8242, Mean rms=4.102%, delta=42.705%, char train=99.171%, word train=99.706%, skip ratio=5.9%,  wrote checkpoint.
At iteration 7900/7900/8351, Mean rms=4.11%, delta=42.824%, char train=99.208%, word train=99.7%, skip ratio=6.2%,  wrote checkpoint.
At iteration 8000/8000/8458, Mean rms=4.102%, delta=42.554%, char train=99.44%, word train=99.615%, skip ratio=6.5%,  wrote checkpoint.
At iteration 8100/8100/8562, Mean rms=4.098%, delta=42.474%, char train=99.523%, word train=99.554%, skip ratio=5.9%,  wrote checkpoint.
At iteration 8200/8200/8667, Mean rms=4.098%, delta=42.384%, char train=99.402%, word train=99.559%, skip ratio=5.9%,  wrote checkpoint.
At iteration 8300/8300/8772, Mean rms=4.096%, delta=42.411%, char train=99.509%, word train=99.559%, skip ratio=6%,  wrote checkpoint.
At iteration 8400/8400/8875, Mean rms=4.087%, delta=42.151%, char train=99.518%, word train=99.69%, skip ratio=5.6%,  wrote checkpoint.
At iteration 8500/8500/8983, Mean rms=4.083%, delta=42.059%, char train=99.541%, word train=99.625%, skip ratio=5.7%,  wrote checkpoint.
At iteration 8600/8600/9090, Mean rms=4.084%, delta=42.029%, char train=99.476%, word train=99.655%, skip ratio=5.9%,  wrote checkpoint.
At iteration 8700/8700/9193, Mean rms=4.093%, delta=42.315%, char train=99.205%, word train=99.712%, skip ratio=5.7%,  wrote checkpoint.
At iteration 8800/8800/9299, Mean rms=4.091%, delta=42.294%, char train=99.158%, word train=99.712%, skip ratio=5.7%,  wrote checkpoint.
At iteration 8900/8900/9404, Mean rms=4.088%, delta=42.305%, char train=98.977%, word train=99.689%, skip ratio=5.3%,  wrote checkpoint.
At iteration 9000/9000/9509, Mean rms=4.092%, delta=42.42%, char train=99.033%, word train=99.72%, skip ratio=5.1%,  wrote checkpoint.
At iteration 9100/9100/9613, Mean rms=4.092%, delta=42.345%, char train=99.031%, word train=99.699%, skip ratio=5.1%,  wrote checkpoint.
At iteration 9200/9200/9716, Mean rms=4.087%, delta=42.185%, char train=99.069%, word train=99.713%, skip ratio=4.9%,  wrote checkpoint.
At iteration 9300/9300/9818, Mean rms=4.081%, delta=41.897%, char train=98.873%, word train=99.643%, skip ratio=4.6%,  wrote checkpoint.
At iteration 9400/9400/9924, Mean rms=4.078%, delta=41.745%, char train=98.903%, word train=99.621%, skip ratio=4.9%,  wrote checkpoint.
At iteration 9500/9500/10032, Mean rms=4.081%, delta=41.896%, char train=98.824%, word train=99.607%, skip ratio=4.9%,  wrote checkpoint.
At iteration 9600/9600/10141, Mean rms=4.071%, delta=41.748%, char train=99.062%, word train=99.608%, skip ratio=5.1%,  wrote checkpoint.
At iteration 9700/9700/10246, Mean rms=4.065%, delta=41.575%, char train=98.917%, word train=99.576%, skip ratio=5.3%,  wrote checkpoint.
At iteration 9800/9800/10352, Mean rms=4.063%, delta=41.608%, char train=98.708%, word train=99.462%, skip ratio=5.3%,  wrote checkpoint.
At iteration 9900/9900/10455, Mean rms=4.06%, delta=41.485%, char train=98.831%, word train=99.502%, skip ratio=5.1%,  wrote checkpoint.
At iteration 10000/10000/10557, Mean rms=4.059%, delta=41.41%, char train=98.633%, word train=99.572%, skip ratio=4.8%,  wrote checkpoint.

It looks like training starts with an already bad error rate, which then deteriorates very fast and never recovers. Only one best model checkpoint is ever written.
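To see the divergence numerically rather than eyeballing the wall of log output, the char-train series can be extracted with a small script. This is a sketch; the regex simply matches the lstmtraining line format shown above:

```python
import re

# Matches e.g. "At iteration 100/100/106, ... char train=52.921%, ..."
LINE = re.compile(r"At iteration (\d+)/\d+/\d+.*?char train=([\d.]+)%")

log = """At iteration 100/100/106, Mean rms=3.551%, delta=26.757%, char train=52.921%, word train=97.572%, skip ratio=6%
At iteration 1100/1100/1170, Mean rms=4.191%, delta=43.938%, char train=100.052%, word train=99.993%, skip ratio=6.4%"""

# (learning iteration, char train error %) pairs
series = [(int(m.group(1)), float(m.group(2)))
          for m in map(LINE.match, log.splitlines()) if m]
print(series)  # [(100, 52.921), (1100, 100.052)]
```

Running this over the full log shows char error climbing from ~53% to over 100% within the first 1,100 iterations and staying there.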

Has anybody tried out this training data yet?

Shreeshrii commented 3 years ago

Please see https://github.com/tesseract-ocr/tesstrain/issues/128

Some time back I tested Arabic training using the OCR_GS data, which is now part of OpenITI.

If you use --debug_interval=-1 you can see the details of each iteration. That will help in identifying issues with the data.

On Wed, Dec 16, 2020, 13:31 Uwe Hartwig notifications@github.com wrote:

Hello,

we're trying to improve the existing Model for arabic ( https://github.com/tesseract-ocr/tessdata_best/blob/master/ara.traineddata) with some additional training data.

Right from the start, the existing ara.traineddata performs rather bad, with an error rate about 20% against a gt-set of ca. 100 pages (txt format). Therefore we created raining data of about 3.500 lines. This improved recognition about 2-3 % at best. This is surely not sufficient, so we were looking for more training data.

This way we came across https://github.com/OpenITI/TrainingData/tree/master/JSTORArabic from Open ITI. This repository contains about 8.000 text lines. If we train using this (but starting again with ara.traineddata) the resulting model after 10.000 iterations performs on our test set with a correction rate of 6.575 %. This is the correction rate, yes! CER is more than 94 %.

I guess enhancing the iterations wouldnt do any good, since the output of error rates is rather depressing:

At iteration 100/100/106, Mean rms=3.551%, delta=26.757%, char train=52.921%, word train=97.572%, skip ratio=6%, New best char error = 52.921 wrote best model:/home/hartwig/Projekte/work/mlu/ulb/ulb-dd-ocr-training-fid/tesstrain/data/gt4ara/checkpoints/gt4ara52.921_100.checkpoint wrote checkpoint. At iteration 200/200/212, Mean rms=3.884%, delta=34.202%, char train=75.792%, word train=98.75%, skip ratio=6%, New worst char error = 75.792 wrote checkpoint. At iteration 300/300/322, Mean rms=3.994%, delta=36.895%, char train=83.813%, word train=99.167%, skip ratio=7.333%, New worst char error = 83.813 wrote checkpoint. At iteration 400/400/427, Mean rms=4.044%, delta=38.596%, char train=87.985%, word train=99.375%, skip ratio=6.75%, New worst char error = 87.985 wrote checkpoint. At iteration 500/500/531, Mean rms=4.079%, delta=39.889%, char train=90.388%, word train=99.5%, skip ratio=6.2%, New worst char error = 90.388 wrote checkpoint. At iteration 600/600/636, Mean rms=4.101%, delta=40.687%, char train=91.99%, word train=99.583%, skip ratio=6%, New worst char error = 91.99 wrote checkpoint. At iteration 700/700/745, Mean rms=4.117%, delta=41.43%, char train=93.134%, word train=99.643%, skip ratio=6.429%, New worst char error = 93.134 wrote checkpoint. At iteration 800/800/849, Mean rms=4.126%, delta=41.889%, char train=94.117%, word train=99.688%, skip ratio=6.125%, New worst char error = 94.117 wrote checkpoint. At iteration 900/900/955, Mean rms=4.129%, delta=42.14%, char train=94.827%, word train=99.722%, skip ratio=6.111%, New worst char error = 94.827 wrote checkpoint. At iteration 1000/1000/1060, Mean rms=4.134%, delta=42.306%, char train=95.344%, word train=99.75%, skip ratio=6%, New worst char error = 95.344 wrote checkpoint. At iteration 1100/1100/1170, Mean rms=4.191%, delta=43.938%, char train=100.052%, word train=99.993%, skip ratio=6.4%, New worst char error = 100.052 wrote checkpoint. 
At iteration 1200/1200/1274, Mean rms=4.183%, delta=44.11%, char train=100.186%, word train=100%, skip ratio=6.2%, New worst char error = 100.186 wrote checkpoint. At iteration 1300/1300/1381, Mean rms=4.17%, delta=44.068%, char train=100.267%, word train=100%, skip ratio=5.9%, New worst char error = 100.267Previous test incomplete, skipping test at iteration1200 wrote checkpoint. At iteration 1400/1400/1486, Mean rms=4.173%, delta=44.281%, char train=100.217%, word train=100%, skip ratio=5.9%, wrote checkpoint. At iteration 1500/1500/1589, Mean rms=4.167%, delta=44.203%, char train=100.283%, word train=100%, skip ratio=5.8%, New worst char error = 100.283Previous test incomplete, skipping test at iteration1300 wrote checkpoint. At iteration 1600/1600/1696, Mean rms=4.158%, delta=44.037%, char train=100.35%, word train=100%, skip ratio=6%, New worst char error = 100.35At iteration 1100, stage 0, Eval Char error rate=100.12453, Word error rate=100 wrote checkpoint. At iteration 1700/1700/1804, Mean rms=4.153%, delta=43.845%, char train=100.4%, word train=100%, skip ratio=5.9%, New worst char error = 100.4Previous test incomplete, skipping test at iteration1600 wrote checkpoint. At iteration 1800/1800/1907, Mean rms=4.15%, delta=43.673%, char train=100.323%, word train=100%, skip ratio=5.8%, wrote checkpoint. At iteration 1900/1900/2013, Mean rms=4.149%, delta=43.717%, char train=100.298%, word train=100%, skip ratio=5.8%, wrote checkpoint. At iteration 2000/2000/2118, Mean rms=4.151%, delta=43.872%, char train=100.437%, word train=100%, skip ratio=5.8%, New worst char error = 100.437At iteration 1500, stage 0, Eval Char error rate=100.47584, Word error rate=100 wrote checkpoint. At iteration 2100/2100/2220, Mean rms=4.16%, delta=44.108%, char train=100.594%, word train=100%, skip ratio=5%, New worst char error = 100.594Previous test incomplete, skipping test at iteration2000 wrote checkpoint. 
At iteration 2200/2200/2328, Mean rms=4.157%, delta=44.005%, char train=100.71%, word train=100%, skip ratio=5.4%, New worst char error = 100.71Previous test incomplete, skipping test at iteration2100 wrote checkpoint. At iteration 2300/2300/2430, Mean rms=4.154%, delta=43.994%, char train=100.661%, word train=99.961%, skip ratio=4.9%, wrote checkpoint. At iteration 2400/2400/2533, Mean rms=4.147%, delta=43.746%, char train=100.775%, word train=99.961%, skip ratio=4.7%, New worst char error = 100.775At iteration 1700, stage 0, Eval Char error rate=100.5045, Word error rate=100 wrote checkpoint. At iteration 2500/2500/2638, Mean rms=4.141%, delta=43.554%, char train=100.783%, word train=99.877%, skip ratio=4.9%, New worst char error = 100.783Previous test incomplete, skipping test at iteration2400 wrote checkpoint. At iteration 2600/2600/2741, Mean rms=4.144%, delta=43.655%, char train=100.89%, word train=99.867%, skip ratio=4.5%, New worst char error = 100.89Previous test incomplete, skipping test at iteration2500 wrote checkpoint. At iteration 2700/2700/2852, Mean rms=4.145%, delta=43.796%, char train=100.882%, word train=99.803%, skip ratio=4.8%, wrote checkpoint. At iteration 2800/2800/2960, Mean rms=4.139%, delta=43.766%, char train=101.002%, word train=99.73%, skip ratio=5.3%, New worst char error = 101.002At iteration 2200, stage 0, Eval Char error rate=100.30872, Word error rate=99.875467 wrote checkpoint. At iteration 2900/2900/3064, Mean rms=4.136%, delta=43.651%, char train=101.058%, word train=99.73%, skip ratio=5.1%, New worst char error = 101.058Previous test incomplete, skipping test at iteration2800 wrote checkpoint. At iteration 3000/3000/3169, Mean rms=4.133%, delta=43.606%, char train=100.927%, word train=99.707%, skip ratio=5.1%, wrote checkpoint. At iteration 3100/3100/3271, Mean rms=4.126%, delta=43.381%, char train=100.99%, word train=99.668%, skip ratio=5.1%, wrote checkpoint. 
At iteration 3200/3200/3373, Mean rms=4.131%, delta=43.581%, char train=100.857%, word train=99.668%, skip ratio=4.5%, wrote checkpoint. At iteration 3300/3300/3479, Mean rms=4.137%, delta=43.828%, char train=100.833%, word train=99.707%, skip ratio=4.9%, wrote checkpoint. At iteration 3400/3400/3590, Mean rms=4.139%, delta=43.944%, char train=100.639%, word train=99.687%, skip ratio=5.7%, wrote checkpoint. At iteration 3500/3500/3694, Mean rms=4.148%, delta=44.289%, char train=100.486%, word train=99.75%, skip ratio=5.6%, wrote checkpoint. At iteration 3600/3600/3800, Mean rms=4.145%, delta=44.249%, char train=100.335%, word train=99.552%, skip ratio=5.9%, wrote checkpoint. At iteration 3700/3700/3902, Mean rms=4.139%, delta=44.184%, char train=100.231%, word train=99.446%, skip ratio=5%, wrote checkpoint. At iteration 3800/3800/4009, Mean rms=4.142%, delta=44.218%, char train=100.01%, word train=99.331%, skip ratio=4.9%, wrote checkpoint. At iteration 3900/3900/4115, Mean rms=4.144%, delta=44.237%, char train=99.734%, word train=99.172%, skip ratio=5.1%, wrote checkpoint. At iteration 4000/4000/4218, Mean rms=4.149%, delta=44.362%, char train=99.694%, word train=99.183%, skip ratio=4.9%, wrote checkpoint. At iteration 4100/4100/4323, Mean rms=4.148%, delta=44.448%, char train=99.468%, word train=99.07%, skip ratio=5.2%, wrote checkpoint. At iteration 4200/4200/4429, Mean rms=4.148%, delta=44.465%, char train=99.479%, word train=98.953%, skip ratio=5.6%, wrote checkpoint. At iteration 4300/4300/4532, Mean rms=4.143%, delta=44.299%, char train=99.545%, word train=98.916%, skip ratio=5.3%, wrote checkpoint. At iteration 4400/4400/4638, Mean rms=4.135%, delta=44.073%, char train=99.652%, word train=98.761%, skip ratio=4.8%, wrote checkpoint. At iteration 4500/4500/4742, Mean rms=4.133%, delta=43.927%, char train=99.721%, word train=98.706%, skip ratio=4.8%, wrote checkpoint. 
At iteration 4600/4600/4848, Mean rms=4.135%, delta=43.959%, char train=99.595%, word train=98.887%, skip ratio=4.8%, wrote checkpoint. At iteration 4700/4700/4954, Mean rms=4.138%, delta=43.839%, char train=99.536%, word train=98.98%, skip ratio=5.2%, wrote checkpoint. At iteration 4800/4800/5057, Mean rms=4.142%, delta=44.035%, char train=99.575%, word train=99.147%, skip ratio=4.8%, wrote checkpoint. At iteration 4900/4900/5163, Mean rms=4.145%, delta=44.12%, char train=99.596%, word train=99.201%, skip ratio=4.8%, wrote checkpoint. At iteration 5000/5000/5270, Mean rms=4.138%, delta=43.965%, char train=99.494%, word train=99.143%, skip ratio=5.2%, wrote checkpoint. At iteration 5100/5100/5370, Mean rms=4.134%, delta=43.829%, char train=99.382%, word train=99.177%, skip ratio=4.7%, wrote checkpoint. At iteration 5200/5200/5481, Mean rms=4.136%, delta=43.899%, char train=99.49%, word train=99.168%, skip ratio=5.2%, wrote checkpoint. At iteration 5300/5300/5586, Mean rms=4.134%, delta=43.705%, char train=99.306%, word train=99.164%, skip ratio=5.4%, wrote checkpoint. At iteration 5400/5400/5692, Mean rms=4.141%, delta=43.94%, char train=99.312%, word train=99.294%, skip ratio=5.4%, wrote checkpoint. At iteration 5500/5500/5800, Mean rms=4.133%, delta=43.764%, char train=99.425%, word train=99.305%, skip ratio=5.8%, wrote checkpoint. At iteration 5600/5600/5908, Mean rms=4.121%, delta=43.498%, char train=99.3%, word train=99.321%, skip ratio=6%, wrote checkpoint. At iteration 5700/5700/6015, Mean rms=4.116%, delta=43.338%, char train=99.397%, word train=99.399%, skip ratio=6.1%, wrote checkpoint. At iteration 5800/5800/6120, Mean rms=4.102%, delta=42.913%, char train=99.6%, word train=99.385%, skip ratio=6.3%, wrote checkpoint. At iteration 5900/5900/6226, Mean rms=4.091%, delta=42.585%, char train=99.651%, word train=99.443%, skip ratio=6.3%, wrote checkpoint. 
At iteration 6000/6000/6330, Mean rms=4.089%, delta=42.484%, char train=99.813%, word train=99.499%, skip ratio=6%, wrote checkpoint. At iteration 6100/6100/6436, Mean rms=4.089%, delta=42.464%, char train=100.107%, word train=99.586%, skip ratio=6.6%, wrote checkpoint. At iteration 6200/6200/6542, Mean rms=4.077%, delta=42.096%, char train=100.168%, word train=99.548%, skip ratio=6.1%, wrote checkpoint. At iteration 6300/6300/6647, Mean rms=4.08%, delta=42.266%, char train=100.419%, word train=99.51%, skip ratio=6.1%, wrote checkpoint. At iteration 6400/6400/6752, Mean rms=4.077%, delta=42.187%, char train=100.048%, word train=99.41%, skip ratio=6%, wrote checkpoint. At iteration 6500/6500/6859, Mean rms=4.076%, delta=42.108%, char train=99.679%, word train=99.418%, skip ratio=5.9%, wrote checkpoint. At iteration 6600/6600/6970, Mean rms=4.081%, delta=42.207%, char train=99.785%, word train=99.346%, skip ratio=6.2%, wrote checkpoint. At iteration 6700/6700/7079, Mean rms=4.071%, delta=42.103%, char train=99.803%, word train=99.263%, skip ratio=6.4%, wrote checkpoint. At iteration 6800/6800/7183, Mean rms=4.085%, delta=42.341%, char train=99.475%, word train=99.291%, skip ratio=6.3%, wrote checkpoint. At iteration 6900/6900/7289, Mean rms=4.084%, delta=42.431%, char train=99.601%, word train=99.328%, skip ratio=6.3%, wrote checkpoint. At iteration 7000/7000/7393, Mean rms=4.087%, delta=42.49%, char train=99.416%, word train=99.317%, skip ratio=6.3%, wrote checkpoint. At iteration 7100/7100/7503, Mean rms=4.082%, delta=42.324%, char train=98.974%, word train=99.349%, skip ratio=6.7%, wrote checkpoint. At iteration 7200/7200/7608, Mean rms=4.084%, delta=42.438%, char train=98.744%, word train=99.493%, skip ratio=6.6%, wrote checkpoint. At iteration 7300/7300/7712, Mean rms=4.09%, delta=42.561%, char train=98.512%, word train=99.571%, skip ratio=6.5%, wrote checkpoint. 
At iteration 7400/7400/7819, Mean rms=4.094%, delta=42.664%, char train=98.767%, word train=99.578%, skip ratio=6.7%, wrote checkpoint. At iteration 7500/7500/7926, Mean rms=4.097%, delta=42.695%, char train=98.876%, word train=99.634%, skip ratio=6.7%, wrote checkpoint. At iteration 7600/7600/8031, Mean rms=4.103%, delta=42.815%, char train=98.873%, word train=99.674%, skip ratio=6.1%, wrote checkpoint. At iteration 7700/7700/8136, Mean rms=4.104%, delta=42.788%, char train=99.083%, word train=99.701%, skip ratio=5.7%, wrote checkpoint. At iteration 7800/7800/8242, Mean rms=4.102%, delta=42.705%, char train=99.171%, word train=99.706%, skip ratio=5.9%, wrote checkpoint. At iteration 7900/7900/8351, Mean rms=4.11%, delta=42.824%, char train=99.208%, word train=99.7%, skip ratio=6.2%, wrote checkpoint. At iteration 8000/8000/8458, Mean rms=4.102%, delta=42.554%, char train=99.44%, word train=99.615%, skip ratio=6.5%, wrote checkpoint. At iteration 8100/8100/8562, Mean rms=4.098%, delta=42.474%, char train=99.523%, word train=99.554%, skip ratio=5.9%, wrote checkpoint. At iteration 8200/8200/8667, Mean rms=4.098%, delta=42.384%, char train=99.402%, word train=99.559%, skip ratio=5.9%, wrote checkpoint. At iteration 8300/8300/8772, Mean rms=4.096%, delta=42.411%, char train=99.509%, word train=99.559%, skip ratio=6%, wrote checkpoint. At iteration 8400/8400/8875, Mean rms=4.087%, delta=42.151%, char train=99.518%, word train=99.69%, skip ratio=5.6%, wrote checkpoint. At iteration 8500/8500/8983, Mean rms=4.083%, delta=42.059%, char train=99.541%, word train=99.625%, skip ratio=5.7%, wrote checkpoint. At iteration 8600/8600/9090, Mean rms=4.084%, delta=42.029%, char train=99.476%, word train=99.655%, skip ratio=5.9%, wrote checkpoint. At iteration 8700/8700/9193, Mean rms=4.093%, delta=42.315%, char train=99.205%, word train=99.712%, skip ratio=5.7%, wrote checkpoint. 
At iteration 8800/8800/9299, Mean rms=4.091%, delta=42.294%, char train=99.158%, word train=99.712%, skip ratio=5.7%, wrote checkpoint. At iteration 8900/8900/9404, Mean rms=4.088%, delta=42.305%, char train=98.977%, word train=99.689%, skip ratio=5.3%, wrote checkpoint. At iteration 9000/9000/9509, Mean rms=4.092%, delta=42.42%, char train=99.033%, word train=99.72%, skip ratio=5.1%, wrote checkpoint. At iteration 9100/9100/9613, Mean rms=4.092%, delta=42.345%, char train=99.031%, word train=99.699%, skip ratio=5.1%, wrote checkpoint. At iteration 9200/9200/9716, Mean rms=4.087%, delta=42.185%, char train=99.069%, word train=99.713%, skip ratio=4.9%, wrote checkpoint. At iteration 9300/9300/9818, Mean rms=4.081%, delta=41.897%, char train=98.873%, word train=99.643%, skip ratio=4.6%, wrote checkpoint. At iteration 9400/9400/9924, Mean rms=4.078%, delta=41.745%, char train=98.903%, word train=99.621%, skip ratio=4.9%, wrote checkpoint. At iteration 9500/9500/10032, Mean rms=4.081%, delta=41.896%, char train=98.824%, word train=99.607%, skip ratio=4.9%, wrote checkpoint. At iteration 9600/9600/10141, Mean rms=4.071%, delta=41.748%, char train=99.062%, word train=99.608%, skip ratio=5.1%, wrote checkpoint. At iteration 9700/9700/10246, Mean rms=4.065%, delta=41.575%, char train=98.917%, word train=99.576%, skip ratio=5.3%, wrote checkpoint. At iteration 9800/9800/10352, Mean rms=4.063%, delta=41.608%, char train=98.708%, word train=99.462%, skip ratio=5.3%, wrote checkpoint. At iteration 9900/9900/10455, Mean rms=4.06%, delta=41.485%, char train=98.831%, word train=99.502%, skip ratio=5.1%, wrote checkpoint. At iteration 10000/10000/10557, Mean rms=4.059%, delta=41.41%, char train=98.633%, word train=99.572%, skip ratio=4.8%, wrote checkpoint.

It looks like training starts with a rather bad error rate, then deteriorates very quickly and cannot recover. It only ever writes one best checkpoint.

Has anybody tried this training data out by now?


M3ssman commented 3 years ago

@Shreeshrii Thanks for mentioning the additional debugging flag!

Loaded 1/1 lines (1-1) of document /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/fc87cdb3-3a6d-411c-96fd-f082718fef41.lstmf
Iteration 1003: GROUND  TRUTH : Les
Iteration 1003: ALIGNED TRUTH : LLees
Iteration 1003: BEST OCR TEXT : 
File /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/39107f39-ca6d-48cb-b86d-e190948c03df.lstmf line 0 :
Mean rms=4.128%, delta=42.174%, train=95.689%(99.758%), skip ratio=6%
Encoding of string failed! Failure bytes: ffffffe2 ffffff80 ffffff8e ffffffd8 ffffff8c 20 ffffffd8 ffffffa7 ffffffd9 ffffff84 ffffffd9 ffffff81 ffffffd9 ffffff82 ffffffd9 ffffff87 20 ffffffd8 ffffffb9 ffffffd9 ffffff84 ffffffd9 ffffff89 20 ffffffd8 ffffffa7 ffffffd9 ffffff84 ffffffd9 ffffff85 ffffffd8 ffffffb0 ffffffd8 ffffffa7 ffffffd9 ffffff87 ffffffd8 ffffffa8 20 ffffffd8 ffffffa7 ffffffd9 ffffff84 ffffffd8 ffffffa7 ffffffd8 ffffffb1 ffffffd8 ffffffa8 ffffffd8 ffffffb9 ffffffd8 ffffffa9 20 2e 20 ffffffd8 ffffffa7 ffffffd9 ffffff84 ffffffd9 ffffff82 ffffffd8 ffffffa7 ffffffd9 ffffff87 ffffffd8 ffffffb1 ffffffd8 ffffffa9 20 ffffffd8 ffffff8c 20 ffffffd9 ffffff85 ffffffd8 ffffffb7 ffffffd8 ffffffa7 ffffffd8 ffffffa8 ffffffd8 ffffffb9 20 ffffffd9 ffffff85 ffffffd8 ffffffae ffffffd8 ffffffaa ffffffd9 ffffff84 ffffffd9 ffffff81 ffffffd8 ffffffa9 20 2e 2e 2e
Can't encode transcription: '١٩٣٣ - ١٩٥٧ . ج ١ - ٢ ‎، الفقه على المذاهب الاربعة . القاهرة ، مطابع مختلفة ...' in language ''
Loaded 1/1 lines (1-1) of document /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/f2608627-a417-4789-b91b-1d12d86db2b3.lstmf
Encoding of string failed! Failure bytes: ffffffe2 ffffff80 ffffff8f 2d 20 ffffffd9 ffffff88 ffffffd9 ffffff84 ffffffd9 ffffff8a ffffffd8 ffffffb3 20 ffffffd8 ffffffb9 ffffffd9 ffffff84 ffffffd8 ffffffa7 ffffffd8 ffffffac ffffffd9 ffffff87 ffffffd8 ffffffa7 20 ffffffd8 ffffffb3 ffffffd9 ffffff87 ffffffd9 ffffff84 ffffffd9 ffffff8b ffffffd8 ffffffa7 20 ffffffd9 ffffff8a ffffffd9 ffffff87 ffffffd9 ffffff86 ffffffd8 ffffffa7 20 ffffffd9 ffffff8b ffffffd9 ffffff83 ffffffd9 ffffff85 ffffffd8 ffffffa7 20Loaded 1/1 lines (1-1) of document /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/3e53c491-c906-46d7-b4ad-8d7710f08303.lstmf
 ffffffd9 ffffff86 ffffffd8 ffffffaa ffffffd9 ffffff88 ffffffd9 ffffff87 ffffffd9 ffffff85 20 2e 20 ffffffd9 ffffff81 ffffffd9 ffffff87 ffffffd9 ffffff88
Can't encode transcription: 'بخطر كبير - رغم تفاؤل المتفائلين ‏- وليس علاجهـا سهلًا يهنا ًكما نتوهم . فهو' in language ''
Loaded 1/1 lines (1-1) of document /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/6bc98c3b-60d8-45b3-9e67-b8c7d2f75a66.lstmf
Iteration 1004: GROUND  TRUTH : ثكيرا ًمن سياسة الباب المفتوح الاقتصادية .
Iteration 1004: BEST OCR TEXT : 
File /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/0793c2c0-0ac1-4033-a17a-6dc51ea75d5b.lstmf line 0 :
Mean rms=4.127%, delta=42.157%, train=95.768%(99.758%), skip ratio=6.2%

Since the train ratio reached virtually 100% already at about iteration 1,000 while I keep iterating until 10,000, does this indicate overfitting?

BEST OCR TEXT is empty from pretty early on (I guess around iteration 100). Is this a problem? Doesn't it indicate that lstmtraining couldn't find a line that matched the model within a certain threshold?

The encoding errors occur quite regularly (even within our new training data sets) - is this a problem? Should I drop these datasets?

Shreeshrii commented 3 years ago

What is the command that you used to start training?

Iteration 1003: GROUND  TRUTH : Les
Iteration 1003: ALIGNED TRUTH : LLees
Iteration 1003: BEST OCR TEXT : 

This indicates that your ground truth contains text with [a-zA-Z]. If you want to include those lines for training, use script/Arabic as START_MODEL; otherwise delete all datasets containing English text.
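As a quick sketch (the helper name and sample lines are illustrative, not from the thread), ground-truth lines containing Latin letters could be separated out before generating the .lstmf files:

```python
import re

# Any ASCII Latin letter; lines containing these cannot be encoded by a
# pure-Arabic unicharset such as the one built from ara.traineddata.
LATIN = re.compile(r"[A-Za-z]")

def has_latin(line: str) -> bool:
    """Return True if the ground-truth line contains Latin script."""
    return bool(LATIN.search(line))

# Example: split transcriptions into usable and problematic lines.
lines = ["بخطر كبير", "Mifflin Company, 1947", "١٩٣٣ - ١٩٥٧"]
arabic_only = [l for l in lines if not has_latin(l)]
```

Lines flagged by `has_latin` would either be dropped or kept for a script/Arabic-based training run.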

Can't encode transcription: this seems to be related to Arabic accents (diacritics) which are not supported in ara.traineddata. You can look at the unicharset files.
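For what it's worth, the `Failure bytes` in the logs above start with `e2 80 8e` / `e2 80 8f`, which are the UTF-8 encodings of the invisible directional marks U+200E (LEFT-TO-RIGHT MARK) and U+200F (RIGHT-TO-LEFT MARK) rather than visible accents. A minimal sketch for stripping such control characters from transcriptions before training (the exact set to strip is an assumption; compare against your unicharset first):

```python
# Invisible bidi control characters that commonly sneak into OCR ground
# truth and are absent from most unicharsets: LRM, RLM, LRE, RLE, PDF.
BIDI_CONTROLS = dict.fromkeys(map(ord, "\u200e\u200f\u202a\u202b\u202c"))

def strip_bidi(text: str) -> str:
    """Remove invisible directional marks from a ground-truth line."""
    return text.translate(BIDI_CONTROLS)

cleaned = strip_bidi("نخلة \u200f- ل 170")  # RLM before the hyphen is removed
```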

Shreeshrii commented 3 years ago

See attached log file (first 1000 lines) for the training for Persian/Farsi that I am trying right now.

fasPlus-1000.log

M3ssman commented 3 years ago
lstmtraining \
  --debug_interval 0 \
  --traineddata /home/hartwig/Projekte/work/mlu/ulb/ulb-dd-ocr-training-fid/tesstrain/data/gt4ara/gt4ara.traineddata \
  --old_traineddata /usr/share/tesseract-ocr/4.00/tessdata/ara.traineddata \
  --continue_from data/ara/gt4ara.lstm \
  --learning_rate 0.0001 \
  --model_output /home/hartwig/Projekte/work/mlu/ulb/ulb-dd-ocr-training-fid/tesstrain/data/gt4ara/checkpoints/gt4ara \
  --train_listfile /home/hartwig/Projekte/work/mlu/ulb/ulb-dd-ocr-training-fid/tesstrain/data/gt4ara/list.train \
  --eval_listfile /home/hartwig/Projekte/work/mlu/ulb/ulb-dd-ocr-training-fid/tesstrain/data/gt4ara/list.eval \
  --max_iterations 10000 \
  --target_error_rate 0.01
Loaded file data/ara/gt4ara.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 85 to 194!
Num (Extended) outputs,weights in Series:
  1,48,0,1:1, 0
Num (Extended) outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  Lfys64:64, 20736
  Lfx96:96, 61824
  Lrx96:96, 74112
  Lfx512:512, 1247232
  Fc194:194, 99522
Total weights = 1503586
Previous null char=2 mapped to 2
Continuing from data/ara/gt4ara.lstm

Anyway, I tried Arabic.traineddata, which in the end performed only slightly better: 6.792% (Arabic) vs. 6.575% (ara) after 10,000 iterations.

Please note the final outputs: (Arabic)

Iteration 9995: BEST OCR TEXT : ا
File /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/fe927986-2a28-4ada-a626-02181692279e.lstmf line 0 :
Mean rms=3.16%, delta=41.113%, train=99.125%(98.405%), skip ratio=4.8%
Iteration 9996: GROUND  TRUTH : ٢ - الصابون .
Iteration 9996: ALIGNED TRUTH : ٢  -  االصااببونن .
Iteration 9996: BEST OCR TEXT : ا.
File /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/9e69fb59-285b-4cb3-9ece-f8a1da5f3a40.lstmf line 0 :
Mean rms=3.16%, delta=41.114%, train=99.125%(98.405%), skip ratio=4.8%
Iteration 9997: GROUND  TRUTH : بل كان يستمد من تطوريته الكامنة القوة لاعادة التجانس بين الواقعين - الفكري
Iteration 9997: BEST OCR TEXT : ا
File /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/2f393a4f-9b80-4966-a2a2-e3d1cb8883ed.lstmf line 0 :
Mean rms=3.16%, delta=41.122%, train=99.126%(98.405%), skip ratio=4.8%
Iteration 9998: GROUND  TRUTH : الماوردي الاجتماعية التي تتسم بحركية وجدلية شديدتين يبرز من خلالهما الاتساق في الانقسام ،
Iteration 9998: BEST OCR TEXT : اة
File /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/020f6291-a654-4c6c-a094-ef76713b69a3.lstmf line 0 :
Mean rms=3.16%, delta=41.112%, train=99.125%(98.405%), skip ratio=4.8%
Iteration 9999: GROUND  TRUTH : ١٣٥ في ر : وقعت .
Iteration 9999: BEST OCR TEXT : ا .
File /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/da150fd1-2aed-4730-96ac-3c8d13229f37.lstmf line 0 :
Mean rms=3.16%, delta=41.096%, train=99.118%(98.389%), skip ratio=4.8%
At iteration 9996/10000/10557, Mean rms=3.16%, delta=41.096%, char train=99.118%, word train=98.389%, skip ratio=4.8%,  wrote checkpoint.

Finished! Error rate = 66.807
lstmtraining \
--stop_training \
--continue_from /home/hartwig/Projekte/work/mlu/ulb/ulb-dd-ocr-training-fid/tesstrain/data/gt4ara/checkpoints/gt4ara_checkpoint \
--traineddata /home/hartwig/Projekte/work/mlu/ulb/ulb-dd-ocr-training-fid/tesstrain/data/gt4ara/gt4ara.traineddata \
--model_output /home/hartwig/Projekte/work/mlu/ulb/ulb-dd-ocr-training-fid/tesstrain/data/gt4ara.traineddata
Loaded file /home/hartwig/Projekte/work/mlu/ulb/ulb-dd-ocr-training-fid/tesstrain/data/gt4ara/checkpoints/gt4ara_checkpoint, unpacking...

real    57m57,094s
user    139m55,099s
sys 10m59,039s
[INFO] training finished at Mi 16. Dez 15:00:39 CET 2020

(ara)

Encoding of string failed! Failure bytes: ffffffe2 ffffff80 ffffff8f 2d 20 ffffffd9 ffffff84 ffffffd8 ffffffa7 ffffffd9 ffffff87 ffffffd9 ffffff88 ffffffd8 ffffffb1 20 ffffffd8 ffffff8c 20 ffffffd8 ffffffaf ffffffd8 ffffffa7 ffffffd9 ffffff86 ffffffd8 ffffffb4 ffffffda ffffffa9 ffffffd8 ffffffa7 ffffffd9 ffffff87 20 ffffffd8 ffffffa8 ffffffd9 ffffff86 ffffffd8 ffffffac ffffffd8 ffffffa7 ffffffd8 ffffffa8 20 ffffffd8 ffffff8c 20 31 39 36 32 20 2e 20 ffffffd9 ffffffa2 20 ffffffd8 ffffffac 20 2e
Can't encode transcription: 'اردو دائرة معارف اسلامية ‏- لاهور ، دانشکاه بنجاب ، 1962 . ٢ ج .' in language ''
At iteration 9800/9800/10352, Mean rms=4.063%, delta=41.608%, char train=98.708%, word train=99.462%, skip ratio=5.3%,  wrote checkpoint.

Encoding of string failed! Failure bytes: ffffffe2 ffffff80 ffffff8e 20 ffffffd8 ffffff8c 20 31 39 34 37
Can't encode transcription: 'Mifflin Company, 1 ص‎ ، 1947' in language ''
Encoding of string failed! Failure bytes: ffffffe2 ffffff80 ffffff8f 2d 20 ffffffd9 ffffff84 20 31 37 30 29 20 2e
Can't encode transcription: 'من اليونانية : galene ، بمعنى هدوء البحر (ن دوزي ، فريحة 124 ، نخلة ‏- ل 170) .' in language ''
Encoding of string failed! Failure bytes: ffffffe2 ffffff80 ffffff8f ffffffd8 ffffffb9 ffffffd9 ffffff87 ffffffd9 ffffff85 ffffffd8 ffffffa7 20 ffffffd8 ffffffa7 ffffffd9 ffffff84 ffffffd9 ffffff85 ffffffd8 ffffffab ffffffd9 ffffff84 20 ffffffd8 ffffff8c 20 ffffffd9 ffffff88 ffffffd8 ffffffaa ffffffd8 ffffffb5 ffffffd8 ffffffa7 ffffffd8 ffffffa8 20 ffffffd8 ffffffa7 ffffffd9 ffffff84 ffffffd8 ffffffa7 ffffffd8 ffffffb1 ffffffd8 ffffffa7 ffffffd8 ffffffaf ffffffd8 ffffffa9 20 ffffffd8 ffffffa8 ffffffd8 ffffffa7 ffffffd9 ffffff84 ffffffd8 ffffffb4 ffffffd9 ffffff84 ffffffd9 ffffff84 20 2e
Can't encode transcription: 'عميقة ، كثيراً ما تختفی م‏عهما المثل ، وتصاب الارادة بالشلل .' in language ''
At iteration 9900/9900/10455, Mean rms=4.06%, delta=41.485%, char train=98.831%, word train=99.502%, skip ratio=5.1%,  wrote checkpoint.

Encoding of string failed! Failure bytes: ffffffe2 ffffff80 ffffff8f ffffffd8 ffffffa7 ffffffd9 ffffff88 20 53 69 6c 62 65 72 6d 61 6e 20 28 ffffffd8 ffffffb1 ffffffd8 ffffffac ffffffd9 ffffff84 20 ffffffd9 ffffff81 ffffffd8 ffffffb6 ffffffd8 ffffffa9 29 20 ffffffd8 ffffffa7 ffffffd9 ffffff88 20 73 65 6d 61 6e ffffffe2 ffffff80 ffffff8f 77 69 20 28 ffffffd8 ffffffa7 ffffffd9 ffffff84 ffffffd8 ffffffb1 ffffffd8 ffffffac ffffffd9 ffffff84 20 ffffffd8 ffffffa7 ffffffd9 ffffff84 ffffffd8 ffffffad ffffffda ffffffa9 ffffffd9 ffffff8a ffffffd9 ffffff85 20 2d 20 ffffffd8 ffffffa7 ffffffd9 ffffff84 ffffffd8 ffffffad ffffffd8 ffffffa7 ffffffd8 ffffffae ffffffd8 ffffffa7 ffffffd9 ffffff85 29 20 ffffffd8 ffffff8c 20 ffffffd8 ffffffa7 ffffffd8 ffffffb0 ffffffd8 ffffffa7
Can't encode transcription: 'الذهب) ‏او Silberman (رجل فضة) او seman‏wi (الرجل الحکيم - الحاخام) ، اذا' in language ''
Encoding of string failed! Failure bytes: ffffffe2 ffffff80 ffffff8f 62 65 72 67 20 2c 66 65 6c 64 20 2c 66 69 65 6c 64 20 2c 68 65 69 6d 20 2c 68 6f 75 73 2e
Can't encode transcription: 'المركبة من e‏berg ,feld ,field ,heim ,hous.' in language ''
At iteration 10000/10000/10557, Mean rms=4.059%, delta=41.41%, char train=98.633%, word train=99.572%, skip ratio=4.8%,  wrote checkpoint.

Finished! Error rate = 52.921
lstmtraining \
--stop_training \
--continue_from /home/hartwig/Projekte/work/mlu/ulb/ulb-dd-ocr-training-fid/tesstrain/data/gt4ara/checkpoints/gt4ara_checkpoint \
--traineddata /home/hartwig/Projekte/work/mlu/ulb/ulb-dd-ocr-training-fid/tesstrain/data/gt4ara/gt4ara.traineddata \
--model_output /home/hartwig/Projekte/work/mlu/ulb/ulb-dd-ocr-training-fid/tesstrain/data/gt4ara.traineddata
Loaded file /home/hartwig/Projekte/work/mlu/ulb/ulb-dd-ocr-training-fid/tesstrain/data/gt4ara/checkpoints/gt4ara_checkpoint, unpacking...

real    80m55,483s
user    216m9,250s
sys 11m17,904s
[INFO] training finished at Mi 16. Dez 08:23:59 CET 2020
Shreeshrii commented 3 years ago

The JSTORArabic training data contains both the Arabic-Indic numerals and the Extended Arabic-Indic (Farsi) numerals. This will cause confusion, as many of them look the same.

215 ٠ ARABIC-INDIC DIGIT ZERO
751 ١ ARABIC-INDIC DIGIT ONE
328 ٢ ARABIC-INDIC DIGIT TWO
280 ٣ ARABIC-INDIC DIGIT THREE
292 ٤ ARABIC-INDIC DIGIT FOUR
284 ٥ ARABIC-INDIC DIGIT FIVE
261 ٦ ARABIC-INDIC DIGIT SIX
123 ٧ ARABIC-INDIC DIGIT SEVEN
155 ٨ ARABIC-INDIC DIGIT EIGHT
328 ٩ ARABIC-INDIC DIGIT NINE

32 ۰ EXTENDED ARABIC-INDIC DIGIT ZERO
156 ۱ EXTENDED ARABIC-INDIC DIGIT ONE
170 ۲ EXTENDED ARABIC-INDIC DIGIT TWO
188 ۳ EXTENDED ARABIC-INDIC DIGIT THREE
4 ۵ EXTENDED ARABIC-INDIC DIGIT FIVE
101 ۷ EXTENDED ARABIC-INDIC DIGIT SEVEN
103 ۸ EXTENDED ARABIC-INDIC DIGIT EIGHT
98 ۹ EXTENDED ARABIC-INDIC DIGIT NINE
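If one wanted to keep only one digit system, the Extended Arabic-Indic (Farsi) forms could be normalized to the Arabic-Indic ones before training. A sketch, assuming the Arabic-Indic forms (U+0660–U+0669) are the ones to keep:

```python
# Map U+06F0–U+06F9 (Extended Arabic-Indic, Farsi) onto
# U+0660–U+0669 (Arabic-Indic); the code points run in parallel.
FARSI_TO_ARABIC = str.maketrans(
    "".join(chr(0x06F0 + i) for i in range(10)),
    "".join(chr(0x0660 + i) for i in range(10)),
)

def normalize_digits(text: str) -> str:
    """Replace Farsi digit forms with their Arabic-Indic equivalents."""
    return text.translate(FARSI_TO_ARABIC)
```

Applied over all .gt.txt files, this would collapse the two visually near-identical digit sets into one, at the cost of no longer being able to recognize Farsi digits as distinct characters.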

Shreeshrii commented 3 years ago

@M3ssman While I have seen cases of increasing CER rates as you have indicated, I am not able to reproduce it with the JSTORArabic data.

I started the training using the makefile with

nohup make LANG_TYPE=RTL MODEL_NAME=JSTORArabic PSM=13 START_MODEL=ara TESSDATA=$HOME/tessdata_best MAX_ITERATIONS=9999999 DEBUG_INTERVAL=-1 training >> data/JSTORArabic.log &

Resulting training command was:

lstmtraining \
  --debug_interval -1 \
  --traineddata data/JSTORArabic/JSTORArabic.traineddata \
  --old_traineddata /home/ubuntu/tessdata_best/ara.traineddata \
  --continue_from data/ara/JSTORArabic.lstm \
  --learning_rate 0.0001 \
  --model_output data/JSTORArabic/checkpoints/JSTORArabic \
  --train_listfile data/JSTORArabic/list.train \
  --eval_listfile data/JSTORArabic/list.eval \
  --max_iterations 9999999 \
  --target_error_rate 0.01
Loaded file data/ara/JSTORArabic.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 85 to 126!
Num (Extended) outputs,weights in Series:
  1,48,0,1:1, 0
Num (Extended) outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  Lfys64:64, 20736
  Lfx96:96, 61824
  Lrx96:96, 74112
  Lfx512:512, 1247232
  Fc126:126, 64638
Total weights = 1468702
Previous null char=2 mapped to 2
Continuing from data/ara/JSTORArabic.lstm
Loaded 1/1 lines (1-1) of document data/JSTORArabic-ground-truth/768ff5a2-cc9e-4d85-9e2c-2835e1f2f0f0.lstmf
Loaded 1/1 lines (1-1) of document data/JSTORArabic-ground-truth/58d04709-ed4e-4ae0-a647-7c18a306e5cf.lstmf
Loaded 1/1 lines (1-1) of document data/JSTORArabic-ground-truth/4f16ff99-9bc4-4a23-abc4-ca9a20012218.lstmf
Loaded 1/1 lines (1-1) of document data/JSTORArabic-ground-truth/378c1c7b-5b7e-435e-8e69-ca6f78733b80.lstmf
Loaded 1/1 lines (1-1) of document data/JSTORArabic-ground-truth/fbd5c3a4-c620-4860-8bbb-36e480e58869.lstmf
Loaded 1/1 lines (1-1) of document data/JSTORArabic-ground-truth/f5547321-94c6-4a90-bbe4-90436c1ce35c.lstmf
Loaded 1/1 lines (1-1) of document data/JSTORArabic-ground-truth/1bd96fb8-6385-4829-a4a8-d59e426a8368.lstmf
Loaded 1/1 lines (1-1) of document data/JSTORArabic-ground-truth/3bd2358e-7625-4093-afca-5203e7eea89f.lstmf
Loaded 1/1 lines (1-1) of document data/JSTORArabic-ground-truth/747156fc-a9d8-4a87-83bd-5c94b90962cd.lstmf
Loaded 1/1 lines (1-1) of document data/JSTORArabic-ground-truth/850ada37-843e-4d17-9267-9133dcf0aa6b.lstmf
Loaded 1/1 lines (1-1) of document data/JSTORArabic-ground-truth/8e266cee-c30d-4883-91f2-73508d45e1ff.lstmf
Iteration 0: GROUND  TRUTH : سلجملا بتكم مهنم فلأتيو يرسلا عارتقالاب نيبقارم ةثالثو رسلل نينيماو هيبئانو
Iteration 0: BEST OCR TEXT : سلجملا بتكم مهنم فاأتيو ييرسلا عارتقالاب نيبقار« ةثالثو رسلل نينيماو هيبئانو
File data/JSTORArabic-ground-truth/768ff5a2-cc9e-4d85-9e2c-2835e1f2f0f0.lstmf line 0 :
Mean rms=1.338%, delta=1.908%, train=6.667%(27.273%), skip ratio=0%
Loaded 1/1 lines (1-1) of document data/JSTORArabic-ground-truth/a82f81db-adee-4c6f-b0b4-369d725288b5.lstmf
Iteration 1: GROUND  TRUTH : ثاحبألا
Iteration 1: BEST OCR TEXT : تاهنلا
File data/JSTORArabic-ground-truth/58d04709-ed4e-4ae0-a647-7c18a306e5cf.lstmf line 0 :
Mean rms=2.246%, delta=6.837%, train=53.333%(63.636%), skip ratio=0%
Loaded 1/1 lines (1-1) of document data/JSTORArabic-ground-truth/81611100-b634-4015-b07e-6a3785650890.lstmf
Iteration 2: GROUND  TRUTH : يضارا نم ربكالا ءزجلا نا نع ًامغر اذه - ةيجراخلا قاوسالا يف ماه نأش ايروس
Iteration 2: BEST OCR TEXT : يضارا نم ريكالا ءزجلا نا نع ًامغر اذه - ةيجراخلا قاوسألا يف ماه نأش ايروس
File data/JSTORArabic-ground-truth/fbd5c3a4-c620-4860-8bbb-36e480e58869.lstmf line 0 :
Mean rms=1.969%, delta=5.442%, train=37.382%(46.869%), skip ratio=0%
Loaded 1/1 lines (1-1) of document data/JSTORArabic-ground-truth/54233b12-283f-401e-bccd-749562442f17.lstmf
Iteration 3: GROUND  TRUTH : . یرخا نادلب يف اهتالیثم راعسا نع دیزت راعساب مهتاجوتنم ءارشو ، راحبلا ربع
Iteration 3: BEST OCR TEXT : . ىرخأ نادلب يف اهتاليثم راعسأ نع ديزت راعساب مهتاجوتنم ءارشو » راحبلا ربع
File data/JSTORArabic-ground-truth/3bd2358e-7625-4093-afca-5203e7eea89f.lstmf line 0 :

I am getting output similar to the following around 10000 iterations.

Iteration 9996: GROUND  TRUTH : . ص 654 . 1958 ، برعلا نيماحملا
Iteration 9996: BEST OCR TEXT : . ص 654 . ١958 ، برعلا نيماحملا
File data/JSTORArabic-ground-truth/d3c1bd95-eb04-4e69-99e8-5746e78634c9.lstmf line 0 :
Mean rms=0.983%, delta=2.504%, train=10.042%(14.769%), skip ratio=4.4%
Iteration 9997: GROUND  TRUTH : . لخادلا يف دوجوم ريغ هنأبً اناذاي ةكبشب
Iteration 9997: BEST OCR TEXT : . لخادلا يف دوجوم ريغ هنأب ًاناذيا ةكبشب
File data/JSTORArabic-ground-truth/5701175c-6cc6-40aa-a79c-c08d54091e89.lstmf line 0 :
Mean rms=0.985%, delta=2.51%, train=10.042%(14.794%), skip ratio=4.4%
Iteration 9998: GROUND  TRUTH : ناو . حيحص يطارقميد ماظن لک اهظفحي نا بجيو بعشلل ةماعلا قوقحلا نم يه
Iteration 9998: BEST OCR TEXT : ناو . حيحص يطارقميد ماظن لك اهظفحي نا بجيو بعشلل ةماعلا قوقحلا نم يه
File data/JSTORArabic-ground-truth/d9af74aa-d3b7-425f-a0f6-02a4554022ac.lstmf line 0 :
Mean rms=0.985%, delta=2.51%, train=10.044%(14.801%), skip ratio=4.4%
Iteration 9999: GROUND  TRUTH : قرشلا يف ةيئاشنالا عيراشملاب مايقلا نكمي ال هنا ، ةثعبلا ىرت امک ، یرن اننا عمو
Iteration 9999: BEST OCR TEXT : قرشلا يف ةيئاشنالا عيراشملاب مايقلا نكمي ال هنا ، ةثعبلا ىرت ام ، ىزرن اننا عمو
File data/JSTORArabic-ground-truth/f1c40dc7-f7db-4182-aea6-3a24d509a380.lstmf line 0 :
Mean rms=0.986%, delta=2.513%, train=10.05%(14.814%), skip ratio=4.3%
At iteration 7441/10000/10441, Mean rms=0.986%, delta=2.513%, char train=10.05%, word train=14.814%, skip ratio=4.3%,  New worst char error = 10.05 wrote checkpoint.

Iteration 10000: GROUND  TRUTH : نم لوالا فصنلا لالخ نارهط يف ةصاخ ةرود يف ىرخا ةرم عمتجي نا ررق سلجملا - 19
Iteration 10000: BEST OCR TEXT : نم لوالا فصنلا لالخ نارهط يف ةصاخ ةرود يف ىرخا ةرم عمتجي نا ررق ساجملا - 1
File data/JSTORArabic-ground-truth/53cfd2ef-a2ec-4d57-8fb0-542e4164297d.lstmf line 0 :
Mean rms=0.987%, delta=2.515%, train=10.054%(14.826%), skip ratio=4.3%
Iteration 10001: GROUND  TRUTH : يف اما . فوشلا ادع ام ةيزردلا قطانملا عيمج يفً ادعد زوردلا ةنراوملا قاف ىتح
Iteration 10001: ALIGNED TRUTH : يف اما . فوشلا ادع ام ةيزردلا قطانملا عيمج يفً اددعد زوردلا ةنراوملا قاف ىتح
Iteration 10001: BEST OCR TEXT : يف اما . فوشلا ادع ام ةيزردلا قطانملا عيمج يف اددع زوردلا ةنراوملا قاف ىتح
File data/JSTORArabic-ground-truth/93e3a841-a14f-46bd-890f-6507a2adb8bb.lstmf line 0 :
Mean rms=0.986%, delta=2.511%, train=10.044%(14.806%), skip ratio=4.2%
Iteration 10002: GROUND  TRUTH : ثاحبالاو تاكوكسملا ريدم يدنبشقنلا رصان موحرملا هبتک لاقم رخآ اذه يلي
Iteration 10002: BEST OCR TEXT : ثاحبالاو تاكوكسملا ريدم يدنبشقنلا رصان موحرملا هبتك لاقم رخآ اذه يلب
File data/JSTORArabic-ground-truth/8094f3ec-5394-43cf-8783-9619fdfdb138.lstmf line 0 :
Mean rms=0.985%, delta=2.511%, train=10.036%(14.792%), skip ratio=4.1%
Iteration 10003: GROUND  TRUTH : فيلأت اشاب ساحنلا ىفطصم ةعفر قوراف كلملا ةلالج فلكو ةلاقتسالا تلبقف قوراف
File data/JSTORArabic-ground-truth/626c9db5-0e63-4945-a658-f1ff1fd398fa.lstmf line 0 :
Mean rms=0.985%, delta=2.511%, train=10.036%(14.792%), skip ratio=4.1%
Iteration 10004: GROUND  TRUTH : فورظو عاضوا يف ملاعلا اذه ىلا رظنن نا انيلع موتحملا نم حبصا دق نکلو
Iteration 10004: BEST OCR TEXT : فورظو عاضوا يف ملاعلا اذه ىلا رظنن نا انيلع موتحملا نم حبصا دق نكلو
File data/JSTORArabic-ground-truth/db7908f2-a3e6-4230-b242-ae220cc13bd2.lstmf line 0 :

This is what the CER plot looks like currently.

[Image: JSTORArabic-validate-cer]

M3ssman commented 3 years ago

I'm using the regular Makefile workflow. There seems to be some issue with the locally available charsets:

combine_lang_model \
  --input_unicharset /home/hartwig/Projekte/work/mlu/ulb/ulb-dd-ocr-training-fid/tesstrain/data/gt4ara/unicharset \
  --script_dir data \
  --numbers /home/hartwig/Projekte/work/mlu/ulb/ulb-dd-ocr-training-fid/tesstrain/data/gt4ara/gt4ara.numbers \
  --puncs /home/hartwig/Projekte/work/mlu/ulb/ulb-dd-ocr-training-fid/tesstrain/data/gt4ara/gt4ara.punc \
  --words /home/hartwig/Projekte/work/mlu/ulb/ulb-dd-ocr-training-fid/tesstrain/data/gt4ara/gt4ara.wordlist \
  --output_dir data \
  --pass_through_recoder --lang_is_rtl \
  --lang gt4ara
Failed to read data from: /home/hartwig/Projekte/work/mlu/ulb/ulb-dd-ocr-training-fid/tesstrain/data/gt4ara/gt4ara.wordlist
Failed to read data from: /home/hartwig/Projekte/work/mlu/ulb/ulb-dd-ocr-training-fid/tesstrain/data/gt4ara/gt4ara.punc
Failed to read data from: /home/hartwig/Projekte/work/mlu/ulb/ulb-dd-ocr-training-fid/tesstrain/data/gt4ara/gt4ara.numbers
Loaded unicharset of size 319 from file /home/hartwig/Projekte/work/mlu/ulb/ulb-dd-ocr-training-fid/tesstrain/data/gt4ara/unicharset
Setting unichar properties
Other case À of à is not in unicharset
Other case Ā of ā is not in unicharset
Other case Ô of ô is not in unicharset
Other case Ï of ï is not in unicharset
Setting script properties
Failed to load script unicharset from:data/Arabic.unicharset
Failed to load script unicharset from:data/Inherited.unicharset
Failed to load script unicharset from:data/Latin.unicharset
Failed to load script unicharset from:data/Hebrew.unicharset
Warning: properties incomplete for index 3 = +
Warning: properties incomplete for index 4 = 3
Warning: properties incomplete for index 5 = "
Warning: properties incomplete for index 6 = ذ
Warning: properties incomplete for index 7 = ك
Warning: properties incomplete for index 8 = ر
Warning: properties incomplete for index 9 = ي
Warning: properties incomplete for index 10 = ا
Warning: properties incomplete for index 11 = (
Warning: properties incomplete for index 12 = .
Warning: properties incomplete for index 13 = )
Warning: properties incomplete for index 14 = ل
Warning: properties incomplete for index 15 = م
Warning: properties incomplete for index 16 = َ
Warning: properties incomplete for index 17 = ئ
Warning: properties incomplete for index 18 = ة
Warning: properties incomplete for index 19 = #
Warning: properties incomplete for index 20 = ُ
Warning: properties incomplete for index 21 = خ
Warning: properties incomplete for index 22 = ع
Warning: properties incomplete for index 23 = ض
Warning: properties incomplete for index 24 = و
Warning: properties incomplete for index 25 = “
Warning: properties incomplete for index 26 = ”
Warning: properties incomplete for index 27 = _
Warning: properties incomplete for index 28 = ب
Warning: properties incomplete for index 29 = غ
Warning: properties incomplete for index 30 = ن
Warning: properties incomplete for index 31 = ه
Warning: properties incomplete for index 32 = آ
Warning: properties incomplete for index 33 = !
Warning: properties incomplete for index 34 = [
Warning: properties incomplete for index 35 = <
Warning: properties incomplete for index 36 = 8
Warning: properties incomplete for index 37 = 1
Warning: properties incomplete for index 38 = 0
Warning: properties incomplete for index 39 = 7
Warning: properties incomplete for index 40 = 6
Warning: properties incomplete for index 41 = س
Warning: properties incomplete for index 42 = ق
Warning: properties incomplete for index 43 = ط
Warning: properties incomplete for index 44 = ٍ
Warning: properties incomplete for index 45 = أ
Warning: properties incomplete for index 46 = ت
Warning: properties incomplete for index 47 = د
Warning: properties incomplete for index 48 = ّ
Warning: properties incomplete for index 49 = ج
Warning: properties incomplete for index 50 = ش
Warning: properties incomplete for index 51 = ف
Warning: properties incomplete for index 52 = ,
Warning: properties incomplete for index 53 = ح
Warning: properties incomplete for index 54 = -
Warning: properties incomplete for index 55 = «
Warning: properties incomplete for index 56 = إ
Warning: properties incomplete for index 57 = 4
Warning: properties incomplete for index 58 = 2
Warning: properties incomplete for index 59 = 9
Warning: properties incomplete for index 60 = ص
Warning: properties incomplete for index 61 = ث
Warning: properties incomplete for index 62 = '
Warning: properties incomplete for index 63 = ى
Warning: properties incomplete for index 64 = ز
Warning: properties incomplete for index 65 = ِ
Warning: properties incomplete for index 66 = ْ
Warning: properties incomplete for index 67 = *
Warning: properties incomplete for index 68 = ؤ
Warning: properties incomplete for index 69 = ً
Warning: properties incomplete for index 70 = ء
Warning: properties incomplete for index 71 = ٌ
Warning: properties incomplete for index 72 = 5
Warning: properties incomplete for index 73 = |
Warning: properties incomplete for index 74 = ؟
Warning: properties incomplete for index 75 = ظ
Warning: properties incomplete for index 76 = :
Warning: properties incomplete for index 77 = »
Warning: properties incomplete for index 78 = /
Warning: properties incomplete for index 79 = ؛
Warning: properties incomplete for index 80 = ]
Warning: properties incomplete for index 81 = >
Warning: properties incomplete for index 82 = ٠
Warning: properties incomplete for index 83 = ١
Warning: properties incomplete for index 84 = ©
Warning: properties incomplete for index 85 = ی
Warning: properties incomplete for index 86 = ‌
Warning: properties incomplete for index 87 = پ
Warning: properties incomplete for index 88 = گ
Warning: properties incomplete for index 89 = ۱
Warning: properties incomplete for index 90 = ۳
Warning: properties incomplete for index 91 = ۴
Warning: properties incomplete for index 92 = ۷
Warning: properties incomplete for index 93 = ۹
Warning: properties incomplete for index 94 = ۰
Warning: properties incomplete for index 95 = ۸
Warning: properties incomplete for index 96 = ک
Warning: properties incomplete for index 97 = ۵
Warning: properties incomplete for index 98 = ۲
Warning: properties incomplete for index 99 = چ
Warning: properties incomplete for index 100 = ۶
Warning: properties incomplete for index 101 = ژ
Warning: properties incomplete for index 102 = ٩
Warning: properties incomplete for index 103 = E
Warning: properties incomplete for index 104 = M
Warning: properties incomplete for index 105 = K
Warning: properties incomplete for index 106 = É
Warning: properties incomplete for index 107 = I
Warning: properties incomplete for index 108 = H
Warning: properties incomplete for index 109 = N
Warning: properties incomplete for index 110 = D
Warning: properties incomplete for index 111 = Z
Warning: properties incomplete for index 112 = Y
Warning: properties incomplete for index 113 = A
Warning: properties incomplete for index 114 = V
Warning: properties incomplete for index 115 = R
Warning: properties incomplete for index 116 = ;
Warning: properties incomplete for index 117 = S
Warning: properties incomplete for index 118 = Ê
Warning: properties incomplete for index 119 = Û
Warning: properties incomplete for index 120 = Ş
Warning: properties incomplete for index 121 = Î
Warning: properties incomplete for index 122 = L
Warning: properties incomplete for index 123 = T
Warning: properties incomplete for index 124 = C
Warning: properties incomplete for index 125 = X
Warning: properties incomplete for index 126 = U
Warning: properties incomplete for index 127 = B
Warning: properties incomplete for index 128 = O
Warning: properties incomplete for index 129 = Ç
Warning: properties incomplete for index 130 = G
Warning: properties incomplete for index 131 = F
Warning: properties incomplete for index 132 = P
Warning: properties incomplete for index 133 = W
Warning: properties incomplete for index 134 = J
Warning: properties incomplete for index 135 = Q
Warning: properties incomplete for index 136 = Ü
Warning: properties incomplete for index 137 = È
Warning: properties incomplete for index 138 = ~
Warning: properties incomplete for index 139 = \
Warning: properties incomplete for index 140 = „
Warning: properties incomplete for index 141 = ^
Warning: properties incomplete for index 142 = Ö
Warning: properties incomplete for index 143 = Ù
Warning: properties incomplete for index 144 = Ğ
Warning: properties incomplete for index 145 = Š
Warning: properties incomplete for index 146 = Ã
Warning: properties incomplete for index 147 = Ë
Warning: properties incomplete for index 148 = ®
Warning: properties incomplete for index 149 = &
Warning: properties incomplete for index 150 = @
Warning: properties incomplete for index 151 = ?
Warning: properties incomplete for index 152 = İ
Warning: properties incomplete for index 153 = $
Warning: properties incomplete for index 154 = §
Warning: properties incomplete for index 155 = Þ
Warning: properties incomplete for index 156 = %
Warning: properties incomplete for index 157 = `
Warning: properties incomplete for index 158 = €
Warning: properties incomplete for index 159 = ¬
Warning: properties incomplete for index 160 = s
Warning: properties incomplete for index 161 = ê
Warning: properties incomplete for index 162 = y
Warning: properties incomplete for index 163 = e
Warning: properties incomplete for index 164 = m
Warning: properties incomplete for index 165 = î
Warning: properties incomplete for index 166 = n
Warning: properties incomplete for index 167 = k
Warning: properties incomplete for index 168 = û
Warning: properties incomplete for index 169 = r
Warning: properties incomplete for index 170 = d
Warning: properties incomplete for index 171 = a
Warning: properties incomplete for index 172 = u
Warning: properties incomplete for index 173 = h
Warning: properties incomplete for index 174 = i
Warning: properties incomplete for index 175 = l
Warning: properties incomplete for index 176 = w
Warning: properties incomplete for index 177 = t
Warning: properties incomplete for index 178 = o
Warning: properties incomplete for index 179 = z
Warning: properties incomplete for index 180 = g
Warning: properties incomplete for index 181 = b
Warning: properties incomplete for index 182 = j
Warning: properties incomplete for index 183 = ç
Warning: properties incomplete for index 184 = x
Warning: properties incomplete for index 185 = ù
Warning: properties incomplete for index 186 = c
Warning: properties incomplete for index 187 = q
Warning: properties incomplete for index 188 = ş
Warning: properties incomplete for index 189 = þ
Warning: properties incomplete for index 190 = v
Warning: properties incomplete for index 191 = ı
Warning: properties incomplete for index 192 = p
Warning: properties incomplete for index 193 = f
Warning: properties incomplete for index 194 = ö
Warning: properties incomplete for index 195 = è
Warning: properties incomplete for index 196 = ü
Warning: properties incomplete for index 197 = é
Warning: properties incomplete for index 198 = š
Warning: properties incomplete for index 199 = ë
Warning: properties incomplete for index 200 = ğ
Warning: properties incomplete for index 201 = ã
Warning: properties incomplete for index 202 = ‫
Warning: properties incomplete for index 203 = ‪
Warning: properties incomplete for index 204 = ‬
Warning: properties incomplete for index 205 = ګ
Warning: properties incomplete for index 206 = ڼ
Warning: properties incomplete for index 207 = ټ
Warning: properties incomplete for index 208 = ډ
Warning: properties incomplete for index 209 = ړ
Warning: properties incomplete for index 210 = ږ
Warning: properties incomplete for index 211 = ې
Warning: properties incomplete for index 212 = ۀ
Warning: properties incomplete for index 213 = ۍ
Warning: properties incomplete for index 214 = ښ
Warning: properties incomplete for index 215 = څ
Warning: properties incomplete for index 216 = ھ
Warning: properties incomplete for index 217 = ٨
Warning: properties incomplete for index 218 = ٣
Warning: properties incomplete for index 219 = ٢
Warning: properties incomplete for index 220 = ے
Warning: properties incomplete for index 221 = ځ
Warning: properties incomplete for index 222 = ٧
Warning: properties incomplete for index 223 = ٥
Warning: properties incomplete for index 224 = ٤
Warning: properties incomplete for index 225 = ڪ
Warning: properties incomplete for index 226 = ڄ
Warning: properties incomplete for index 227 = ڙ
Warning: properties incomplete for index 228 = ڀ
Warning: properties incomplete for index 229 = ٹ
Warning: properties incomplete for index 230 = ؿ
Warning: properties incomplete for index 231 = ٿ
Warning: properties incomplete for index 232 = ٴ
Warning: properties incomplete for index 233 = ڍ
Warning: properties incomplete for index 234 = ڌ
Warning: properties incomplete for index 235 = ٽ
Warning: properties incomplete for index 236 = ڏ
Warning: properties incomplete for index 237 = ڊ
Warning: properties incomplete for index 238 = ڻ
Warning: properties incomplete for index 239 = ہ
Warning: properties incomplete for index 240 = ڱ
Warning: properties incomplete for index 241 = ڳ
Warning: properties incomplete for index 242 = ڇ
Warning: properties incomplete for index 243 = ؾ
Warning: properties incomplete for index 244 = ٻ
Warning: properties incomplete for index 245 = ٺ
Warning: properties incomplete for index 246 = ڃ
Warning: properties incomplete for index 247 = ٬
Warning: properties incomplete for index 248 = ٫
Warning: properties incomplete for index 249 = ڎ
Warning: properties incomplete for index 250 = ڂ
Warning: properties incomplete for index 251 = ٰ
Warning: properties incomplete for index 252 = =
Warning: properties incomplete for index 253 = ۔
Warning: properties incomplete for index 254 = ڦ
Warning: properties incomplete for index 255 = ۾
Warning: properties incomplete for index 256 = ؼ
Warning: properties incomplete for index 257 = ٳ
Warning: properties incomplete for index 258 = ۽
Warning: properties incomplete for index 259 = ؽ
Warning: properties incomplete for index 260 = ڑ
Warning: properties incomplete for index 261 = ۇ
Warning: properties incomplete for index 262 = ڈ
Warning: properties incomplete for index 263 = ˆ
Warning: properties incomplete for index 264 = ۋ
Warning: properties incomplete for index 265 = †
Warning: properties incomplete for index 266 = ە
Warning: properties incomplete for index 267 = ›
Warning: properties incomplete for index 268 = £
Warning: properties incomplete for index 269 = °
Warning: properties incomplete for index 270 = ×
Warning: properties incomplete for index 271 = ‹
Warning: properties incomplete for index 272 = ¥
Warning: properties incomplete for index 273 = ‰
Warning: properties incomplete for index 274 = ڭ
Warning: properties incomplete for index 275 = ±
Warning: properties incomplete for index 276 = ¦
Warning: properties incomplete for index 277 = ⁄
Warning: properties incomplete for index 278 = ¢
Warning: properties incomplete for index 279 = ػ
Warning: properties incomplete for index 280 = ۆ
Warning: properties incomplete for index 281 = ۈ
Warning: properties incomplete for index 282 = ێ
Warning: properties incomplete for index 283 = ۅ
Warning: properties incomplete for index 284 = ۉ
Warning: properties incomplete for index 285 = ڕ
Warning: properties incomplete for index 286 = ڵ
Warning: properties incomplete for index 287 = ¡
Warning: properties incomplete for index 288 = ¶
Warning: properties incomplete for index 289 = ¿
Warning: properties incomplete for index 290 = ‡
Warning: properties incomplete for index 291 = ں
Warning: properties incomplete for index 292 = ۃ
Warning: properties incomplete for index 293 = ٗ
Warning: properties incomplete for index 294 = ٭
Warning: properties incomplete for index 295 = ٔ
Warning: properties incomplete for index 296 = ٛ
Warning: properties incomplete for index 297 = ۂ
Warning: properties incomplete for index 298 = ۓ
Warning: properties incomplete for index 299 = ٘
Warning: properties incomplete for index 300 = ٦
Warning: properties incomplete for index 301 = ’
Warning: properties incomplete for index 302 = ™
Warning: properties incomplete for index 303 = {
Warning: properties incomplete for index 304 = }
Warning: properties incomplete for index 305 = —
Warning: properties incomplete for index 306 = ‘
Warning: properties incomplete for index 307 = ،
Warning: properties incomplete for index 308 = ٪
Warning: properties incomplete for index 309 = ❊
Warning: properties incomplete for index 310 = √
Warning: properties incomplete for index 311 = à
Warning: properties incomplete for index 312 = ā
Warning: properties incomplete for index 313 = ש
Warning: properties incomplete for index 314 = ב
Warning: properties incomplete for index 315 = ע
Warning: properties incomplete for index 316 = ô
Warning: properties incomplete for index 317 = ڤ
Warning: properties incomplete for index 318 = ï
Config file is optional, continuing...
Failed to read data from: data/gt4ara/gt4ara.config
lstmtraining \
  --debug_interval -1 \
  --traineddata /home/hartwig/Projekte/work/mlu/ulb/ulb-dd-ocr-training-fid/tesstrain/data/gt4ara/gt4ara.traineddata \
  --old_traineddata /usr/share/tesseract-ocr/4.00/tessdata/Arabic.traineddata \
  --continue_from data/Arabic/gt4ara.lstm \
  --learning_rate 0.0001 \
  --model_output /home/hartwig/Projekte/work/mlu/ulb/ulb-dd-ocr-training-fid/tesstrain/data/gt4ara/checkpoints/gt4ara \
  --train_listfile /home/hartwig/Projekte/work/mlu/ulb/ulb-dd-ocr-training-fid/tesstrain/data/gt4ara/list.train \
  --eval_listfile /home/hartwig/Projekte/work/mlu/ulb/ulb-dd-ocr-training-fid/tesstrain/data/gt4ara/list.eval \
  --max_iterations 100 \
  --target_error_rate 0.01
Loaded file data/Arabic/gt4ara.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 307 to 319!
Num (Extended) outputs,weights in Series:
  1,48,0,1:1, 0
Num (Extended) outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  Lfys64:64, 20736
  Lfx96:96, 61824
  Lrx96:96, 74112
  Lfx384:384, 738816
  Fc319:319, 122815
Total weights = 1018463
Previous null char=2 mapped to 2
Continuing from data/Arabic/gt4ara.lstm
Loaded 1/1 lines (1-1) of document /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/07a601df-3b87-40a8-9e1a-1a194acba063.lstmf
Loaded 1/1 lines (1-1) of document /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/9b47f595-537e-4596-83b1-72547e9228e2.lstmf
Loaded 1/1 lines (1-1) of document /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/1ad056c1-ea3d-4d26-80de-c352516e8360.lstmf
Loaded 1/1 lines (1-1) of document /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/d85a47f7-c556-448a-ad8b-cf5079ee3214.lstmf
Loaded 1/1 lines (1-1) of document /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/2158d4b2-d469-49d8-95e5-e6a89efc8df7.lstmf
Loaded 1/1 lines (1-1) of document /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/4587c0dc-814c-462b-9370-0930f4ce33d7.lstmf
Loaded 1/1 lines (1-1) of document /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/123bfe98-0941-4d8f-977f-5213432d88c6.lstmf
Loaded 1/1 lines (1-1) of document /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/d96fa3d9-8e3a-442f-a8bf-58bdcd2cf064.lstmf
Loaded 1/1 lines (1-1) of document /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/62fb71fe-b9f3-43a6-ad81-5db85e4c9e54.lstmf
Loaded 1/1 lines (1-1) of document /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/7c852c86-5cad-474d-b1ed-533acd9554fd.lstmf
Loaded 1/1 lines (1-1) of document /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/a99968ee-b3c1-4e6b-8a08-39d44ef535da.lstmf
Iteration 0: GROUND  TRUTH : وقد يبدو هذا الامر بديهياً ، بل يجب ان يكون كذلك . غير انه لسؤ الحظ ،
Iteration 0: ALIGNED TRUTH : وقدد يبدو ها الامر بدهياً ، ببل يجب ان يكون كذلك . غيغير انه للسؤ الحظ،
Iteration 0: BEST OCR TEXT : ءظطا ؤسل هنا ريغ ,كلذك نوكي نا بحي لب » ًايهيدب رمالا اذه وديي دقو
File /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/07a601df-3b87-40a8-9e1a-1a194acba063.lstmf line 0 :
Mean rms=3.937%, delta=65.726%, train=18.841%(100%), skip ratio=0%
Loaded 1/1 lines (1-1) of document /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/16855714-c824-4c45-97dc-73fed3652b54.lstmf
Iteration 1: GROUND  TRUTH : والمحافظة عليه ، والذين قد يلزمون لمساعدة اطراف اتفاقات الهدنة في الاشراف على تطببق نصوص
Iteration 1: ALIGNED TRUTH : والمحافظة عليه ، والذين ققديزمون لمساعدةة اطراف افاقات تففاقات الهدنة في الاشراف عى عطبلى تطببق نصنصوص
Iteration 1: BEST OCR TEXT : صوصث قسطت ىلع فارثالا يف ةندحلا تاقافتأ فارطا ةدعاسل تومزاي دق نيذلاو » هيلع ةظفاحلاو
File /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/1ad056c1-ea3d-4d26-80de-c352516e8360.lstmf line 0 :
Mean rms=3.939%, delta=65.407%, train=19.079%(100%), skip ratio=0%
Loaded 1/1 lines (1-1) of document /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/2643311f-fa00-428c-b57d-ad794202dbcf.lstmf
Iteration 2: GROUND  TRUTH : تواجه الادارة المالية والتي يقرر شكلها في النهاية ما تفرضه الحكمة وخبرة المديرين .
Iteration 2: ALIGNED TRUTH : تاجه الادارةالمالية ولتي يقرر شكلها في النهاية ما  تفرضه الححكمة خبروخب لارة لاملممديرينن ..
Iteration 2: BEST OCR TEXT : .نريدملا ةربخو ةمكلا هضرفت ام ةراهنلا‌ىف املكش ررقي ىتلاو ةيلاملا ةرادالا هحاو
File /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/123bfe98-0941-4d8f-977f-5213432d88c6.lstmf line 0 :
Mean rms=3.932%, delta=64.917%, train=18.411%(100%), skip ratio=0%
Loaded 1/1 lines (1-1) of document /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/574cddd1-093c-485e-9802-418e8785c2e7.lstmf
Iteration 3: GROUND  TRUTH : عليها خير الشعب . ونذکر هنا بهذه المناسبة الاجراءات التي اتخذتها الحكومة
Iteration 3: ALIGNED TRUTH : عليها خيراشعب  ونذر هنا بهذه الماسبة الاجراءات التيي اتخخذتها ها لحكا اللحومة
Iteration 3: BEST OCR TEXT : ةموكللأ امجذتا يلا تاءارجلالا ةيسانلا هذهب انه رك ذنو . بعشلا ريخ اهيلع
File /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/d96fa3d9-8e3a-442f-a8bf-58bdcd2cf064.lstmf line 0 :
Mean rms=3.921%, delta=64.549%, train=19.016%(97.917%), skip ratio=0%
Loaded 1/1 lines (1-1) of document /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/3569b057-cd3b-4133-8715-db1e870e9bb2.lstmf
Iteration 4: GROUND  TRUTH : رعوضا ًوطلبات لنسخ من منشورات اكثر من 3500 مكتبة ومؤسسة في كافة
Iteration 4: ALIGNED TRUTH : رعوضاًوطباوتلبا ت لنسخمن منشورات اكاكثر من 3500 مكتبة ومؤسؤسسة ففي كاففة
Iteration 4: BEST OCR TEXT : ةفا يف ةسسؤمو ةيتكم ۳وء٠ نم رثكأ تاروشنم نم خيسنل تابلطو ًاضورع
File /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/9b47f595-537e-4596-83b1-72547e9228e2.lstmf line 0 :
Mean rms=3.922%, delta=65.591%, train=19.657%(98.333%), skip ratio=0%
Loaded 1/1 lines (1-1) of document /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/0a4e9e65-613a-432b-b59d-6d5ccdfd21f1.lstmf
Iteration 5: GROUND  TRUTH : من اعمال .
Iteration 5: ALIGNED TRUTH : من اععمال .
Iteration 5: BEST OCR TEXT : . لامعا مرم
File /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/d85a47f7-c556-448a-ad8b-cf5079ee3214.lstmf line 0 :
Mean rms=3.77%, delta=60.634%, train=21.381%(93.056%), skip ratio=0%
Loaded 1/1 lines (1-1) of document /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/df0606bf-783e-4c5c-957d-277f0e4f26a2.lstmf
Iteration 6: GROUND  TRUTH : المركزي اعماله ، فحدد المبلغ الاقصى لمجموع عمليات التسليف
Iteration 6: ALIGNED TRUTH : االمركزييي اعماله ، ففحد البل الاقصى للمجموع عمليات التسليفف
Iteration 6: BEST OCR TEXT : فيلستلا تايلمم عومج ىصقالا غليلا ددحف » هلاما يزك رلا
File /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/2158d4b2-d469-49d8-95e5-e6a89efc8df7.lstmf line 0 :
Mean rms=3.772%, delta=61.089%, train=20.833%(94.048%), skip ratio=0%
Loaded 1/1 lines (1-1) of document /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/9fc039ef-7fc5-40be-bf88-9bce30e0550b.lstmf
Iteration 7: GROUND  TRUTH : 557
Iteration 7: BEST OCR TEXT : o۷
File /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/4587c0dc-814c-462b-9370-0930f4ce33d7.lstmf line 0 :
Mean rms=3.601%, delta=56.651%, train=39.062%(94.792%), skip ratio=0%
Loaded 1/1 lines (1-1) of document /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/921765bb-3029-427d-9ec6-5cbd1c550328.lstmf
Iteration 8: GROUND  TRUTH : الاتجاه الاجتماعي في الادب العربي الحديث
Iteration 8: ALIGNED TRUTH : الاتتجاه الاجتتماعي في الاددب لعربي الحديث
Iteration 8: BEST OCR TEXT : تيرحلا ييرلا بدالا يف يعام# رلا مايمرلا
File /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/62fb71fe-b9f3-43a6-ad81-5db85e4c9e54.lstmf line 0 :
Mean rms=3.608%, delta=57.301%, train=40%(95.37%), skip ratio=0%
Loaded 1/1 lines (1-1) of document /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/cd2079bc-5d2e-4046-9f74-df315595265f.lstmf
Iteration 9: GROUND  TRUTH : (١١) الصورة الدارجة لقولنا في الفصيح : «يخبر» .
Iteration 9: ALIGNED TRUTH : (١١١) الصورة الدارجة للقولنا يف الففصح : «ييخبببر» ...
Iteration 9: BEST OCR TEXT : . » ريخي « : حيصفلا يث انلوقل ةجرادلا ةروصلا )۱١(
File /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/16855714-c824-4c45-97dc-73fed3652b54.lstmf line 0 :
Mean rms=3.674%, delta=59.413%, train=37.702%(93.611%), skip ratio=0%
Loaded 1/1 lines (1-1) of document /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/5d77c645-c349-4c1d-a077-44dfcb69d7d9.lstmf
Iteration 10: GROUND  TRUTH : لا . ت . 40 ص .
Iteration 10: ALIGNED TRUTH : لا . .ت . 440 صتصت.4 .
Iteration 10: BEST OCR TEXT : ۰ء ه ۰ ٠ ت* ال
File /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/2643311f-fa00-428c-b57d-ad794202dbcf.lstmf line 0 :
Mean rms=3.578%, delta=56.284%, train=42.153%(94.192%), skip ratio=0%
Loaded 1/1 lines (1-1) of document /home/hartwig/Projekte/work/mlu/ulb/github-openiti-trainingdata/JSTORArabic/cb57baed-e09e-48b3-97e1-6295bff1de8f.lstmf

I tried running just 100 iterations, this time with even worse results than usual ;)

If you like, I can provide the project setup.

Shreeshrii commented 3 years ago

My version of the makefile is slightly different; I have opened a few PRs with the changes.

Also, I copy the required unicharset files from langdata_lstm repo to data/.

In my test, I deleted all lines and images which had English and Hebrew.

I will upload the files and instructions later today.

M3ssman commented 3 years ago

@Shreeshrii Many thanks in advance!

Shreeshrii commented 3 years ago

@M3ssman Uploaded files to https://github.com/Shreeshrii/tesstrain-JSTORArabic

I cloned the tesstrain repo, applied my changes and then downloaded JSTORArabic training data and modified it. Please see setup.sh for the steps followed.

setup.sh has already been run before uploading files to github repo.

train.sh contains the makefile invocation that starts training. Please see if this works in your environment.

I have not made any changes to include the RTL and LTR marks in the unicharset or to remove them from the training data. So there are going to be a number of errors related to that.

M3ssman commented 3 years ago

@Shreeshrii Thanks for your investigations! Unfortunately, I cannot locate train.sh in the mentioned repository. There's a call to lstmtraining in 9-run_tess_test.sh, but I'm quite confused about which steps to take to get from 0-setup.sh to that point.

The last commit in my local clone of your repository dates from Jan 22, 2020 (58b63f95). I just tried to visit https://github.com/Shreeshrii/tesstrain-JSTORArabic to update, but could not access the remote. If you have removed it, would you mind re-activating it?

Shreeshrii commented 3 years ago

The repo should be accessible now.

With my limited knowledge of Persian, I noticed that there are problems with the training data I used from JSTORArabic. Different transcription folders follow different conventions; some seem to have already reversed the text.

So I am not very sure how good the data and the training are. But you can see the scripts, and hopefully they will be of help.

M3ssman commented 3 years ago

@Shreeshrii Many thanks for investigating the character issues!

By now, complaints like Encoding of string failed! seem reduced (although some are still present). Rather early, after about iteration 50, the BEST OCR TEXT output gets shorter and shorter and vanishes around iteration 100.

This is using the JSTORArabic data, your repository, and Tesseract 4.1.1 from alex-p.
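In case it helps narrow down the remaining Encoding of string failed! complaints, here is a minimal sketch (the paths are placeholders, and the assumption that the first whitespace-delimited field of each unicharset entry is the glyph is mine) that lists ground-truth characters absent from the unicharset, the usual trigger for that message:

```shell
# Sketch: list ground-truth characters that are missing from a unicharset.
missing_chars() {
  gt_dir=$1; unicharset=$2
  # Every distinct character occurring in the ground truth.
  cat "$gt_dir"/*.gt.txt | grep -o . | LC_ALL=C sort -u > /tmp/gt-chars.$$
  # First unicharset line is the entry count; skip it, keep the glyph column.
  tail -n +2 "$unicharset" | cut -d' ' -f1 | LC_ALL=C sort -u > /tmp/us-chars.$$
  # Characters in the GT but not in the unicharset.
  comm -23 /tmp/gt-chars.$$ /tmp/us-chars.$$
  rm -f /tmp/gt-chars.$$ /tmp/us-chars.$$
}

# Tiny self-contained demo; replace with your real GT dir and unicharset.
demo=$(mktemp -d)
printf 'abc\n' > "$demo/line1.gt.txt"
printf '3\na 0\nb 0\nNULL 0\n' > "$demo/unicharset"
missing_chars "$demo" "$demo/unicharset"   # prints: c
```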

Shreeshrii commented 3 years ago

That does not match the results I get. I restarted training just now. Please see https://github.com/Shreeshrii/tesstrain-JSTORArabic/blob/master/data/JSTORArabic.log

Shreeshrii commented 3 years ago

I think the ground truth needs to be reviewed against the images for Western numerals, because they do not seem to be used consistently. My global replacement of them with Arabic-Indic numerals may therefore cause accuracy issues.

For example (I am guessing based on the OCR text) check these:

Iteration 18: GROUND  TRUTH : ٤٠٤
Iteration 18: ALIGNED TRUTH : ٤٤٠٤
Iteration 18: BEST OCR TEXT : 6 * 
File data/JSTORArabic-ground-truth/99633833-eb2e-438a-a478-a1c494ddf445.lstmf line 0 :
Iteration 20: GROUND  TRUTH : بلح قوس يف بوبحملا ١٠٥ ٥٣١ » لوليا
Iteration 20: ALIGNED TRUTH : بلح قوس يف بوبحملا ١٠٠٥ ٥٥ ٥٣٠١ »٥٥٣١ »»»» لوليا
Iteration 20: BEST OCR TEXT : بلح قوس يف بوحملا ١.# 00 ٠١ 0« 0 لوليأ
File data/JSTORArabic-ground-truth/9604d0b3-6f80-4da2-80c2-3b4251efb55c.lstmf line 0 :
Iteration 31: GROUND  TRUTH : . ٢٩٥ - ٢٠٨ ص (١٩٦٦) ٢/١٩ ج . ةيمألا ةحفاكم يف ةسارد
Iteration 31: ALIGNED TRUTH : . ٢٢٩٥ - ٢٢٠٨ ص (١٩٦٦) ٢/١٩ ج . ةيمألا ةحفاكم يف ةسارد
Iteration 31: BEST OCR TEXT : .7١ -7١8 ص )1435( 1/١9 ج .ةيمألا ةحفاكم يف ةسارد
File data/JSTORArabic-ground-truth/9e1b191d-f838-4bf4-85aa-150a3afb3587.lstmf line 0 :
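To make that review manageable, here is a sketch (my suggestion, not the script used for the original global replace) that transliterates Western digits only on demand, so individual files can be converted after being checked against their images:

```shell
# Sketch: transliterate Western digits to Arabic-Indic digits, one explicit
# substitution per digit (portable across sed implementations).
to_arabic_digits() {
  sed -e 's/0/٠/g' -e 's/1/١/g' -e 's/2/٢/g' -e 's/3/٣/g' -e 's/4/٤/g' \
      -e 's/5/٥/g' -e 's/6/٦/g' -e 's/7/٧/g' -e 's/8/٨/g' -e 's/9/٩/g'
}

# List candidates for manual review first, e.g.:
#   grep -l '[0-9]' data/JSTORArabic-ground-truth/*.gt.txt
echo '208 - 295' | to_arabic_digits   # prints: ٢٠٨ - ٢٩٥
```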
M3ssman commented 3 years ago

Additionally, I removed some more characters beyond those excluded in your setup, to get rid of the quotation/parenthesis characters.

# Drop pairs whose GT contains parentheses, angle brackets, pipes or guillemets
rm -v $(grep -e '[\(|\)|\<|\||\>|«|»]' ${DATA_DIR}*.txt | sed s/gt.txt.*$/*/)
# Drop pairs whose GT file is nearly empty (smaller than 4 bytes)
rm -v $(find ${DATA_DIR} -size -4c | sed s/.gt.txt/.*/)
# Drop pairs whose PNG is only one or two digits wide (i.e. narrower than 100 px)
rm -v $(file ${DATA_DIR}*.png | grep ", . x " | sed s/png/*/)
rm -v $(file ${DATA_DIR}*.png | grep ", .. x ." | sed s/png/*/)

This drops about 1,500 text+image pairs, leaving 6,461 pairs for training, which should be quite sufficient judging from my previous trainings, which were restricted to German Fraktur newspapers (where everything works well ...)

jstor-ara.zip

M3ssman commented 3 years ago

Right now I also discovered via for f in data/jstor-ara-ground-truth/*.txt; do file "$f" | grep -v "UTF-8"; done that about 20 files are reported as ASCII text, with no line terminators. (Note: the glob must not be quoted, or the loop runs once over the literal pattern.)

M3ssman commented 3 years ago

@Shreeshrii I've tried to use your plotting setup within this project to visualize the training process as you did previously, but I already fail to find the plot/plot_cer.sh script mentioned in the corresponding README section. There are only two Python scripts in plot, and both expect a file plot_cer.csv. How can I produce this data from the logfile?

Shreeshrii commented 3 years ago

I changed the plotting to use a Makefile rather than a bash script. That PR is still pending.

Use the following to create the plots.

cd plot
make MODEL_NAME=JSTORArabic
make MODEL_NAME=JSTORArabic VALIDATE_LIST=eval

See https://github.com/Shreeshrii/tesstrain-JSTORArabic/blob/master/README.md
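If you prefer to generate the CSV directly from the log, here is a sketch; the line format is my guess based on the log excerpts in this thread, and the column layout is an assumption, so check what the plot scripts actually expect:

```shell
# Sketch: extract (iteration, char training error) pairs from an
# lstmtraining log into a simple CSV for plotting.
log_to_cer_csv() {
  echo "iteration,char_train_error"
  # Matches lines like:
  # At iteration 100/100/106, Mean rms=3.551%, ..., char train=52.921%, ...
  sed -n 's/^At iteration \([0-9]*\)\/.*char train=\([0-9.]*\)%.*/\1,\2/p' "$1"
}

# Usage (path is an example): log_to_cer_csv data/gt4ara.log > plot_cer.csv
```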

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

galdring commented 3 years ago

Dear @Shreeshrii I'm a colleague of @M3ssman and wanted to give you a short update on our Arabic training. We've uploaded our own training data to our repository, in case you're interested. There's also documentation in the wiki. We'd be very interested to hear your opinion about it.

Best regards

stweil commented 3 years ago

That's great. @galdring, do you want to add an entry at https://github.com/cneud/ocr-gt? I also suggest adding a license to your data (we use CC0).

galdring commented 3 years ago

Thanks for the reply, @stweil - we're currently discussing the best license, but will of course consider your suggestion (I prefer as open as possible anyway). We'll gladly add the data to the mentioned repo.

Shreeshrii commented 3 years ago

@galdring Thank you for sharing your Arabic training data.

I made slight modifications to the training data and then ran replace-top-layer training.

Ground Truth Modifications

# Remove image and line with E (leave X, I and V used in Chapter Numbers)
rm -v $(grep E *.txt|sed s/gt.txt.*$/*/)

# Change persian numbers to arabic
sed -i -e 's/۱/١/g' *.gt.txt
sed -i -e 's/۲/٢/g' *.gt.txt
sed -i -e 's/۳/٣/g' *.gt.txt
grep -e '۱' *.txt
grep -e '۲' *.txt
grep -e '۳' *.txt

# Remove RLM and LRM (the characters inside the s/…/ expressions below are the invisible U+200F / U+200E marks)
sed -i -e 's/‎//g' *.gt.txt
sed -i -e 's/‏//g' *.gt.txt

# Change Initial form of letters to regular
sed -i -e 's/ﺟ/ج/g' *.gt.txt
sed -i -e 's/ﻗ/ق/g' *.gt.txt
grep ﺟ *.gt.txt
grep ﻗ *.gt.txt

# Change ASTERISK to ARABIC FIVE POINTED STAR
sed -i -e 's/\*/٭/g' *.gt.txt
grep '*' *.gt.txt
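The RLM/LRM sed lines above carry the invisible direction marks literally inside the expressions, which is easy to mangle when copying. An equivalent, visible form, assuming GNU sed (which understands \xHH escapes) and LC_ALL=C for bytewise matching of the UTF-8 sequences E2 80 8E (U+200E LRM) and E2 80 8F (U+200F RLM):

```shell
# Sketch: strip LRM/RLM using explicit byte escapes instead of invisible literals.
strip_direction_marks() {
  LC_ALL=C sed -e 's/\xe2\x80\x8e//g' -e 's/\xe2\x80\x8f//g'
}

# In-place over the ground truth would then be, e.g.:
#   for f in *.gt.txt; do strip_direction_marks < "$f" > "$f.tmp" && mv "$f.tmp" "$f"; done
printf 'a\342\200\216b\n' | strip_direction_marks   # prints: ab
```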

all-gt-chars.log

44675        SPACE
22662    ا   ARABIC LETTER ALEF
17901    ل   ARABIC LETTER LAM
10215    م   ARABIC LETTER MEEM
9999     و   ARABIC LETTER WAW
9713     ي   ARABIC LETTER YEH
9631     ن   ARABIC LETTER NOON
7126     ب   ARABIC LETTER BEH
6571     ه   ARABIC LETTER HEH
6286     ر   ARABIC LETTER REH
5529     ع   ARABIC LETTER AIN
4798     ف   ARABIC LETTER FEH
4331     ت   ARABIC LETTER TEH
4318     د   ARABIC LETTER DAL
3901     ق   ARABIC LETTER QAF
3786     َ   ARABIC FATHA
3521     س   ARABIC LETTER SEEN
3019     ك   ARABIC LETTER KAF
3016     أ   ARABIC LETTER ALEF WITH HAMZA ABOVE
2905     ح   ARABIC LETTER HAH
2779     ة   ARABIC LETTER TEH MARBUTA
2460     ّ   ARABIC SHADDA
2371     ،   ARABIC COMMA
2235     ُ   ARABIC DAMMA
2150     ى   ARABIC LETTER ALEF MAKSURA
2087     ج   ARABIC LETTER JEEM
1845     ذ   ARABIC LETTER THAL
1682     ص   ARABIC LETTER SAD
1645     :   COLON
1520     ِ   ARABIC KASRA
1356     ش   ARABIC LETTER SHEEN
1325     ْ   ARABIC SUKUN
1277     .   FULL STOP
1252     خ   ARABIC LETTER KHAH
1116     ث   ARABIC LETTER THEH
1069     ط   ARABIC LETTER TAH
1056     إ   ARABIC LETTER ALEF WITH HAMZA BELOW
1039     ز   ARABIC LETTER ZAIN
969      (   LEFT PARENTHESIS
955      )   RIGHT PARENTHESIS
842      ١   ARABIC-INDIC DIGIT ONE
824      ض   ARABIC LETTER DAD
771      ٢   ARABIC-INDIC DIGIT TWO
747      ٤   ARABIC-INDIC DIGIT FOUR
700      "   QUOTATION MARK
689      ً   ARABIC FATHATAN
682      ٣   ARABIC-INDIC DIGIT THREE
629      ء   ARABIC LETTER HAMZA
600      غ   ARABIC LETTER GHAIN
469      ٥   ARABIC-INDIC DIGIT FIVE
453      ٩   ARABIC-INDIC DIGIT NINE
421      ٨   ARABIC-INDIC DIGIT EIGHT
420      ٦   ARABIC-INDIC DIGIT SIX
377      ئ   ARABIC LETTER YEH WITH HAMZA ABOVE
363      ٧   ARABIC-INDIC DIGIT SEVEN
305      ظ   ARABIC LETTER ZAH
298      ٌ   ARABIC DAMMATAN
297      /   SOLIDUS
246      ٠   ARABIC-INDIC DIGIT ZERO
241      ٍ   ARABIC KASRATAN
221      ـ   ARABIC TATWEEL
181      آ   ARABIC LETTER ALEF WITH MADDA ABOVE
178      -   HYPHEN-MINUS
153      ؤ   ARABIC LETTER WAW WITH HAMZA ABOVE
120      »   RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
119      «   LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
116      [   LEFT SQUARE BRACKET
115      ]   RIGHT SQUARE BRACKET
100      +   PLUS SIGN
96   ؛   ARABIC SEMICOLON
74   X   LATIN CAPITAL LETTER X
65   ؟   ARABIC QUESTION MARK
53   I   LATIN CAPITAL LETTER I
50   ﴿   ORNATE RIGHT PARENTHESIS
50   ﴾   ORNATE LEFT PARENTHESIS
38   ٮ   ARABIC LETTER DOTLESS BEH
38   1   DIGIT ONE
37   <   LESS-THAN SIGN
37   >   GREATER-THAN SIGN
27   2   DIGIT TWO
15   ٰ   ARABIC LETTER SUPERSCRIPT ALEF
13   |   VERTICAL LINE
13   ٖ   ARABIC SUBSCRIPT ALEF
13   3   DIGIT THREE
11   =   EQUALS SIGN
11   ٱ   ARABIC LETTER ALEF WASLA
10   ی   ARABIC LETTER FARSI YEH
7    4   DIGIT FOUR
6    !   EXCLAMATION MARK
6    ﷺ   ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM
6    ٭   ARABIC FIVE POINTED STAR
6    5   DIGIT FIVE
3    }   RIGHT CURLY BRACKET
3    {   LEFT CURLY BRACKET
3    ,   COMMA
3    6   DIGIT SIX
2    ک   ARABIC LETTER KEHEH
2    V   LATIN CAPITAL LETTER V
2    ~   TILDE
2    ‐   HYPHEN
2    8   DIGIT EIGHT
2    7   DIGIT SEVEN

Setup for training

mkdir data/gt4ara-1

cp ~/langdata_lstm/ara/ara.config data/gt4ara-1/gt4ara-1.config
cp ~/langdata_lstm/ara/ara.punc data/gt4ara-1/gt4ara-1.punc
cp ~/langdata_lstm/ara/ara.numbers data/gt4ara-1/gt4ara-1.numbers

nohup make MODEL_NAME=gt4ara-1 START_MODEL=ara LANG_TYPE=RTL  GROUND_TRUTH_DIR=$HOME/tesstrain/data/gt4ara-ground-truth/ TESSDATA=$HOME/tessdata_best    DEBUG_INTERVAL=-1 unicharset > data/gt4ara-1.log &

python ./normalize.py data/gt4ara-1/all-gt
python count_chars.py data/gt4ara-1/all-gt  | sort -n -r > data/gt4ara-1/all-gt-chars.log

## Check data/gt4ara-1/all-gt-chars.log and fix gt.txt files as needed

nohup make MODEL_NAME=gt4ara-1 START_MODEL=ara LANG_TYPE=RTL  GROUND_TRUTH_DIR=$HOME/tesstrain/data/gt4ara-ground-truth/ TESSDATA=$HOME/tessdata_best    DEBUG_INTERVAL=-1 lists --trace >> data/gt4ara-1.log &

find $HOME/tesstrain/data/gt4ara-ground-truth/ -name '*.lstmf' | wc -l 
find $HOME/tesstrain/data/gt4ara-ground-truth/ -name '*.box' | wc -l 
find $HOME/tesstrain/data/gt4ara-ground-truth/ -name '*.tif' | wc -l 
find $HOME/tesstrain/data/gt4ara-ground-truth/ -name '*.gt.txt' | wc -l 
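Instead of comparing the four counts above by eye, a small sketch (same directory as above) can report exactly which companion files are missing for each ground-truth line:

```shell
# Sketch: for every .gt.txt, check that the matching .lstmf/.box/.tif exists.
check_pairs() {
  for t in "$1"/*.gt.txt; do
    base=${t%.gt.txt}
    for ext in lstmf box tif; do
      [ -f "$base.$ext" ] || echo "missing: $base.$ext"
    done
  done
}

check_pairs "$HOME/tesstrain/data/gt4ara-ground-truth"
```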

Training (Not using the default training command from makefile)

make MODEL_NAME=gt4ara-1 GROUND_TRUTH_DIR=data/gt4ara-ground-truth TESSDATA=$HOME/tessdata_best START_MODEL=ara  LANG_TYPE=RTL  DEBUG_INTERVAL=-1  EPOCHS=20  RATIO_TRAIN=0.90  training --trace 

### Kill job and run Layer training (max-iterations -100 = 100 EPOCHS)

mkdir -p data/gt4ara-1/checkpoints
nohup lstmtraining \
  --debug_interval -1 \
  --traineddata data/gt4ara-1/gt4ara-1.traineddata \
  --append_index 5 --net_spec '[Lfx192O1c1]' \
  --continue_from data/ara/gt4ara-1.lstm \
  --learning_rate 0.001 \
  --model_output data/gt4ara-1/checkpoints/gt4ara-1 \
  --train_listfile data/gt4ara-1/list.train \
  --eval_listfile data/gt4ara-1/list.eval \
  --max_iterations -100 \
  --target_error_rate 0.01   2>&1 >> data/gt4ara-1.log  & 

Plotting

bash -x  plot.sh gt4ara-1 eval 15

gt4ara-1-eval-cer

There continues to be a gap between the training and evaluation results (which I have seen with other Arabic/Persian datasets also).

I do not know enough about RTL processing to say whether there is some code missing in the open-source Tesseract code base regarding the reversal of Arabic numbers and punctuation. Ray did the training at Google and the resulting traineddata files were open-sourced, but we don't have the source training text or information on the exact method used.

How do these results compare with what you and @M3ssman have tried?

I will try training with some other variations too.

wrznr commented 3 years ago

After personal communication with @galdring, it seems that the problem could be solved with your hints, @Shreeshrii.