tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

high error rate on training RTL Persian-Arabic script #151

Closed sam-kurdi closed 4 years ago

sam-kurdi commented 4 years ago

I am training on Persian-Arabic script with 500 generated line images.
Why do I get such a high error rate?

    2 Percent improvement time=5600, best error was 99.665 @ 1000
    At iteration 6600/6600/6600, Mean rms=8.606%, delta=51.091%, char train=97.628%, word train=99.967%, skip ratio=0%, New best char error = 97.628 wrote checkpoint.

Is this because of the limited number of lines, or something else? Different fonts are used in the images, but the ground truth is written in a single font style. Could this cause the high error rate?

wrznr commented 4 years ago

Well, with 6000 steps and 500 lines CER should be way lower. But it is hard to tell from a distance without knowing your data... Sry.
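For reference, character error rate is usually defined as the edit distance between the recognized text and the ground truth, divided by the ground-truth length. The sketch below follows that common definition; the function names are illustrative, and the way Tesseract's trainer computes its own "char train" figure may differ in detail.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert/delete/substitute = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(cer("abcd", "abxd"))
```

A char error close to 100% means that almost no character lines up with the ground truth, which is also what happens when the ground truth and the recognized text run in opposite directions.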

stweil commented 4 years ago

@sam-kurdi, could you solve the problem? RTL needs special handling when generating the box files and is currently not supported out-of-the-box by tesstrain.

sam-kurdi commented 4 years ago

Following the comment suggested by Shree, https://github.com/tesseract-ocr/tesstrain/issues/157#issuecomment-614774418, I made this modification and will test it.

The changes to generate_wordstr_box.py are as follows:

    # create WordStr line boxes for Indic & RTL
    for line in lines:
        line = unicodedata.normalize('NFC', line.strip())
        if args.rtl:
            # FIXME: This should not be necessary. Compare with e.g. kraken
            line = line.translate(str.maketrans("()[]{}»«><", ")(][}{«»<>"))
        if line:
            print("WordStr 0 0 %d %d 0 #%s" % (width, height, line))
            print("\t 0 0 %d %d 0" % (width, height))
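To try the bracket mirroring in isolation, here is a minimal self-contained sketch of the same idea. The helper name `wordstr_box` and its parameters are made up for illustration; in the actual script the width and height come from the line image and the RTL switch from the command-line arguments.

```python
import unicodedata

# Same translation table as in the snippet above: swap each opening
# bracket-like character with its closing counterpart.
MIRROR = str.maketrans("()[]{}»«><", ")(][}{«»<>")


def wordstr_box(text, width, height, rtl=False):
    """Return the two WordStr box lines for one line image, or [] if the text is empty.

    Hypothetical helper for illustration only, not part of tesstrain.
    """
    line = unicodedata.normalize("NFC", text.strip())
    if rtl:
        line = line.translate(MIRROR)
    if not line:
        return []
    return [
        "WordStr 0 0 %d %d 0 #%s" % (width, height, line),
        "\t 0 0 %d %d 0" % (width, height),
    ]


if __name__ == "__main__":
    for box_line in wordstr_box("(abc)", 100, 20, rtl=True):
        print(box_line)
```

With `rtl=True` only the paired characters are swapped; the rest of the text, including its character order, is left as it is.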

On Thu, Apr 16, 2020 at 8:19 PM Stefan Weil wrote:

@sam-kurdi, could you solve the problem? RTL needs special handling when generating the box files and currently not supported out-of-the-box by tesstrain.


stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.