Closed sam-kurdi closed 4 years ago
Well, with 6000 steps and 500 lines, the CER should be much lower. But it is hard to tell from a distance without knowing your data. Sorry.
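For readers unfamiliar with the term: CER (character error rate) is commonly defined as the edit distance between the recognized text and the ground truth, divided by the length of the ground truth. The sketch below illustrates that common definition only. The function names are made up for this example, and it is not claimed to match exactly how Tesseract's trainer computes its reported error.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between ref and hyp (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, 1):
        cur = [i]  # turning the first i chars of ref into "" takes i deletions
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # delete r from ref
                cur[j - 1] + 1,          # insert h into ref
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent, relative to the ground-truth length."""
    if not ref:
        raise ValueError("ground truth must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

Under this definition, a value close to 100% means that almost no character of the ground truth is reproduced correctly.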
@sam-kurdi, could you solve the problem? RTL needs special handling when generating the box files and is currently not supported out-of-the-box by tesstrain.
Following the suggestion by Shree in https://github.com/tesseract-ocr/tesstrain/issues/157#issuecomment-614774418, I made this modification and will test it.
The changes to generate_wordstr_box.py are as follows:

    for line in lines:
        line = unicodedata.normalize('NFC', line.strip())
        if args.rtl:
            line = line.translate(str.maketrans("()[]{}»«><", ")(][}{«»<>"))
        if line:
            print("WordStr 0 0 %d %d 0 #%s" % (width, height, line))
            print("\t 0 0 %d %d 0" % (width, height))
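To see what the modification above does in isolation, here is a small self-contained sketch. The translation table is copied from the snippet above. The helper names `mirror_brackets` and `wordstr_box` are hypothetical and exist only for this illustration; they are not part of tesstrain.

```python
import unicodedata

# Same table as in the modification: every bracket maps to its mirror image.
MIRROR = str.maketrans("()[]{}»«><", ")(][}{«»<>")


def mirror_brackets(line: str) -> str:
    """Normalize to NFC, then swap paired punctuation."""
    return unicodedata.normalize("NFC", line.strip()).translate(MIRROR)


def wordstr_box(line: str, width: int, height: int, rtl: bool = False) -> str:
    """Build the two box-file lines for one text line, or '' if it is empty."""
    line = unicodedata.normalize("NFC", line.strip())
    if rtl:
        line = line.translate(MIRROR)
    if not line:
        return ""
    return ("WordStr 0 0 %d %d 0 #%s\n\t 0 0 %d %d 0"
            % (width, height, line, width, height))


if __name__ == "__main__":
    print(mirror_brackets("(a) [b] {c}"))
    print(wordstr_box("(ab)", 100, 20, rtl=True))
```

Note that this only swaps paired punctuation; it does not reorder the characters of the line. Whether that is sufficient for a given RTL dataset would still need to be verified by training.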
On Thu, Apr 16, 2020 at 8:19 PM, Stefan Weil wrote:
> @sam-kurdi, could you solve the problem? RTL needs special handling when generating the box files and is currently not supported out-of-the-box by tesstrain.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I am training on Persian-Arabic script using 500 generated line images.
Why do I get such a high error rate?
2 Percent improvement time=5600, best error was 99.665 @ 1000
At iteration 6600/6600/6600, Mean rms=8.606%, delta=51.091%, char train=97.628%, word train=99.967%, skip ratio=0%, New best char error = 97.628 wrote checkpoint.
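When comparing many checkpoints, it can help to pull the numbers out of such log lines automatically. The following is a minimal sketch that assumes the log format shown above; the regular expression and the name `parse_metrics` are illustrative, not part of any Tesseract tool.

```python
import re

# Log line copied from the training output above.
LOG_LINE = (
    "At iteration 6600/6600/6600, Mean rms=8.606%, delta=51.091%, "
    "char train=97.628%, word train=99.967%, skip ratio=0%, "
    "New best char error = 97.628 wrote checkpoint."
)

# Matches "name=value%" pairs for the metric names seen in this log format.
METRIC = re.compile(r"(Mean rms|delta|char train|word train|skip ratio)=([\d.]+)%")


def parse_metrics(line: str) -> dict:
    """Return the percentage metrics found in one training log line."""
    return {name: float(value) for name, value in METRIC.findall(line)}


if __name__ == "__main__":
    for name, value in parse_metrics(LOG_LINE).items():
        print(f"{name}: {value}%")
```

Here `char train` is the figure to watch: at 97.628% after 6600 iterations, the model is getting nearly every character wrong on its own training data.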
Is this because of the limited number of lines, or something else? The images use different fonts, but the ground truth is written in a single font style. Could this cause the high error rate?