tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Generated .box files have identical coordinates for every character #32

Closed jaddoughman closed 5 years ago

jaddoughman commented 5 years ago

Environment:

tesseract 4.0.0
leptonica-1.76.0
libjpeg 9c : libpng 1.6.35 : libtiff 4.0.9 : zlib 1.2.11
Found AVX2
Found AVX
Found SSE

Platform:

Darwin Kernel Version 18.2.0 ; RELEASE_X86_64 x86_64

Current Behavior:

Tesseract 4.0 using the best ara.traineddata file recalls about 85% of the data, which is pretty good. I'm attempting to train Tesseract using Fine Tuning for Impact. I used the GitHub project OCR-D Train to generate the .box and .lstmf files required for training, since my training data is composed of text line images. After generating the required .box and .lstmf files, I trained Tesseract on a couple of lines for 400 iterations, but the transcription generated by the fine-tuned model looks a lot like "ل.َ1ح*جُ ح( .َو!ة.اع5 ّة'عآة'ا ن'جة.!ع. ”.َئءؤئجآ| ن!.5ل". I exhausted the other possibilities by training with max_iterations 0 and a low target_error_rate, but the results were similar.
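For context, the standard Tesseract 4 fine-tuning flow behind such runs looks roughly like this; a minimal sketch with illustrative paths (list.train being the list of generated .lstmf files), not the exact commands used here:

combine_tessdata -e tessdata_best/ara.traineddata ara.lstm
lstmtraining --model_output output/ara_tuned \
    --continue_from ara.lstm \
    --traineddata tessdata_best/ara.traineddata \
    --train_listfile list.train \
    --max_iterations 400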

The transcription generated by the new model can be found below: Fine Tuned.txt

The transcription generated by the original Arabic model can be found below: Arabic Trained Model.txt

The fine-tuned model can be found below: test1.traineddata.zip

I attempted to train from scratch using 4000 text line images, but they weren't enough to make a difference, and training from scratch doesn't seem logical when your trained model already recalls more than 80% of my data.

A sample of my training data which includes the .box and .lstmf is attached below: training data.zip

jaddoughman commented 5 years ago

و 0 0 223 17 0
ن 0 0 223 17 0
ق 0 0 223 17 0
ل 0 0 223 17 0
ت 0 0 223 17 0
0 0 223 17 0
ص 0 0 223 17 0
ح 0 0 223 17 0
ف 0 0 223 17 0
0 0 223 17 0
ه 0 0 223 17 0
ن 0 0 223 17 0
د 0 0 223 17 0
ي 0 0 223 17 0
ة 0 0 223 17 0
0 0 223 17 0
ا 0 0 223 17 0
م 0 0 223 17 0
س 0 0 223 17 0
0 0 223 17 0

amitdo commented 5 years ago

Generated .box files have identical coordinates for every character

It's not a bug, it's a feature.

The LSTM engine needs only line boxes. If you give it char boxes, the first thing it will do is make line boxes from the char box info.

jaddoughman commented 5 years ago

But in my example of the .box file given above, it is generated as RTL, not LTR. Will this create an issue when fine-tuning? If yes, will inverting the strings to LTR fix my issue?

@amitdo

amitdo commented 5 years ago

About the RTL issue.

The ground truth text file needs to be converted from logical order to visual order.

https://www.unix.com/man-page/linux/1/fribidi
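A minimal sketch of that conversion with the fribidi CLI (filenames illustrative; --nopad suppresses fribidi's default output padding):

fribidi --nopad line_1_5.gt.txt > line_1_5.reversed.gt.txt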

wrznr commented 5 years ago

Thanks for sharing the example. I will try to test it tomorrow and get back to you.

On 03.12.2018 at 21:55, jaddoughman notifications@github.com wrote:

After following your instructions and converting the .gt.txt file and the generated .box file to LTR order, the training looks like this:

Iteration 459: ALIGNED TRUTH : 00000000 0000000 00000 000000" 00000 00000
Iteration 459: BEST OCR TEXT : 00000000 0000000 00000 000000" 00000 00000"
File data/train/line_1_5.lstmf page 0 (Perfect): Mean rms=4.529%, delta=32.952%, train=72.624%(67.841%), skip ratio=0%
Iteration 460: ALIGNED TRUTH : 00000 0000 00 00000 0000
Iteration 460: BEST OCR TEXT : ل00000 0000 00 00000 0000
File data/train/line_1_7.lstmf page 0 : Mean rms=4.522%, delta=32.882%, train=72.484%(67.737%), skip ratio=0%
Iteration 461: ALIGNED TRUTH : 000000000 0000000 00000 000000" 00000 00000
Iteration 461: BEST OCR TEXT : 00000000 0000000 00000 000000" 00000 0000
File data/train/line_1_5.lstmf page 0 (Perfect): Mean rms=4.515%, delta=32.811%, train=72.342%(67.626%), skip ratio=0%
Iteration 462: ALIGNED TRUTH : 00000 0000 00 00000 0000
Iteration 462: BEST OCR TEXT : ل00000 0000 00 00000 0000

Can you explain the reason behind such an error? I will attach the txt and box files below.

line_1_5.gt.txt

line_1_7.gt.txt

line_1_8.gt.txt

0 67 894 0 0 ﻦ 0 67 894 0 0 ﻴ 0 67 894 0 0 ﺒ 0 67 894 0 0 ﻧ 0 67 894 0 0 ﺬ 0 67 894 0 0 ﻤ 0 67 894 0 0 ﻟ 0 67 894 0 0 ﺍ 0 0 894 67 0 0 67 894 0 0 ﺔ 0 67 894 0 0 ﺒ 0 67 894 0 0 ﻗ 0 67 894 0 0 ﺎ 0 67 894 0 0 ﻌ 0 67 894 0 0 ﻣ 0 67 894 0 0 ﻭ 0 0 894 67 0 0 67 894 0 0 ﻝ 0 67 894 0 0 ﺪ 0 67 894 0 0 ﻌ 0 67 894 0 0 ﻟ 0 67 894 0 0 ﺍ 0 0 894 67 0 0 67 894 0 0 ﻖ 0 67 894 0 0 ﻴ 0 67 894 0 0 ﻘ 0 67 894 0 0 ﺤ 0 67 894 0 0 ﺘ 0 67 894 0 0 ﺑ " 0 0 894 67 0 0 0 894 67 0 0 67 894 0 0 ﺪ 0 67 894 0 0 ﻬ 0 67 894 0 0 ﻌ 0 67 894 0 0 ﺘ 0 67 894 0 0 ﻳ 0 0 894 67 0 0 67 894 0 0 ﻙ 0 67 894 0 0 ﺭ 0 67 894 0 0 ﺎ 0 67 894 0 0 ﺒ 0 67 894 0 0 ﻣ " 0 0 894 67 0 894 67 895 68 0


jaddoughman commented 5 years ago

@wrznr

I attached a small dataset of Arabic text lines and their ground truth below. They are in RTL direction. I need to fine-tune the Arabic _best traineddata model. Any help in doing so would be extremely appreciated.

Dataset.zip

jaddoughman commented 5 years ago

@amitdo

If I followed your instructions in changing the ground truth to LTR, wouldn't I have to invert the tiff images as well? The txt file would be inverted when changed to LTR; wouldn't that be an issue when generating the .lstmf files?

jaddoughman commented 5 years ago

@amitdo @wrznr

Should I convert only the .gt.txt to LTR, or should I also convert the resultant .box files? If so, what should be done with the images? Shouldn't they match the inverted text files?

amitdo commented 5 years ago

You should use the regular RTL text for generating the images.

Try using tesseract's text2image and you'll see that the chars are in visual order (reversed).

@wrznr, if you want to support RTL text, you should check that the output of fribidi plus splitting the chars into lines matches the output of text2image (the char order should be the same, not the boxes).
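A rough sketch of that check, with illustrative font and filenames:

# render the ground truth; text2image writes line_1_5.tif plus a line_1_5.box with chars in visual order
text2image --text=line_1_5.gt.txt --outputbase=line_1_5 --font='Arial' --fonts_dir=/usr/share/fonts
# produce fribidi's visual-order version of the same line
fribidi --nopad line_1_5.gt.txt > line_1_5.visual.txt
# the glyph column of the .box should match the char order in line_1_5.visual.txt
cut -d' ' -f1 line_1_5.box | tr -d '\n'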

jaddoughman commented 5 years ago

@amitdo

You misinterpreted my question. I already have the images of the text lines generated, and I have their ground truth. Both the images (tif) and the text lines (gt.txt) are in RTL. After converting the gt.txt files to LTR using fribidi, should I change the resultant box files and/or the original (tif) files?

What are the necessary steps to fine-tune using Arabic text line images?

jaddoughman commented 5 years ago

@amitdo Check out my data set below to visualize my issue.

Dataset.zip

amitdo commented 5 years ago

Should i change the resultant box files

The chars order in the box files should match the reversed ground truth text.

("Hello everyone" in Hebrew): שלום לכולם => םלוכל םולש =>

ם
ל
ו
כ
ל

ם
ו
ל
ש

and/or the original (tif) files?

No.
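The Hebrew reversal above can be reproduced with the fribidi CLI (a sketch; --nopad suppresses output padding):

echo "שלום לכולם" | fribidi --nopad
# prints: םלוכל םולש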

jaddoughman commented 5 years ago

@amitdo

Okay, the box files will automatically have the same order as the reversed txt file. But does it create an issue if, in the box files, the coordinates come before the letter? Even after changing the txt file to RTL, the resultant box files have the format (0 0 0 0 letter), not (letter 0 0 0 0), 0 being any coordinate.

amitdo commented 5 years ago

In which application do you view the box file?

jaddoughman commented 5 years ago

@amitdo

I open the box files using "gedit" on Ubuntu 16.04

amitdo commented 5 years ago

Please provide an example (just one tif, text, reversed, box).

jaddoughman commented 5 years ago

@amitdo

The Sample folder contains the normal tif file with the reversed text file (as you recommended using fribidi) and the generated box files (automatically reversed when generated).

Sample.zip

amitdo commented 5 years ago

I also opened it in gedit (Debian 9). It's fine.
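One way to take the editor's bidi rendering out of the picture is to inspect the file from the shell; a sketch with an illustrative filename:

# the glyph really is the first field of each line, however gedit displays it
cut -d' ' -f1 line_1_5.box | head
# or look at the raw bytes directly
head -n 2 line_1_5.box | hexdump -C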

jaddoughman commented 5 years ago

Okay, great. Does this mean that my dataset is ready for fine-tuning? If yes, how many text lines like the one you saw are recommended for fine-tuning? Also, how many iterations are needed? Thank you for your patience and support. @amitdo

wrznr commented 5 years ago

Also many thanks from my side @amitdo for your support on that matter. I successfully "fine-tuned" tesseract's Fraktur model with the latest version of ocrd-train:

make -j4 training START_MODEL=Fraktur TESSDATA=/home/kmw/built/tessdata_best/script

Place your training images in data/ground-truth, choose the model you want to fine-tune as START_MODEL and the folder the model is located in as TESSDATA, and you should be fine. Pls. note that I haven't had time to test the procedure with an RTL data set yet. Problems are likely to occur, especially since your gedit shows something different from @amitdo's.
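By analogy, the Arabic case in this thread would use the same variables, e.g. (a sketch; the TESSDATA path is illustrative, and ara.traineddata sits in the tessdata_best root rather than in script/):

make -j4 training START_MODEL=ara TESSDATA=/path/to/tessdata_best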

Pls. get back to us with your experience. Maybe we can even close this issue... ;)

amitdo commented 5 years ago

Not sure, probably 150-400 for each font.

amitdo commented 5 years ago

Problems are likely to occur, especially since your gedit shows something else than @amitdo 's.

I see it as he sees it... but it's still fine :-)

amitdo commented 5 years ago

@wrznr, FYI, Tesseract official lstm data was trained with degraded synthetic images. https://github.com/tesseract-ocr/tesseract/issues/1052

jaddoughman commented 5 years ago

@amitdo @wrznr I used the dataset that @amitdo approved of and attempted to fine-tune the Arabic _best model...

Iteration 700: ALIGNED TRUTH : عامتجا ماتخ يف رداقلا دبع اعدو
Iteration 700: BEST OCR TEXT : َجع
File data/train/line_1_34.lstmf page 0 : Mean rms=5.762%, delta=74.89%, train=184.604%(99.745%), skip ratio=0%
Iteration 701: ALIGNED TRUTH : عامتجلا 0ا للا 0خ "هنأ ًاحضوم تاعاس
Iteration 701: BEST OCR TEXT : َو
File data/train/line_1_30.lstmf page 0 : Mean rms=5.76%, delta=74.83%, train=184.479%(99.745%), skip ratio=0%
Iteration 702: ALIGNED TRUTH : عباتي كرابم نا ادكؤم ،هددصلا
Iteration 702: BEST OCR TEXT : ْاةع

What is the issue? I have been attempting every variation of fine-tuning for more than 2 weeks, and the results are very disappointing. Any help would be really appreciated.

wrznr commented 5 years ago

Really hard to tell from a distance. Three things I have noticed:

1. Do not expect any good results before, let's say, the 2000th iteration.
2. The TIFs in Dataset.zip are rather small in terms of file size (mostly about 3k, while our sample line images are about 14k).
3. I could not open them with Ubuntu's standard image viewer: [screenshot]

And, as I mentioned above, there is the issue of different gedit behaviors. This is what it looks like in my gedit: [screenshot]

Correct or not?

jaddoughman commented 5 years ago

@wrznr

1) You can open the tif images using the Shotwell viewer (pre-installed with Ubuntu).
2) Concerning the size, I can generate more than 3k text lines easily, but I don't think a large dataset is needed for fine-tuning. Your Makefile is used for training from scratch, I believe.
3) The Dataset.zip file contains the txt files in RTL order; this was before I used fribidi to convert them to LTR. Do not attempt to train with them.

Can you elaborate on the functionality of your Makefile: are you fine-tuning or training from scratch?

jaddoughman commented 5 years ago

@wrznr

Also, concerning the bidirectional support, I can gladly edit your python script to enable its support of bidirectional text lines. This would be a major upgrade since training for Bidirectional languages is extremely useful.

wrznr commented 5 years ago

Your support is very welcome. If you file a pull request for bidi language support, we will gladly merge it. The Makefile is supposed to support both training from scratch and starting from a previously built model. With 3k, I referred to individual file size rather than training set size.


jaddoughman commented 5 years ago

@wrznr

Yes, the small image size is due to the image extraction from the original image source. The text lines are extracted from a newspaper, so they are cropped to small images. However, tesseract handles these text lines easily with psm 6. I don't see how this creates an issue. Can you elaborate?
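For reference, running tesseract on one of these cropped lines as described looks like this (filename illustrative):

tesseract line_1_5.tif stdout --psm 6 -l ara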

jaddoughman commented 5 years ago

@wrznr

Also, how big was your training set? Do I need a lot of text lines to successfully fine-tune?

wrznr commented 5 years ago

@jaddoughman I do not have much experience with training productive models myself. Sorry. When we set up this repository, our hope was that we could get some of the necessary insights from users like you... But my guess would be that 3k text lines are enough to fine tune an existing model.

With 3k, I referred to individual file size rather than training set size.

File size, not image size (a line in our example data set typically has about 13kb while yours have only 3kb). It might well be that the resolution of the images is too small; it should be at least 300dpi. But again, my experience is rather limited.
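A quick way to check both numbers, assuming ImageMagick is installed (filename illustrative; %x/%y report the image density):

identify -format "%wx%h pixels, density %xx%y\n" line_1_5.tif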