tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Fine Tuning Leads to Segmentation Issue #2132

Open jaddoughman opened 5 years ago

jaddoughman commented 5 years ago

Current Behavior:

I wanted to OCR a large dataset of Arabic newspapers with difficult delimiters and spacing. After running your original pre-trained model, I managed to recall about 80% of the required data. I opted to fine-tune your existing ara.traineddata file, using text lines as my training and test datasets. I used the "OCR-d Train" tool on GitHub to generate the necessary .box files.

Throughout the fine-tuning process, the eval error percentages decreased substantially, which indicates that the model was trained successfully. I re-evaluated using my own method and confirmed this.

However, the test dataset was made up of text lines, so both your evaluation and mine were generated at the text line level. The issue occurred when I ran the fine-tuned model on a complete newspaper sample (composed of the same text line fonts): the accuracy decreased significantly compared to your original pre-trained model. This makes no sense. My fine-tuned model has better accuracy than yours at the text line level, yet on a complete newspaper containing those same fonts, your pre-trained model performs better than my successfully fine-tuned one.

The issue seems to be connected to your segmentation algorithm. This is a major problem, since it means that your training tool only works at the text line level and cannot be applied to any other form of dynamic text extraction. You will find below a sample newspaper, my fine-tuned model, and the learning curve from the training process.

Sample Newspaper: Sample Newspaper.zip

Fine Tuned Model: ara_finetuned.traineddata.zip

Learning Curve: Learning Curve (60k Iterations).pdf

jaddoughman commented 5 years ago

Any idea what might be causing this issue?

@amitdo @Shreeshrii

stweil commented 5 years ago

Here is a visualisation (using https://github.com/kba/hocrjs) for both results:

The layout recognition is clearly different.

jaddoughman commented 5 years ago

What would be the reason behind the different layouts? Why would my fine-tuning have an impact?

Also, thank you for your support @stweil

Shreeshrii commented 5 years ago

Based on the recommendations for the tesstutorial in the wiki by @theraysmith, fine-tuning should only be done for a limited number of iterations. He suggested 400 iterations when fine-tuning for 'impact' and 3000 when fine-tuning to add a character. So 60,000 is probably too large.

Also, please check the --psm being used for training by the ocr-d/train script. Ray has mentioned, as part of the LSTM training notes, that the models have been trained per line.
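As a sketch of what such a bounded fine-tuning run looks like, the command below builds an lstmtraining invocation capped at a small iteration count. The flags follow the TrainingTesseract 4.00 docs; all paths and file names are placeholders, not the ones from this thread:

```python
def finetune_cmd(max_iterations=400):
    """Build an lstmtraining invocation for fine-tuning, capped at a small
    iteration count as suggested for 'impact'-style fine-tuning.
    All paths are placeholders."""
    return [
        "lstmtraining",
        "--continue_from", "ara.lstm",            # .lstm extracted from ara.traineddata
        "--traineddata", "ara/ara.traineddata",
        "--model_output", "finetuned/ara",
        "--train_listfile", "ara.training_files.txt",
        "--max_iterations", str(max_iterations),
    ]

print(" ".join(finetune_cmd()))
```

The starting checkpoint (`ara.lstm`) would first be extracted from the base model with `combine_tessdata -e ara.traineddata ara.lstm`.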

Shreeshrii commented 5 years ago

https://github.com/tesseract-ocr/tesseract/wiki/NeuralNetsInTesseract4.00#integration-with-tesseract

Integration with Tesseract

The Tesseract 4.00 neural network subsystem is integrated into Tesseract as a line recognizer. It can be used with the existing layout analysis to recognize text within a large document, or it can be used in conjunction with an external text detector to recognize text from an image of a single textline.

The neural network engine is the default for 4.00. To recognize text from an image of a single text line, use SetPageSegMode(PSM_RAW_LINE). This can be used from the command-line with -psm 13

The neural network engine has been integrated to enable the multi-language mode that worked with Tesseract 3.04, but this will be improved in a future release. Vertical text is now supported for Chinese, Japanese and Korean, and should be detected automatically.

@stweil Does this mean that layout analysis has changed since tessdata_best was trained?
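For reference, the two modes the quoted notes describe map directly onto the CLI's --psm flag. A minimal sketch (image and output names are placeholders):

```python
def tesseract_cmd(image, out_base, psm, lang="ara"):
    """Build a tesseract CLI invocation for a given page segmentation mode."""
    return ["tesseract", image, out_base, "-l", lang, "--psm", str(psm)]

# Whole page, using Tesseract's own layout analysis (the default, psm 3):
page_cmd = tesseract_cmd("newspaper.tif", "page_out", 3)
# A single raw text line, bypassing layout analysis (psm 13, PSM_RAW_LINE):
line_cmd = tesseract_cmd("line.png", "line_out", 13)
```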

jaddoughman commented 5 years ago

As shown in the learning curve uploaded above, the training process was successful (even at 60k iterations): the accuracy improved at the text line level. My issue, as explained above and shown in the layout representations, is with segmentation. When running the trained model on a complete newspaper, the accuracy goes way off.

Have a look at the layout representations above. I used --psm 7 for training.

@Shreeshrii

stweil commented 5 years ago

Does this mean that layout analysis has changed since tessdata_best was trained?

Why do you think so?

stweil commented 5 years ago

@jaddoughman, is this result better?

I added ara.config which was missing in ara_finetuned.traineddata.

stweil commented 5 years ago

There are some more components which could be taken from the original ara.traineddata:

ara.lstm-number-dawg
ara.lstm-punc-dawg
ara.lstm-recoder
ara.lstm-unicharset
ara.lstm-word-dawg

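One way to script this suggestion is to extract the listed components from the original ara.traineddata and overwrite them into the fine-tuned file with combine_tessdata. A sketch that only builds the command lines (run them e.g. via subprocess); file names are the ones from this thread:

```python
COMPONENTS = [
    "ara.config",
    "ara.lstm-number-dawg",
    "ara.lstm-punc-dawg",
    "ara.lstm-recoder",
    "ara.lstm-unicharset",
    "ara.lstm-word-dawg",
]

def component_copy_cmds(original="ara.traineddata",
                        finetuned="ara_finetuned.traineddata"):
    """Commands that extract each component from the original model (-e),
    then overwrite them into the fine-tuned model (-o)."""
    cmds = [["combine_tessdata", "-e", original, comp] for comp in COMPONENTS]
    cmds.append(["combine_tessdata", "-o", finetuned] + COMPONENTS)
    return cmds
```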
jaddoughman commented 5 years ago

No, even after adding the dawg files the issue remains. I can't understand how training a model is in any way connected to the segmentation process. The layout representation should be identical for all models, or am I wrong?

@stweil

stweil commented 5 years ago

I would have thought so, too, but recently I noticed some cases which are even stranger:

ara.config changes the segmentation process, so this is something that needs to be added to the documentation: add the config from the original traineddata to the fine-tuned traineddata.

jaddoughman commented 5 years ago

I trained twice, once including the dawg files and once excluding them. The training that included the dawg files was better than the one excluding them. However, both were far worse than the original model.

Also, note that training was successful (learning curve attached above). At the text line level, the results are near perfect. However, I need the transcription of the complete newspaper sample. This was part of a 12-month research project, so reaching this issue now is devastating.

On a technical level, there needs to be an explanation of why and how training any model would in any way alter the segmentation process.

@stweil

Shreeshrii commented 5 years ago

@jaddoughman Which psm are you using for the complete newspaper sample? If it is the default, i.e. psm 3, then please try the training with --psm 3 (or without specifying the psm) as an experiment and see if the results are better.

jaddoughman commented 5 years ago

I attached @Shreeshrii's fine-tuned Arabic model below. @stweil, is it possible to generate its corresponding layout representation? This could help us reach a conclusive decision on our initial assumption concerning the segmentation issue.

Fine Tuned Model: ara-amiri-3000.traineddata.zip

stweil commented 5 years ago

Here it is:

jaddoughman commented 5 years ago

Original Model: https://ub-blade-01.bib.uni-mannheim.de/~stweil/tesseract/issues/2132/Sample1.html
@Shreeshrii's Fine-Tuned Model: https://ub-blade-01.bib.uni-mannheim.de/~stweil/tesseract/issues/2132/Sample1-ara-amiri-3000.html

The above results confirm that our assumption concerning segmentation is correct. Any explanation of the relation between fine-tuning and word detection (segmentation) would be greatly appreciated. Understanding the problem would help in finding a workaround.

@stweil @theraysmith

amitdo commented 5 years ago

The layout analysis phase detects:

Word and glyph splitting is part of the OCR phase, not the layout analysis phase.

jaddoughman commented 5 years ago

Why does fine-tuning change the word recognition? How can I fix my issue?

Also, if word splitting occurs in the OCR phase, then why do I get different results when running the exact same line as part of a complete newspaper versus as an individual text line? Meaning: if I OCR a single text line, I get a different result than when OCRing a complete newspaper containing that text line.

@amitdo

amitdo commented 5 years ago

Looking again at the code, it seems that word splitting does occur in the layout analysis phase...

I think the word splitting can still be changed by the OCR phase.

Sorry, I don't have answers to your last questions.

jaddoughman commented 5 years ago

Can the code be altered to include splitting in the OCR phase? I see no reason why word splitting should be altered during training. My training dataset consisted of 4000 text lines that required crowdsourcing to generate. A lot of time was invested in training the model. Any help would be greatly appreciated.

If any of the other developers have an answer I would be happy to try any alternative fix.

@amitdo

amitdo commented 5 years ago

Sorry, I don't know how to help you with this issue.

Shreeshrii commented 5 years ago

@jaddoughman I unpacked your traineddata file with combine_tessdata. The lstm_unicharset in it has 303 characters, so it seems to me that you trained using script/Arabic from tessdata_best rather than ara.traineddata. If that is indeed the case, please try starting from ara.traineddata to see if there is any difference.

Also, please share the exact version of tesseract that you are using. Your traineddata file reported beta.3.

jaddoughman commented 5 years ago

I trained using both the script and tessdata_best. Both altered the segmentation, leading to the same issue. I was using Tesseract 4.0 during training. However, even if another version was used, I also tried your fine-tuned model, which also resulted in altered word detection.

Is it possible to alter the code so that the word splitting resides in the layout process and not the OCR one?

@Shreeshrii

jaddoughman commented 5 years ago

If I attach my training dataset, would it be possible for you to fine-tune with it, to ensure that the issue isn't related to my training process?

@Shreeshrii

Shreeshrii commented 5 years ago

@theraysmith is the only one with enough knowledge of the code to suggest a solution, and according to @jbreiden he is now busy on another project at Google. If you can share your box/tiff pairs, I can try fine-tuning with them. However, I have to admit that I have not had much success adding the digits in Arabic script to the traineddata by fine-tuning.

jaddoughman commented 5 years ago

The dataset below contains about 4000 text lines. The txt files are in RTL order. I was informed that they needed to be changed to LTR, and I attempted the conversion by inverting the string of every text file. The dataset below is still in RTL, since my conversion attempt might itself be the cause of one of my issues. I fine-tuned for 60,000 iterations and saw a great improvement in accuracy at the text line level. I think your training attempt can help us reach a conclusive decision on the origin of the issue.

Thank you for your help.

Dataset: dataset.zip

@Shreeshrii

Shreeshrii commented 5 years ago

According to posts by Ray, training for all languages is done in LTR order, and there is a routine in tesseract to handle the change to RTL later.

I do not know Arabic and hence cannot check whether the conversion is correct. I am relying on text2image to create the correct box files.

I have concatenated your text files to create a training_text for fine tuning. I will run the training with Scheherazade font and share the results.

From my earlier experience, fine-tuning seems to work best when the training text used is the same as what was used for the initial training. For Arabic we do not have that file available; we only have the 80-line training_text (similar to 3.04).

Shreeshrii commented 5 years ago

@stweil

Does this mean that layout analysis has changed since tessdata_best was trained?

Why do you think so?

No concrete proof :-(

There have been issues with page segmentation, word dropping for a while. There are probably a number of issues related to them still open.

So, something has definitely changed.

If it is not seen in eng, deu and other Latin-script-based languages, then it may be related to complex script processing / unichar compression / recoding.

Were you able to get the unit tests related to unichar compression to work? Maybe they can help in figuring out the issue.

jaddoughman commented 5 years ago

Concerning the LTR conversion, one of the Tesseract developers told me to use FriBidi for the conversion. If tesseract doesn't handle the conversion, you can use that.

@Shreeshrii

amitdo commented 5 years ago

text2image handles bidi. You only need Fribidi if you train from scanned images.
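On the RTL question: a naive per-string reversal (as attempted above) reorders code points blindly, whereas FriBidi applies the full Unicode BiDi algorithm. As a small stdlib sketch, the heuristic below (my own illustration, not part of any of these tools) checks whether a line is dominantly RTL before attempting any conversion:

```python
import unicodedata

def is_rtl(text: str) -> bool:
    """Heuristic: a line is right-to-left if characters with RTL bidi
    classes (R, AL, AN) outnumber those with the LTR class (L)."""
    rtl = sum(unicodedata.bidirectional(c) in ("R", "AL", "AN") for c in text)
    ltr = sum(unicodedata.bidirectional(c) == "L" for c in text)
    return rtl > ltr
```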

Shreeshrii commented 5 years ago

Please see Ray's comments in https://github.com/tesseract-ocr/tesseract/issues/648#issuecomment-271870748 - These are from Jan 2017. He has made changes to the processing for Arabic after that. I will try and find those comments and commits and link here for reference too.

Shreeshrii commented 5 years ago

https://github.com/tesseract-ocr/tesseract/commit/3e63918f9db4150a3d1ff6136df9b753e507ae41

2017-09-08 (3e63918) Ray Smith: Fixed order of characters in ligatures of RTL languages issue #648

Shreeshrii commented 5 years ago

https://github.com/tesseract-ocr/tesseract/commit/4e8018d013e3cefa55f138c7446264ca8931861a

2017-07-19 (4e8018d) Ray Smith: Important fix to RTL languages saves last space on each line, which was previously lost

jaddoughman commented 5 years ago

Okay, did your attempt at fine-tuning with the given dataset work? Your attempt is important, since my extracted traineddata file reports beta 3.

@Shreeshrii

Shreeshrii commented 5 years ago

@jaddoughman Please see https://github.com/Shreeshrii/tessdata_arabic

I have uploaded there various versions of finetuneddata using the training text based on your dataset. I have not used the scanned images.

If you know the font used for the newspaper, or a similar font, fine-tuning with that might give better results.

jaddoughman commented 5 years ago

I ran all your trained models on 5 test samples, but the accuracy decreased on each one. The issue is still caused by word detection, since a fine-tuned model should never perform worse than the original one. This is unfortunate. If any possible explanation arises concerning the connection between training and segmentation, please let me know.

Thank you for your help. @Shreeshrii

jaddoughman commented 5 years ago

Our fine-tuned model performs better at the text line level; hence, training is improving accuracy at that level. One possible solution I'm exploring is to segment the newspaper samples into text lines and OCR them using our fine-tuned model. The issue is that I would need a segmentation algorithm to automate this process.

I created a tool that parses Tesseract's hOCR files into bboxes and generates the corresponding text line images; however, the segmentation was far from perfect. Do you recommend any other way to automatically segment the newspaper samples into text lines? Or word extraction? I just need the segmented text lines, which can then be transcribed using our fine-tuned model.
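For reference, the bbox-extraction step of such a tool can be sketched with the stdlib alone. The `ocr_line` class and the `bbox` property inside the `title` attribute come from the hOCR format; the actual cropping would then use an imaging library such as Pillow. This is a minimal sketch, not the tool described above:

```python
import re

# Matches the bbox inside each ocr_line element's title attribute.
LINE_BBOX = re.compile(
    r"""class=['"]ocr_line['"][^>]*title=['"][^'"]*bbox (\d+) (\d+) (\d+) (\d+)""")

def line_bboxes(hocr: str):
    """Return (x0, y0, x1, y1) pixel boxes for every ocr_line in an hOCR
    document; each box can be fed to e.g. Pillow's Image.crop()."""
    return [tuple(map(int, m.groups())) for m in LINE_BBOX.finditer(hocr)]
```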

SAMPLE NEWSPAPER: Sample1.tif.zip

@Shreeshrii @amitdo @stweil

Shreeshrii commented 5 years ago

As an experiment, try creating the hOCR files using a different language's traineddata and see if the boxing is better.

Also try with --oem 0, i.e. base tesseract instead of LSTM tesseract, and with older versions of tesseract (3.05, 3.04).

It would be good to know whether segmentation is different in all these cases and whether any are better for your use case.

You can also use leptonica directly for segmentation. Please look at the sample programs provided with it, I recall one which had good results for Arabic.

Shreeshrii commented 5 years ago

Please also see https://github.com/OCR-D/ocrd-train/issues/7

jaddoughman commented 5 years ago

I uploaded below the text line images generated by the Arabic, fine-tuned Arabic, and English models (using their respective hOCR files). The English and Arabic text line images differ, probably due to writing orientations (RTL and LTR), but ara and ara_finetuned produced the same results. This is what I predicted, but it doesn't lead me anywhere, since we already knew that fine-tuning doesn't change anything at the text line level; it is the recognition of words that differs.

ENGLISH MODEL: Sample1_eng.zip ARABIC MODEL: Sample1_ara.zip ARABIC FINE TUNED MODEL: Sample1_ara_finetuned.zip

@Shreeshrii

Shreeshrii commented 5 years ago

I created a tool using Tesseract's hOCR files, that parses the hOCR files into bbox and generates the corresponding text line images, however the segmentation was far from perfect.

My reasoning for the experiment was that if another model gives you better segmentation, you can use it for splitting into line images and then use your fine-tuned model for OCR.

Shreeshrii commented 5 years ago

Also see https://github.com/tesseract-ocr/tesseract/issues/657

jaddoughman commented 5 years ago

I tried all variations of different language models and OEMs; no major difference was found. I think the most reasonable solution would be using Leptonica. However, isn't Tesseract powered by Leptonica? If so, can it generate results different from the hOCR files generated by Tesseract?

@Shreeshrii

amitdo commented 5 years ago

https://github.com/tesseract-ocr/tesseract/issues/657

Shreeshrii commented 5 years ago

I just used leptonica/prog/arabic_lines and changed input file name to arabic.png for testing.

The complete newspaper did not work well with it. I cropped a section with two columns.

ubuntu@tesseract-ocr:~/leptonica/prog$ ./arabic_lines
Info in pixRotate: 1 bpp; rotate by shear
Skew angle:   -0.25 degrees;   7.80 conf
Num columns: 2
Num textlines in col 0: 59
Num textlines in col 1: 57
sh: 1: xzgv: not found

Results for that are attached.

arabic.zip

jaddoughman commented 5 years ago

Thank you for your help. However, isn't Tesseract using the arabic_lines code to segment the input image? If not, what code are you using?

@Shreeshrii

Shreeshrii commented 5 years ago

isn't Tesseract using the arabic_lines code to segment the inputted image ?

No. Tesseract has its own layout analysis code which may be using other leptonica functions.

jaddoughman commented 5 years ago

Will you be fixing the issue of fine-tuning leading to altered word detection in the upcoming Tesseract 4.1 release? I believe this is a major obstacle, especially for Arabic, since the pre-trained models perform very badly. Even after you trained using a separate training dataset, the word detection was altered and the accuracy decreased substantially.

If you have any immediate fix, or can point me in a direction that fixes this issue, let me know. I have 185,000 images similar to the ones attached, and my trained model is suffering from the bug discussed above. Thank you for your help.

@Shreeshrii

Shreeshrii commented 5 years ago

The official traineddata has been trained by Ray Smith at Google. As far as I know there are no new updates planned.

I try to follow the guidelines given by Ray in tesstutorial or comments on issues for experimenting with training.

Regarding layout analysis, there are other similar open issues. I am not sure if there are any plans to address those for 4.1.0.

You can try posting in tesseract-ocr google group to see if someone has had better luck with improving Arabic traineddata.

Shreeshrii commented 5 years ago

Do you know which font is used in the images that you want to recognize? Or suggest a similar font.
