tesseract-ocr / tesstrain

Train Tesseract LSTM with make

The trained language doesn't work on multi-line images #241

Closed akmalkadi closed 3 years ago

akmalkadi commented 3 years ago

Greetings,

I have trained Tesseract from scratch on a dataset of 100k lines (for one font type). I got New best char error = 1.187:

At iteration 43291/86900/86900, Mean rms=0.315%, delta=0.532%, char train=2.044%, word train=3.886%, skip ratio=0%,  wrote checkpoint.

At iteration 43298/87000/87000, Mean rms=0.299%, delta=0.492%, char train=1.935%, word train=3.789%, skip ratio=0%,  wrote checkpoint.

At iteration 43304/87100/87100, Mean rms=0.279%, delta=0.429%, char train=1.802%, word train=3.404%, skip ratio=0%,  wrote checkpoint.

At iteration 43311/87200/87200, Mean rms=0.278%, delta=0.426%, char train=1.814%, word train=3.373%, skip ratio=0%,  wrote checkpoint.

At iteration 43322/87300/87300, Mean rms=0.252%, delta=0.337%, char train=1.187%, word train=2.861%, skip ratio=0%,  New best char error = 1.187 Previous test incomplete, skipping test at iteration 43224 wrote best model: data/PSC5/checkpoints/PSC51.187_43322.checkpoint wrote checkpoint.

At iteration 43344/87400/87400, Mean rms=0.255%, delta=0.361%, char train=1.272%, word train=2.976%, skip ratio=0%,  New worst char error = 1.272 wrote checkpoint.

At iteration 43356/87500/87500, Mean rms=0.25%, delta=0.329%, char train=1.199%, word train=2.885%, skip ratio=0%,  New worst char error = 1.199 wrote checkpoint.

At iteration 43367/87600/87600, Mean rms=0.278%, delta=0.591%, char train=1.158%, word train=3.084%, skip ratio=0%,  New best char error = 1.158 wrote checkpoint.

At iteration 43377/87700/87700, Mean rms=0.277%, delta=0.553%, char train=1.189%, word train=3.468%, skip ratio=0%,  New worst char error = 1.189 wrote checkpoint.

At iteration 43388/87800/87800, Mean rms=0.291%, delta=0.61%, char train=1.362%, word train=3.604%, skip ratio=0%,  New worst char error = 1.362 wrote checkpoint.

At iteration 43396/87900/87900, Mean rms=0.287%, delta=0.602%, char train=1.338%, word train=3.475%, skip ratio=0%,  New worst char error = 1.338 wrote checkpoint.

At iteration 43413/88000/88000, Mean rms=0.293%, delta=0.595%, char train=1.255%, word train=3.899%, skip ratio=0%,  New worst char error = 1.255 wrote checkpoint.

At iteration 43421/88100/88100, Mean rms=0.303%, delta=0.683%, char train=3.811%, word train=4.078%, skip ratio=0%,  New worst char error = 3.811 wrote checkpoint.

At iteration 43426/88200/88200, Mean rms=0.303%, delta=0.687%, char train=3.804%, word train=4.154%, skip ratio=0%,  New worst char error = 3.804 wrote checkpoint.

At iteration 43431/88300/88300, Mean rms=0.294%, delta=0.671%, char train=3.74%, word train=3.768%, skip ratio=0%,  New worst char error = 3.74 wrote checkpoint.

At iteration 43443/88400/88400, Mean rms=0.271%, delta=0.596%, char train=3.528%, word train=3.262%, skip ratio=0%,  New worst char error = 3.528 wrote checkpoint.

At iteration 43449/88500/88500, Mean rms=0.268%, delta=0.623%, char train=3.563%, word train=3.266%, skip ratio=0%,  New worst char error = 3.563 wrote checkpoint.
...
At iteration 44578/99800/99800, Mean rms=0.261%, delta=0.519%, char train=6.131%, word train=2.909%, skip ratio=0%,  wrote checkpoint.

At iteration 44586/99900/99900, Mean rms=0.27%, delta=0.516%, char train=6.189%, word train=3.061%, skip ratio=0%,  wrote checkpoint.

At iteration 44600/100000/100000, Mean rms=0.3%, delta=0.633%, char train=8.195%, word train=3.399%, skip ratio=0%,  wrote checkpoint.

I tested the trained language on an image that has 18 lines. I got very bad results:

p p@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@g@

@@@@@

p@@@@@@@@@
@@@@@@
p@p@@@@@@@@@@
p@@@@@@@@

Nothing in the extracted text was correct. Then I tried segmenting the image into lines myself and testing each line separately; that way I got around 85% of the characters correct. Is there any missing step?

Thank you

wrznr commented 3 years ago

How did you apply your newly trained model? How does the multiline image look? Most likely, the defective behavior is caused by a segmentation problem (not a text recognition problem).

bertsky commented 3 years ago

@akmalkadi perhaps you are just looking for the Page Segmentation Mode (--psm) option? (See tesseract --help-psm.)
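
For illustration, a minimal sketch of trying an explicit PSM with the custom model (the model name PSC5 comes from the checkpoint paths above; the image path and the choice of PSM 6 are assumptions):

```python
# Sketch: run tesseract with a custom model and an explicit page segmentation
# mode. Assumes the tesseract CLI is on PATH and PSC5.traineddata is installed
# where tesseract can find it (e.g. under TESSDATA_PREFIX).
import subprocess

subprocess.run(
    ["tesseract", "multiline.png", "out", "-l", "PSC5", "--psm", "6"],
    check=True,
)  # writes out.txt; try other --psm values (see tesseract --help-psm)
```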

akmalkadi commented 3 years ago

How did you apply your newly trained model? How does the multiline image look? Most likely, the defective behavior is caused by a segmentation problem (not a text recognition problem).

I trained the model using the default settings. I also used the default settings of https://github.com/UYousafzai/easy_train_tesseract/tree/fonts and got the same issue.

The multiline image is a screenshot of the text editor.

akmalkadi commented 3 years ago

@akmalkadi perhaps you are just looking for the Page Segmentation Mode (--psm) option? (See tesseract --help-psm.)

I tried all the --psm options and none of them worked (during the OCR process).
Do I need to change it during training? I used the default one.

wrznr commented 3 years ago

@akmalkadi Now, I am confused. Are you using tesstrain (i.e. the tool which is developed and distributed via this repo) or easy_train_tesseract (which has nothing to do with tesstrain)? In the latter case, please post your issue there.

akmalkadi commented 3 years ago

@akmalkadi Now, I am confused. Are you using tesstrain (i.e. the tool which is developed and distributed via this repo) or easy_train_tesseract (which has nothing to do with tesstrain)? In the latter case, please post your issue there.

I am facing the same problem with both.

wrznr commented 3 years ago

Okay, I am really not sure if I can help here. One problem which should be investigated is the fact that your CER is higher than your WER:

At iteration 44600/100000/100000, Mean rms=0.3%, delta=0.633%, char train=8.195%, word train=3.399%, skip ratio=0%,  wrote checkpoint.

This is odd (because each word containing an incorrectly recognized character should be judged as incorrect as well), so maybe something is wrong with your training process. Also, the high ratio between the iteration count and the number of actually processed training lines

At iteration 43291/86900/86900

could indicate a problem (1st vs. 2nd and 3rd number). But I am no expert in training with generated lines. If you do not mind, post a sample image from the training set and from your manual test stage. Maybe this will help shed some light on this issue.

bertsky commented 3 years ago

@wrznr

One problem which should be investigated is the fact that your CER is higher than your WER [...] This is odd (because each word containing an incorrectly recognized character should be judged as incorrect as well).

This actually depends on the distribution of the errors. At one extreme, if character errors are distributed equally across all words, CER will be much lower than WER. At the other, if character errors are concentrated in a few words (or just one), CER may well be higher than WER.

Plus, in the case of Tesseract, CER and WER are measured as bag-of-CER and bag-of-WER, i.e. not via sequence alignment but as mere counts (across each line); see #261 for details. This makes such matters worse. It means that you could, for example, write each word's letters in random order and still have zero BCER while having 100% BWER. Or you could insert an additional, very long all-garbage word and get a BCER close to 100% but a BWER close to 0%. On the other hand, you could write each line's words in random order and get zero BCER and zero BWER, although the text is completely illegible.

An additional problem with Tesseract's error metric is that the denominator of the rate calculation is the length of the GT sequence (not the length of the alignment path). Thus, when the OCR sequence is longer than the GT sequence, the error rate is overestimated (and can even exceed 100%).
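
To make the bag-metric behavior concrete, here is a toy sketch (with my own counting convention of missing plus spurious items over GT length; not Tesseract's actual implementation, see #261 for that):

```python
# Toy illustration of order-insensitive "bag" error rates and why they can
# diverge from alignment-based CER/WER. Counting convention assumed here:
# (items missing from OCR + spurious items in OCR) / length of GT.
from collections import Counter

def bag_error_rate(gt_items, ocr_items):
    gt_bag, ocr_bag = Counter(gt_items), Counter(ocr_items)
    errors = sum((gt_bag - ocr_bag).values()) + sum((ocr_bag - gt_bag).values())
    return errors / len(gt_items)  # GT length as denominator: rate can exceed 100%

gt, ocr = "hello world", "olleh dlrow"          # every word's letters scrambled
print(bag_error_rate(gt, ocr))                  # 0.0 -> character bags match exactly
print(bag_error_rate(gt.split(), ocr.split()))  # 2.0 -> every word missing AND spurious
```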

bertsky commented 3 years ago

Also, the high ratio between the iteration count and the number of actually processed training lines

At iteration 43291/86900/86900

could indicate a problem (1st vs. 2nd and 3rd number)

That's not a high ratio at all, IMO. Even in training from scans you get ratios of up to 10. It depends on how homogeneous the data are and how consistent they are with the start model (when fine-tuning).

akmalkadi commented 3 years ago

Thanks, @wrznr @bertsky for your enriching discussion.

I have a question whose answer may explain my issue. We know that we need single lines (.gt.txt and .png pairs) to train the model. If I trained Tesseract from scratch using single lines, can I use the trained language (traineddata) to extract text from an image that has multiple lines? Will Tesseract do the line segmentation for the image, or do I need to do it myself and extract the text from each line?

I am asking because all the models I trained work only on single-line images.

bertsky commented 3 years ago

@akmalkadi you are confusing (text) recognition and (page) segmentation. Tesseract's recognition (like that of all modern OCR engines) operates on line images. The CLI and API also provide page segmentation (at various levels), but this is not model-driven (trained/neural); it is algorithmic (rule-based).

So in the most basic use case, you pass a line image to the CLI and set --psm 13 (raw line): this will do no segmentation at all. But you can also enter at --psm 6 (block) with region images, or at --psm 3 (page) with full-page images. This will do layout analysis, then pass the segmented (and cropped) lines to recognition, and finally aggregate these results into the output for that page.
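
As a concrete sketch of the three entry points (hypothetical file names, assuming the tesseract CLI is on PATH):

```python
# Sketch: the same recognition model behind three page segmentation levels.
# Each call writes its text to out.txt; file names are hypothetical.
import subprocess

subprocess.run(["tesseract", "line.png", "out", "--psm", "13"], check=True)   # raw line, no segmentation
subprocess.run(["tesseract", "region.png", "out", "--psm", "6"], check=True)  # single uniform block
subprocess.run(["tesseract", "page.png", "out", "--psm", "3"], check=True)    # full automatic page segmentation
```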

In the training phase, segmentation is (for obvious reasons) not used, so you are effectively in PSM 13.

nagadomi commented 3 years ago

The Makefile here uses the --psm 13 (raw line, use the whole image) option to create training data from the GT images. When Tesseract recognizes multiple lines (--psm 3 or --psm 6), each line image is automatically cropped by Tesseract and passed to the network with an additional 4px of padding. If the margin size (line spacing) of the GT images is significantly larger than Tesseract's automatic crop, the trained model may be overfitted to that margin size.

So I think you first need to check what kind of images (the result of line segmentation) are being input to the network.
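
To get a feel for what such a crop looks like, here is a toy sketch that mimics the fixed 4px padding around a detected line box (the box coordinates are hypothetical; this is not Tesseract's internal code):

```python
# Toy sketch: crop a line bounding box with the fixed 4px padding described
# above. With a small font, 4px is large relative to the text height and can
# pull in pixels from neighboring lines.
from PIL import Image

page = Image.open("page.png")               # hypothetical multi-line image
left, top, right, bottom = 10, 50, 400, 62  # hypothetical line box (12px tall)
pad = 4                                     # fixed padding around the crop
line = page.crop((left - pad, top - pad, right + pad, bottom + pad))
line.save("line_with_padding.png")          # inspect: do neighbor lines bleed in?
```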

akmalkadi commented 3 years ago

Thanks, @nagadomi @bertsky. I used different --psm options during training and during text extraction, but I am still having issues with multi-line images. Now I realize that even when I add additional padding to the image before extraction, Tesseract performs very badly, even with --psm 3 or --psm 6.

I will upload a zip file containing two directories with the same images: one without additional padding and one with additional padding. Each image comes with the --psm 3 output from my trained Tesseract. (The model was trained on English text for demonstration.)

issue.zip

It is supposed to produce the same results, since the images are identical except for the padding.

I thought the issue might be with the installed Tesseract, so I tried the language file on another machine and got the same results. I also tried the default English language to check whether the installed Tesseract was at fault, but there I got no issues.

nagadomi commented 3 years ago

If the multi-line test image (like its training data) has a very small font, that may be the cause.

4px padding

4px padding actually extends the line box area. So if the input image has a small font, the padding (the extension) will be relatively large, and pixels from neighboring lines may be included in the line image. Example input image: [attached image: small_font]

Line images to be input to the network: [attached images: lstm_input_0, lstm_input_1, lstm_input_2]

The default eng model works fine with these images, but it will probably not work with models trained on images containing only one line. This issue can be fixed by resizing the input image to a larger size. [attached images: large_font, lstm_input_0, lstm_input_1, lstm_input_2]
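
A minimal sketch of that fix (the 2x factor and file names are assumptions; pick a factor that makes the text height large relative to the 4px padding):

```python
# Sketch: upscale a small-font page so the fixed 4px line padding becomes
# small relative to the text height before running recognition.
from PIL import Image

img = Image.open("small_font.png")                             # hypothetical input
img = img.resize((img.width * 2, img.height * 2), Image.LANCZOS)
img.save("large_font.png")                                     # then run tesseract on this
```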

issue.zip

The text files in new-padding/ seem to be corrupted.

akmalkadi commented 3 years ago

If the multi-line test image (like its training data) has a very small font, that may be the cause.

The first model I trained was with easy_train_tesseract, by providing a training text to train Tesseract from scratch. It used text2image with one font type and font size 12 (the default). I had the same issue with extracting multi-line text from images.

Then I used tesstrain to train a model, also from scratch, with data I generated for one font type and a single font size of 12. I had the same issue. Now I am using tesstrain again, for one font type but with font sizes 8-16. Here is a sample of the training data: training_sample.zip

Is using different font sizes a wrong practice? I decided to use different font sizes because when I trained the model with font size 12, I found it performed well on images with font size 12, but when I tested an image with font size 13, the result was bad.

The text files in new-padding/ seem to be corrupted.

The text files are the output I am getting from Tesseract.

I will come back to the other points you mentioned once my questions above are answered, so that I can better understand your points. I really appreciate your time, @nagadomi. Many thanks.

nagadomi commented 3 years ago

The --ptsize option in tesstrain.py/text2image is not a pixel-size measure. I'm not very familiar with it, but when I check the result with an image viewer, it is larger than expected. If you specify the --save_box_tiff option to tesstrain.py, the TIFF images will be saved in --output_dir and you can check them. For any font size, the line image will be resized to the network input size before being input (naturally, the resolution will differ). The network input size is specified at the beginning of the --net_spec option; for example, [1,36,0,... will resize the height of the image to 36px.
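
A quick way to check the rendered size (a sketch; the file path is hypothetical and depends on what --save_box_tiff wrote into --output_dir):

```python
# Sketch: print the pixel dimensions of a rendered training TIFF to see the
# actual size --ptsize produced (points, not pixels), and compare the height
# with the net_spec input height (e.g. 36).
from PIL import Image

img = Image.open("output_dir/eng.SomeFont.exp0.tif")  # hypothetical path
print(img.size)  # (width, height) in pixels
```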

akmalkadi commented 3 years ago

I noticed the issue happens when I train Tesseract using custom data (image/line pairs). This happened to me when I used tesstrain and easy_train_tesseract. But when I followed the instructions of TessTutorial, recognizing multiple lines worked with the model I trained on the langdata generated by text2image.

I am not sure whether it is because that model was trained on multi-line TIFF images or whether the starting training data is the reason.