Closed akmalkadi closed 3 years ago
How did you apply your newly trained model? How does the multiline image look? Most likely, the defective behavior is caused by a segmentation problem (not a text recognition problem).
@akmalkadi perhaps you are just looking for the Page Segmentation Mode (--psm) option? (See tesseract --help-psm.)
How did you apply your newly trained model? How does the multiline image look? Most likely, the defective behavior is caused by a segmentation problem (not a text recognition problem).
I trained the model using the default settings. I also used the default settings of https://github.com/UYousafzai/easy_train_tesseract/tree/fonts and got the same issue.
The multiline image is a screenshot of the text editor.
@akmalkadi perhaps you are just looking for the Page Segmentation Mode (--psm) option? (See tesseract --help-psm.)
I tried all --psm options and it didn't work (during the OCR process).
Do I need to change it during training? I used the default one.
@akmalkadi Now, I am confused. Are you using tesstrain (i.e. the tool which is developed and distributed via this repo) or easy_train_tesseract (which has nothing to do with tesstrain)? In the latter case, please post your issue there.
@akmalkadi Now, I am confused. Are you using tesstrain (i.e. the tool which is developed and distributed via this repo) or easy_train_tesseract (which has nothing to do with tesstrain)? In the latter case, please post your issue there.
I am facing the same problem with both.
Okay, I am really not sure if I can help here. One problem which should be investigated is the fact that your CER is higher than your WER:
At iteration 44600/100000/100000, Mean rms=0.3%, delta=0.633%, char train=8.195%, word train=3.399%, skip ratio=0%, wrote checkpoint.
This is odd. (Because each word holding an incorrectly recognized character should be judged as incorrect as well.) So maybe something is wrong with our training process. Also, the high ratio between iteration and actually training-processed lines
At iteration 43291/86900/86900
could indicate a problem (1st vs. 2nd and 3rd number). But I am no expert in training with generated lines. If you do not mind, post a sample image from the training set and from your manual test stage. Maybe this will help shed some light on this issue.
@wrznr
One problem which should be investigated is the fact that your CER is higher than your WER This is odd. (Because each word holding an incorrectly recognized character should be judged as incorrect as well.)
This actually depends on the distribution of the errors. At one extreme, if character errors are distributed equally across all words, CER will be much lower than WER. At the other, if character errors are concentrated in a few words (or just one), CER can well be higher than WER.
Plus in the case of Tesseract, CER and WER are measured as Bag-of-CER and Bag-of-WER, i.e. not via sequence alignment but as mere counts (across each line). (See #261 for details.) This makes such matters worse. It means that you could for example write each word's letters in random order – and still have zero BCER while having 100% BWER. Or you could insert an additional, very long all-garbage word – and get BCER close to 100% but BWER close to 0%. On the other hand, you could write each line's words in random order – and get zero BCER and BWER, although the text is completely illegible.
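To make the bag-of-characters vs. bag-of-words distinction concrete, here is a small Python sketch. This is my own toy formalization (multisets compared per line, normalized by GT length), not Tesseract's actual implementation:

```python
from collections import Counter

def bag_error_rate(gt_items, ocr_items):
    # Compare multisets (order ignored); one plausible way to count:
    # take the larger of dropped vs. invented items, over GT length.
    gt_bag, ocr_bag = Counter(gt_items), Counter(ocr_items)
    missing = sum((gt_bag - ocr_bag).values())   # items OCR dropped/changed
    spurious = sum((ocr_bag - gt_bag).values())  # items OCR invented
    return max(missing, spurious) / len(gt_items)

def bcer(gt, ocr):  # bag-of-characters error rate
    return bag_error_rate(list(gt), list(ocr))

def bwer(gt, ocr):  # bag-of-words error rate
    return bag_error_rate(gt.split(), ocr.split())

gt = "the quick brown fox"
scrambled = "teh qucik bworn xfo"  # each word's letters reordered
print(bcer(gt, scrambled))  # 0.0 -> the characters look "perfect"
print(bwer(gt, scrambled))  # 1.0 -> yet every word is wrong
```

This reproduces exactly the pathological case described above: zero BCER with 100% BWER, although no word is legible.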
An additional problem in Tesseract's error metric is that the denominator of the rate calculations is the length of the GT sequence (but not the length of the alignment path). Thus, when the OCR sequence is longer than the GT sequence, the error rate gets overestimated (and can become larger than 100%).
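The overestimation is easy to reproduce with a plain Levenshtein-based CER (a generic sketch, not Tesseract's code) once the denominator is the GT length:

```python
def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

gt, ocr = "abc", "abcdefgh"             # OCR inserted 5 spurious characters
cer = edit_distance(gt, ocr) / len(gt)  # denominator = GT length only
print(cer)  # 5 errors / 3 GT characters, i.e. well over 100%
```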
Also the high ratio between iteration and actually training-processed lines
At iteration 43291/86900/86900
could indicate a problem (1st vs. 2nd and 3rd number
That's not a high ratio at all IMO. Even in training from scans you get ratios up to 10. It depends on how homogeneous the data are and how consistent with the start model (when finetuning).
Thanks, @wrznr @bertsky for your enriching discussion.
I have a question that would explain my issue. We know that we need single lines (.gt.txt and .png) to train the model. If I trained tesseract from scratch using single lines, can I use the trained language (traineddata) to extract text from an image that has multiple lines? Will tesseract do the line segmentation for the image, or do I need to do it myself and extract the text from each line?
I am asking because all the models I trained work only on single-line images.
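(For reference, if you do end up segmenting lines yourself, a simple horizontal projection profile often suffices for clean screenshots. The sketch below is generic and hypothetical, operating on a binary 0/1 pixel grid rather than a real image format:)

```python
def segment_lines(img):
    """img: 2-D grid of 0/1 pixels. Returns (top, bottom) row ranges,
    one per text line: rows containing ink belong to a line, empty
    rows separate lines."""
    inked = [any(row) for row in img]
    lines, start = [], None
    for y, has_ink in enumerate(inked):
        if has_ink and start is None:
            start = y                    # a new line begins
        elif not has_ink and start is not None:
            lines.append((start, y))     # the line ended at row y
            start = None
    if start is not None:                # line touching the bottom edge
        lines.append((start, len(img)))
    return lines

page = [[0]*10, [1]*10, [1]*10, [0]*10, [1]*10, [0]*10]
print(segment_lines(page))  # [(1, 3), (4, 5)]
```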
@akmalkady you are confusing (text) recognition and (page) segmentation. Tesseract's recognition (like all modern OCR engines) operates on line images. The CLI and API also have page segmentation (at various levels), but this is not model-driven (trained/neural) but algorithmic (rule-based).
So in the most basic use case, you pass a line image to the CLI and set --psm 13 (raw line): this will do no segmentation at all. But you can also use --psm 6 (block) with region images or --psm 3 (page) with full-page images. Tesseract will then do layout analysis, pass the segmented (and cropped) lines to its recognition, and finally aggregate these results into the output for that page.
In the training phase, segmentation (for obvious reasons) is not used, so you are effectively in PSM 13.
The Makefile here uses the --psm 13 (raw line, use the whole image) option to create training data from GT images.
When tesseract recognizes multiple lines (--psm 3 or --psm 6), each line image is automatically cropped by tesseract and passed to the network with an additional 4px of padding.
If the margin size (line spacing) of the GT images is significantly larger than tesseract's automatic crop, the trained model may overfit to the margin size.
So I think you first need to check what kind of images (the results of line segmentation) are input to the network.
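The cropping-plus-padding behaviour described above can be illustrated with a toy example (pure Python and made-up dimensions; the fixed 4px padding is the figure mentioned above):

```python
PAD = 4  # fixed padding around the cropped line box

# A "page" as a 2-D grid of 0/1 pixels with two text lines
# only 3 rows apart (i.e. line spacing smaller than the padding).
page = [[0] * 20 for _ in range(20)]
for x in range(2, 18):
    page[5][x] = 1    # line 1 occupies row 5
    page[8][x] = 1    # line 2 occupies row 8

def crop_line(page, top, bottom, pad):
    # crop the line's bounding box, extended by a fixed padding
    top = max(0, top - pad)
    bottom = min(len(page), bottom + pad)
    return page[top:bottom]

line1 = crop_line(page, 5, 6, PAD)
leaked = any(any(row) for row in line1[-PAD:])  # ink in the bottom padding?
print(leaked)  # True: line 2's pixels ended up inside line 1's crop
```

This is the failure mode to look for: when the line spacing is smaller than the padding, pixels of the neighbouring line leak into the cropped line image.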
Thanks, @nagadomi @bertsky. I used different --psm options during training and during text extraction, but I am still having issues with multiple lines. Now I have realized that even when I add additional padding to the image before extraction, tesseract performs very badly, even with --psm 3 or --psm 6.
I will upload a zip file with two directories containing the same images: one without additional padding and one with additional padding. Each image has its --psm 3 output from my trained tesseract. (The model was trained on English text for demonstration.)
It is supposed to give the same results, since the images are identical except for the padding.
Also, I thought the issue might be with the installed tesseract, so I tried the language file on another machine and got the same results. I then tried the default English language to check whether the issue was in the installed tesseract, but it had no issues.
If the multi-line test image, as well as its training data, has a very small font, that may be the cause.
4px padding
The 4px padding actually extends the line box area. So if the input image has a small font, the padding size (the extension) will be relatively large, and pixels from neighboring lines may be included in the line image. Example input image:
line images to be input to the network
The default eng model works fine with these images, but probably will not work with models trained on images with only one line.
This issue can be fixed by resizing the input image to a larger size.
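A back-of-the-envelope way to see why upscaling helps (hypothetical numbers; the 4px figure is the fixed padding mentioned above): the padding is a constant number of pixels, so on a larger image it covers a smaller fraction of the line height.

```python
def padding_ratio(line_height_px, pad_px=4):
    # fraction of the cropped line taken up by the fixed padding
    return pad_px / line_height_px

print(padding_ratio(10))      # 0.4 -> 40% of a 10px-high line is padding
print(padding_ratio(10 * 4))  # 0.1 -> after 4x upscaling, only 10%
```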
issue.zip
The text files in new-padding/ seem to be corrupted.
If the multi-line test image, as well as its training data, has a very small font, that may be the cause.
The first model I tried to train was with easy_train_tesseract, providing a training text to train tesseract from scratch. It used text2image with one font type and font size 12 (the default). I had the same issue with extracting multi-line text from images.
Then I used tesstrain to train a model, also from scratch, with data I generated for one font type and one font size (12). I had the same issue. Now I am using tesstrain again, but this time with one font type and font sizes 8-16. Here is a sample of the training data: training_sample.zip
Is using different font sizes a wrong practice? I decided to use different font sizes because when I trained the model with font size 12, I found it performed well on images with font size 12, but when I tested an image with font size 13, the result was bad.
The text files in new-padding/ seem to be corrupted.
The text files are the output I am getting from tesseract.
I will get back to the other points you mentioned after my previous questions are answered, so I can understand your points better. I really appreciate your time, @nagadomi. Many thanks.
The --ptsize option in tesstrain.py/text2image is not a pixel-size measure. I'm not very familiar with it, but when I checked it with an image viewer, it was larger than expected. If you specify the --save_box_tiff option to tesstrain.py, the tiff images will be saved in --output_dir and you can check them.
For any font size, the line image will be resized to the network input size before being input (naturally, the resolution will differ). The network input size is specified at the beginning of the --net_spec option. For example, [1,36,0, ... will resize the height of the image to 36px.
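As a sketch of that resize (my own illustration of the arithmetic, assuming the height is fixed and the width scales to preserve the aspect ratio):

```python
def network_input_size(width, height, target_height=36):
    # 36 here matches a net_spec beginning like [1,36,0,...
    scale = target_height / height
    return round(width * scale), target_height

print(network_input_size(400, 50))  # (288, 36)
```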
I noticed the issue happens when I train tesseract using custom data (pairs of images/lines). This happened to me when I used tesstrain and easy_train_tesseract. But when I followed the instructions of TessTutorial, recognizing multiple lines worked with the model I trained using the lang-data generated by text2image.
I am not sure if it is because the model was trained on multi-line tiff images or if the starting training data is the reason.
Greetings,
I have trained tesseract from scratch on a dataset of 100k lines (for one font type). I got New best char error = 1.187
I tested the trained language on an image that has 18 lines. I got very bad results:
Nothing in the extracted text was correct. Then I tried to segment the image into lines, tested every line, and got around 85% of the characters correct. Is there any missing step?
Thank you