tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

"Compute CTC targets failed!" #210

stefanCCS closed 3 years ago

stefanCCS commented 3 years ago

Hi, I get the following error message during training. The training continues and the result looks OK, but it might still be worth having a look at this:

Compute CTC targets failed!

The training I have started with:

make MAX_ITERATIONS=300 MODEL_NAME=swerror02 GROUND_TRUTH_DIR=~/tesstrain/data/swerror02-ground-truth/ START_MODEL=grc TESSDATA=/usr/share/tesseract-ocr/5/tessdata training

One thing I changed compared to the standard installation: I modified shuffle.py so that the files are not shuffled at all (to make reproducing failures a bit easier), like this:

# Then shuffle the lines, unless the first argument is "0".
if len(sys.argv) > 1 and sys.argv[1] != "0":
    random.shuffle(lines)
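
For readers who want the same switch without patching the script inline, the idea can be sketched as a small standalone function. This is only an illustration, not the actual tesstrain shuffle.py: the name order_lines, the convention that a seed of "0" disables shuffling, and the initial sort are all assumptions made for this sketch.

```python
import random


def order_lines(lines, seed=None):
    """Return the lines in a reproducible order.

    Hypothetical convention for this sketch: a seed of "0" (or no seed)
    means "do not shuffle"; any other seed gives a repeatable shuffle.
    """
    # Sort first so the result does not depend on the input order
    # (an assumption of this sketch, not taken from the thread).
    ordered = sorted(lines)
    if seed is not None and seed != "0":
        # A private Random instance keeps the global random state untouched.
        random.Random(seed).shuffle(ordered)
    return ordered
```

Calling order_lines(lines, "0") returns the sorted list unchanged, while the same non-zero seed yields the same shuffled order on every run.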

The ground-truth-data you can find here attached: swerror02-ground-truth.zip

Shreeshrii commented 3 years ago

The total number of ground truth images is very limited, and many of them are cropped incorrectly. In some cases the image contains the bottom part of one text line and the top of the next. In other cases the image contains 4-5 text lines, which are cut into multiple pieces.

Additionally, some lines are set in a very large font size. For fine-tuning with such a small training set, all lines need to be of a similar type.

Among thousands of lines a few such samples may not affect the result much, but when almost 15 out of 60 lines have a problem, you will not get a good result.

Some of the problem images are posted below.

GT-000009 GT-000004 GT-000005 GT-000006 GT-000007 GT-000008 GT-000046 GT-000047 GT-000048 GT-000049 GT-000043 GT-000044 GT-000045
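
In a larger ground truth set, candidates like these could be pre-screened by comparing image heights. The following is a rough heuristic sketch only: it assumes the pixel heights have already been read with some image library, and the function name flag_height_outliers and the factor of 1.8 are hypothetical choices. It will not catch every bad crop, for example half-lines whose height happens to look normal.

```python
from statistics import median


def flag_height_outliers(heights, factor=1.8):
    """Return the names of line images with an unusual pixel height.

    heights: dict mapping image name -> height in pixels (assumed to be
             collected beforehand).
    factor:  hypothetical tolerance; images taller than factor * median or
             shorter than median / factor are flagged for manual review.
    """
    if not heights:
        return []
    typical = median(heights.values())
    return sorted(
        name
        for name, height in heights.items()
        if height > factor * typical or height < typical / factor
    )
```

Images flagged this way still need a manual look; the check only narrows down where multi-line crops or very large fonts are likely.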

stefanCCS commented 3 years ago

Many thanks for the fast answer. To explain: I am aware that the example I provided here is a very small ground truth. In the real project, my ground truth is about 30000 lines. I posted this small set to understand why I get the "Compute CTC targets failed" error, and I am still not sure what its root cause is.

What I am also interested in: is this error "Compute CTC targets failed" somehow critical for my training?

Please give me some further advice on these questions.

Shreeshrii commented 3 years ago

The problem is with the badly cropped lines. If I remove them, the CTC errors go away. See the attached log.

swerror02.log

> Is this error "Compute CTC targets failed" somehow critical for my training?

It depends on how many such lines there are and what proportion of the total training data they make up. You could ignore them if it were 15 out of all 30000 lines; you shouldn't if it is 15 out of 60 lines.
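
The arithmetic behind that rule of thumb can be sketched as follows. The function names and the 1% cut-off are assumptions chosen for illustration; the thread itself only contrasts 15 of 60 lines with 15 of 30000.

```python
def ctc_failure_share(failed_lines, total_lines):
    """Fraction of ground truth lines that trigger the CTC error."""
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    return failed_lines / total_lines


def should_investigate(failed_lines, total_lines, max_share=0.01):
    """True if the failing share exceeds max_share (hypothetical 1% cut-off)."""
    return ctc_failure_share(failed_lines, total_lines) > max_share
```

With the numbers from this thread, 15 of 60 lines is a share of 25%, well above any sensible cut-off, while 15 of 30000 lines is 0.05%.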

stefanCCS commented 3 years ago

Again, many thanks for answering. This means I will mainly take care of the bad input data: regarding the "CTC" error, I will fix the badly cropped lines, and of course I will also take care of the other bad input data.