tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

"Compute CTC targets failed for xyz.lstmf!" for custom NET_SPECs #390

Closed: yaofuzhou closed this issue 6 months ago

yaofuzhou commented 6 months ago

Hi, I have successfully run the training program on a training set with the following NET_SPEC, which is slightly modified from that of an official Tesseract OCR model:

NET_SPEC := [1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c\#\#\#]

Then, when I experimented in the same way with the following NET_SPECs:

NET_SPEC := [1,0,0,1 Ct5,5,32 Ct5,5,64 Mp3,3 Ct5,5,128 Mp3,3 Lfys128 Lbx256 Lfx512 O1c\#\#\#]
NET_SPEC := [1,0,0,1 Ct5,5,64 Ct5,5,128 Mp3,3 Ct5,5,256 Mp3,3 Lfys128 Lbx256 Lfx512 O1c\#\#\#]
NET_SPEC := [1,0,0,1 Ct5,5,32 Ct5,5,64 Mp3,3 Ct5,5,128 Mp3,3 Ct5,5,256 Lfys128 Lbx256 Lfx256 Lrx256 Lfx512 O1c\#\#\#]
NET_SPEC := [1,0,0,1 Ct7,7,32 Ct7,7,64 Mp3,3 Ct7,7,128 Mp3,3 Lfys128 Lbx256 Lfx512 O1c\#\#\#]
NET_SPEC := [1,0,0,1 Ct5,5,64 Ct5,5,128 Mp3,3 Ct5,5,256 Ct5,5,512 Mp3,3 Lfys128 Lbx256 Lfx512 O1cO1c\#\#\#]

They all resulted in the following Terminal output or similar:

~/Documents/OCR/tesstrain_units_2 (main*) » make training
make[1]: Entering directory '/Users/admin/Documents/OCR/tesstrain_units_2'
/Users/admin/Documents/OCR/tesseract/build/combine_lang_model \
  --input_unicharset data/units/unicharset \
  --script_dir data/langdata \
  --numbers data/units/units.numbers \
  --puncs data/units/units.punc \
  --words data/units/units.wordlist \
  --output_dir data \
  --lang units
Failed to read data from: data/units/units.wordlist
Failed to read data from: data/units/units.punc
Failed to read data from: data/units/units.numbers
Loaded unicharset of size 4393 from file data/units/unicharset
Setting unichar properties
Other case Μ of μ is not in unicharset
Other case Ν of ν is not in unicharset
Other case Ζ of ζ is not in unicharset
Other case Β of β is not in unicharset
Other case Η of η is not in unicharset
Other case Χ of χ is not in unicharset
Other case Ε of ε is not in unicharset
Other case Ι of ι is not in unicharset
Other case Ρ of ρ is not in unicharset
Other case Τ of τ is not in unicharset
Other case Κ of κ is not in unicharset
Other case Υ of υ is not in unicharset
Other case Α of α is not in unicharset
Setting script properties
Warning: properties incomplete for index 3341 = ,
Warning: properties incomplete for index 4025 = ~
Warning: properties incomplete for index 4191 = 腘
Config file is optional, continuing...
Failed to read data from: data/langdata/units/units.config
Null char=2
Created data/units/units.traineddata
/Users/admin/Documents/OCR/tesseract/build/lstmtraining \
  --debug_interval 0 \
  --traineddata data/units/units.traineddata \
  --learning_rate 0.002 \
  --net_spec "[1,0,0,1 Ct5,5,32 Ct5,5,64 Mp3,3 Ct5,5,128 Mp3,3 Lfys128 Lbx256 Lfx512 O1c4393]" \
  --model_output data/units/checkpoints/units \
  --train_listfile data/units/list.train \
  --eval_listfile data/units/list.eval \
  --max_iterations -16 \
  --target_error_rate 0.005 \
  --momentum 0.5 \
  --adam_beta 0.999 \
  --perfect_sample_delay 0 \
  2>&1 | tee -a data/units/training.log
Warning: given outputs 4393 not equal to unicharset of 265.
Num outputs,weights in Series:
  1,0,0,1:1, 0
Num outputs,weights in Series:
  C5,5:25, 0
  Ft32:32, 832
Total weights = 832
  [C5,5Ft32]:32, 832
Num outputs,weights in Series:
  C5,5:800, 0
  Ft64:64, 51264
Total weights = 51264
  [C5,5Ft64]:64, 51264
  Mp3,3:64, 0
Num outputs,weights in Series:
  C5,5:1600, 0
  Ft128:128, 204928
Total weights = 204928
  [C5,5Ft128]:128, 204928
  Mp3,3:128, 0
  TxyLfys128:128, 131584
  Lbx256:512, 788480
  Lfx512:512, 2099200
  Fc265:265, 135945
Total weights = 3412233
Built network:[1,0,0,1[C5,5Ft32][C5,5Ft64]Mp3,3[C5,5Ft128]Mp3,3TxyLfys128Lbx256Lfx512Fc265] from request [1,0,0,1 Ct5,5,32 Ct5,5,64 Mp3,3 Ct5,5,128 Mp3,3 Lfys128 Lbx256 Lfx512 O1c4393]
Training parameters:
  Debug interval = 0, weights = 0.1, learning rate = 0.002, momentum=0.5
null char=264
Compute CTC targets failed for /Users/admin/Documents/OCR/generated_lines/manipulated_images/004/336/167.lstmf!
Compute CTC targets failed for /Users/admin/Documents/OCR/generated_lines/manipulated_images/002/866/877.lstmf!
Compute CTC targets failed for /Users/admin/Documents/OCR/generated_lines/manipulated_images/007/386/256.lstmf!
Compute CTC targets failed for /Users/admin/Documents/OCR/generated_lines/manipulated_images/003/777/429.lstmf!
...

All of the successful and attempted training runs are based on the exact same set of .lstmf, .box, .gt.txt, and .png files.

I suspect the "Compute CTC targets failed for ..." errors stem from my misunderstanding of the VGSL syntax (https://tesseract-ocr.github.io/tessdoc/tess4/VGSLSpecs.html). If so, I would like to know the complete set of rules allowed in NET_SPEC. Thanks.

bertsky commented 6 months ago

AFAIK "Compute CTC targets failed for ..." merely indicates a problem with a single line pair and is not directly indicative of a problem with the network topology. Did you inspect those samples visually?

Also, why do you say these are terminal? Does training not continue?

You already linked to the VGSL docs, which are pretty comprehensive. Here is the implementation (the spec parser).
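To illustrate, here is a reading of your original working spec under those docs (annotations mine, so double-check against the parser; the \# escapes are only there for make):

    # [1,48,0,1 ...   input layer: batch 1, height 48, variable width (0), depth 1 (greyscale)
    # Ct3,3,16        3x3 convolution with tanh nonlinearity, 16 output channels
    # Mp3,3           3x3 max pooling
    # Lfys64          forward LSTM over y, 64 outputs, summarizing (collapses height to 1)
    # Lfx96           forward LSTM over x, 96 outputs
    # Lrx96           reverse (right-to-left) LSTM over x, 96 outputs
    # Lfx512          forward LSTM over x, 512 outputs
    # O1c###]         1-d softmax output trained with CTC; tesstrain substitutes the unicharset size for ###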

In my experience, the problem with custom net specs is more about getting training to converge to low error rates at all. Usually it stays at a BCER in the high nineties (percent). Once you have found a workable spec, you may still need to set a large maximum iteration count to even see the initial drop in error rate, especially if you have many CNN layers. (Note that Tesseract has no 1d or 2d dropout, so training large networks is much harder; it is perhaps best attempted via an append/impact strategy ...)

Your configurations should be fine IMO. What kind of material are you training on?

yaofuzhou commented 6 months ago

@bertsky Thank you for sharing your insights and the references! They are all very relevant to my project.

Back to the topic -

  1. Sorry for the confusion. By "Terminal," I just meant the shell console on my MacBook. I was merely showing you the error messages from tesstrain.

  2. Yes, I visually inspected the .box, .gt.txt, and .png files, which were procedurally generated for my project. The .lstmf files were generated during make lists and have not been modified since the successful run with NET_SPEC := [1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c\#\#\#]

  3. I read in a recent update that .lstmf files are no longer needed for the training process. If so, after removing the .lstmf files from my training set directory, what should I write to list.train and list.eval? For me, these two list files used to list all the .lstmf files, which were built from the .png and .box files I provided. How do I tell the training process that I want to train on the .png and .box files directly?

  4. To your last question: I am trying to build a more powerful OCR model for a mix of Chinese text and mathematical symbols. The goal is a model that copes better with various kinds of noise. I have therefore procedurally generated 10 million text lines (I modified the Makefile to accommodate a more complex directory structure to host this many files) with varying fonts, tilt, background gridlines, and printer/ink/camera effects. All images were precisely labeled for each character in the .box file during the generation process.

I started by fine-tuning the existing chi_sim model, and it plateaued at about a 3% error rate for a while before I decided to try a larger model, which would hopefully be able to absorb the added complexity of my training data.

Then this net spec, NET_SPEC := [1,64,0,1 Ct5,5,32 Mp3,3 Lfys128 Lfx256 Lrx256 Lfx1024 O1c\#\#\#], reached about 35% after a few hundred thousand iterations and entered a long plateau. That was when I decided to try the 5 net specs proposed above. From your feedback, however, the issue may not be entirely the size of the net, and the lack of a dropout mechanism may be a major factor. Should I do some hacking and implement the dropout mechanism myself?

Any experience or insights in training models more powerful than the official Tesseract OCR models would be greatly appreciated.

bertsky commented 6 months ago

Regarding 2: I specifically meant the sample files which produced the warnings. Sometimes (especially with synthetic GT) individual line images really are bogus (inverted / wrongly cropped / noisy), and then that is the reason for the CTC target failure. (In that case you had better remove them from list.train and list.eval.)
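Untested, but something along these lines should pull the offending samples out of the lists (the grep pattern matches the failure message in your log; adjust paths to your setup):

    # collect the .lstmf paths that failed CTC target computation
    grep -o 'Compute CTC targets failed for [^!]*' data/units/training.log \
      | sed 's/^Compute CTC targets failed for //' | sort -u > bad.lst
    # filter them out of the list files
    grep -v -F -f bad.lst data/units/list.train > data/units/list.train.clean
    grep -v -F -f bad.lst data/units/list.eval > data/units/list.eval.clean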

3: that's not correct. The .lstmf files contain both the image matrix and the text, and they are the basic input for the trainer. Removing them would not work (since Tesseract has no in-memory training, everything is loaded from disk).
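For reference, the list files are plain text with one .lstmf path per line, and each .lstmf is built from a line image plus its .box file. Roughly like this (from memory of the Makefile recipe, so check your copy; file names are just examples):

    # produce 167.lstmf next to the image (PSM 13 = raw line)
    tesseract 167.png 167 --psm 13 lstm.train
    # list.train / list.eval are then simply splits of all the .lstmf paths
    find /path/to/lines -name '*.lstmf' | sort > all.lst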

4: Interesting!

For Chinese with its huge character set, make sure you use RECODER=--pass_through_recoder and NORM_MODE=2 because of sparsity.
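A sketch of how that would look on the make command line (variable names as used in the tesstrain Makefile):

    make training MODEL_NAME=units NORM_MODE=2 RECODER=--pass_through_recoder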

10M lines should be enough for training a deep net from scratch on degraded images. And fine-tuning a smaller model on such a large dataset would not make sense (so the plateauing is not surprising). (The stock model was trained on low-quality text with 12M character tokens / 4k character types and 173 fonts, with next to no image degradation.)

I assume you are using greyscale images? (For colour you would need to start the net spec with 1,0,0,3.)

However, I wonder how tilt / slant is useful or realistic in Chinese writing...

Did you use some publicly known method for generating and degrading the images?

Setting width/height to zero (= auto) is important in my experience, so both your initial 48 and 64 may have caused suboptimal learning. I would suggest waiting for one of the 5 other architectures to converge.

Unfortunately, Tesseract training is quite non-standard and complex (due to heavy optimisation efforts in the code), with features like the subtrainer and non-features like checkpointing on training error minima (instead of validation error minima). So I would suggest using the plot device to monitor what is going on (not just staring at the textual log output). It's not Tensorboard, but it at least gives you some idea.
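For example, assuming a recent tesstrain checkout (the plotting scripts live under plot/, and I believe there is a make target for them):

    make plot MODEL_NAME=units   # renders error-rate curves from the training/eval logs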

Once you know the true new plateau, you can still invest more effort (like pretraining shallower nets and then appending iteratively). If you can really implement dropout – everyone would love you for it. But it is probably very hard to do, judging by the quality of the code.

bertsky commented 6 months ago

  3. I read in a recent update that .lstmf files are no longer needed

Ah, I had not seen this yet. But I would not trust it until it is thoroughly tested...

yaofuzhou commented 6 months ago

@bertsky Thank you again for sharing all of your insights!

Did you use some publicly known method for generating and degrading the images?

Yes and no -

  • I occasionally add a line of random but reasonable thickness above and/or below my text to simulate text near the gridlines of a table
  • I randomly tilt the text strip by up to 10 degrees, and I randomly scale the image down in the y direction to simulate less-than-ideal camera/scanner placement
  • I use random numbers to add or subtract values at and near gray/black pixels to simulate dirty printers and different lighting conditions (a rough approximation is sketched below).
  • Still in progress, but I use various combinations of signal analysis techniques to extract noise due to paper texture and lighting conditions from a set of sample images specific to my project.

I hope this makes for a reasonably comprehensive list of effects for more general use...
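For the pixel-level noise, the effect is roughly what ImageMagick would give with something like this (not my actual code, just an illustration of the idea):

    # add attenuated Gaussian noise to mimic dirty printing / uneven lighting
    convert line.png -attenuate 0.4 +noise Gaussian line_noisy.png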

I re-shuffled my list.train and list.eval and make training seems to be working again. I will close the issue and let you know if great things happen ;)

bertsky commented 6 months ago

I see.

  • I occasionally add a line of random but reasonable thickness above and/or below my text to simulate text near the gridlines of a table

That's a neat idea. I have not seen this supported by any synthetic data generation tool yet.

  • I randomly tilt the text strip by up to 10 degrees, and I randomly scale the image down in the y direction to simulate less-than-ideal camera/scanner placement

Oh, I see. Yes, tilt in the sense of rotation is always useful. If you want perspective (a tilted camera angle), though, don't just modify the aspect ratio; apply a keystone effect directly (e.g. with ImageMagick's convert -distort Arc 10%).
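For an actual keystone, ImageMagick's perspective distortion may be the closer fit. A sketch with purely illustrative coordinates (four src -> dst corner pairs, assuming a nominal 1000x100 line image):

    convert line.png -virtual-pixel white \
      -distort Perspective '0,0 15,0  1000,0 985,0  0,100 0,100  1000,100 1000,100' \
      line_keystone.png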

  • I use random numbers to add or subtract values at and near gray/black pixels to simulate dirty printers and different lighting conditions.

Good idea. I am not sure I have ever seen that one anywhere, either.

Perhaps you want to play around with other synthetic data generation and degradation tools, too.

Thank you for explaining your work. Good luck – hope to hear from you again!