Closed yaofuzhou closed 6 months ago
AFAIK, `Compute CTC targets failed for` is merely a problem with a single line pair and not directly indicative of a problem with the network topology. Did you inspect those samples visually?
Also, why do you say these are terminal? Does training not continue?
You already linked to the VGSL docs, which are pretty comprehensive. Here is the implementation (the spec parser).
In my experience, the problem with custom net specs is more with getting training to converge to low error rates at all. Usually it stays in the high nineties percentage BCER. Once you found a workable spec, you may still need to set a large max iterations to even see the initial drop in error rate, esp. if you have many CNN layers. (Note that Tesseract has no 1d or 2d dropout, so training large networks is much harder, perhaps best attempted via append/impact strategy ...)
Your configurations should be fine IMO – what kind of material are you trying?
@bertsky Thank you for sharing your insights and the references! They are all very relevant to my project.
Back to the topic -
Sorry for the confusion. By "Terminal," I just meant the shell console for Macbook. I was merely showing you the error message from Tesstrain.
Yes, I visually inspected the .box, .gt.txt, and .png files, which were procedurally generated for my project. The .lstmf files were generated during `make lists`, and were not modified since the successful run with `NET_SPEC := [1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c###]`.
I read in a recent update that .lstmf files are no longer needed for the training process. If so, after removing the .lstmf files from my training set directory, what info do I write to `list.train` and `list.eval`? For me, these two list files used to list all the .lstmf files, which are based on the .png and .box files I provided. How do I tell the training process that I want to train on the .png and .box files directly?
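For reference, in the standard tesstrain setup `list.train` and `list.eval` are plain text files containing one `.lstmf` path per line. A minimal sketch of regenerating them (the directory name, seed, and 5% hold-out split are illustrative assumptions, not part of the original discussion):

```python
import random
from pathlib import Path

# Hypothetical ground-truth directory; adjust to your own tree.
data_dir = Path("data/my_model-ground-truth")

# tesstrain list files are plain text: one .lstmf path per line.
if data_dir.is_dir():
    lstmf_files = sorted(str(p.resolve()) for p in data_dir.rglob("*.lstmf"))
else:
    lstmf_files = []

random.seed(42)
random.shuffle(lstmf_files)

# Hold out roughly 5% of the lines for evaluation.
n_eval = max(1, len(lstmf_files) // 20)
Path("list.eval").write_text("\n".join(lstmf_files[:n_eval]) + "\n")
Path("list.train").write_text("\n".join(lstmf_files[n_eval:]) + "\n")
```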
To your last question: again, I am trying to generate a more powerful OCR model for a mix of the Chinese language and some mathematical symbols. The goal is to have a model that is more capable of dealing with various kinds of noise. I have therefore procedurally generated 10 million text lines (I modified the Makefile to accommodate a more complex directory structure to host this many files) with varying fonts, tilt, background gridlines, and printer/ink/camera effects. All images were precisely labeled for each character in the .box file during the generation process.
I started by fine-tuning the existing chi_sim model, and it plateaued at about a 3% error rate for a while before I decided to try a larger model, which will hopefully be able to absorb the added complexity of my training information.
Then this net spec, `NET_SPEC := [1,64,0,1 Ct5,5,32 Mp3,3 Lfys128 Lfx256 Lrx256 Lfx1024 O1c###]`, got to about 35% after a few hundred thousand iterations and entered a long plateau. That was when I wanted to try the 5 proposed net specs. However, from your feedback, the issue may not be entirely the size of the net, and the lack of a dropout mechanism may be a major factor. Should I do some hacking and implement the dropout mechanism myself?
Any experience or insights in training models more powerful than the official Tesseract OCR models will be greatly appreciated.
Regarding 2: I specifically meant the sample files which produced the warnings. Sometimes (esp. with synthetic GT) individual line images really are bogus (inverted / wrongly cropped / noisy) and then that's the reason for the CTC target failure. (In that case you better remove them from list.train and list.eval.)
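Removing such samples from the list files can be scripted; a small sketch (the `bad_samples` paths are hypothetical placeholders for paths copied out of the `Compute CTC targets failed for ...` lines of the training log):

```python
from pathlib import Path

# Hypothetical: .lstmf paths copied from the
# "Compute CTC targets failed for ..." lines of the training log.
bad_samples = {
    "/data/lines/00001234.lstmf",
    "/data/lines/00005678.lstmf",
}

def drop_bad_samples(list_file, bad):
    """Remove known-bogus .lstmf entries from a tesstrain list file."""
    p = Path(list_file)
    if not p.exists():
        return
    kept = [ln for ln in p.read_text().splitlines() if ln.strip() not in bad]
    p.write_text("\n".join(kept) + "\n")

for name in ("list.train", "list.eval"):
    drop_bad_samples(name, bad_samples)
```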
3: that's not correct – the .lstmf files contain both the image matrix and the text and are the basic input for the trainer. Removing them would not work (since Tesseract has no in-memory training, everything is loaded from disk).
4: Interesting!
For Chinese with its huge character set, make sure you use `RECODER=--pass_through_recoder` and `NORM_MODE=2` because of sparsity.
10M should be enough for training of a deep net from scratch with degraded images. And finetuning a smaller model on this large dataset would not make sense (so the plateauing is not surprising). (The stock model was trained on low-quality text with 12M character tokens / 4k character types and 173 fonts with next to no image degradation.)
I assume you are using greyscale images? (For colour you would need to start the net spec with `1,0,0,3`.)
However, I wonder how tilt / slant is useful or realistic in Chinese writing...
Did you use some publicly known method for generating and degrading the images?
Setting width/height to zero (=auto) is important in my experience. So both your initial 48 and 64 may have caused suboptimal learning. I would suggest waiting for one of the 5 other architectures to converge.
Unfortunately, Tesseract training is quite non-standard and complex (due to heavy optimisation efforts in the code), with features like the subtrainer and non-features like checkpointing from training error minima (instead of validation error minima). So I would suggest using the `plot` device to monitor what is going on (not just staring at the textual log output). It's not Tensorboard, but it at least gives you some idea.
Once you know the true new plateau, you can still invest more effort (like pretraining shallower nets and then appending iteratively). If you can really implement dropout – everyone would love you for it. But it is probably very hard to do, judging by the quality of the code.
> 3. in a recent update that .lstmf files are no longer needed

ah, did not see this yet. But I would not trust it until thoroughly tested...
@bertsky Thank you again for sharing all of your insights!
> Did you use some publicly known method for generating and degrading the images?

Yes and no:
- I occasionally add a line with random but reasonable thickness above and/or below my texts to simulate the texts near gridlines of a table
- I randomly tilt the text stripe up to 10 degrees, and I randomly scale down the image in the y direction to simulate less-than-ideal camera/scanner placement
- I use random numbers to add and/or subtract values to/near gray/black pixels to simulate dirty printers and different lighting conditions.
- Still in progress, but I use various combinations of signal-analysis techniques to extract noise due to paper textures and lighting conditions from a set of sample images specific to my project.

I hope that is a reasonably comprehensive list of effects for more general-purpose use...
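The effects above can be sketched on a toy greyscale image in pure Python (all function names and parameter values here are illustrative, not the poster's actual pipeline; 0 = black, 255 = white):

```python
import random

random.seed(0)

def make_blank(width, height, value=255):
    """A blank greyscale line image as a list of rows."""
    return [[value] * width for _ in range(height)]

def add_gridline(img, row, thickness=2, value=0):
    """Simulate a table gridline: a dark horizontal rule near the text."""
    for r in range(row, min(row + thickness, len(img))):
        img[r] = [value] * len(img[r])

def add_printer_noise(img, amount=30, prob=0.05):
    """Randomly lighten/darken pixels to mimic dirty printers / uneven light."""
    for row in img:
        for c in range(len(row)):
            if random.random() < prob:
                row[c] = max(0, min(255, row[c] + random.randint(-amount, amount)))

def squeeze_vertically(img, factor=0.8):
    """Drop rows to simulate a camera squeezing the line in the y direction."""
    keep = max(1, int(len(img) * factor))
    step = len(img) / keep
    return [img[int(i * step)] for i in range(keep)]

img = make_blank(200, 48)
add_gridline(img, row=0)    # rule above the text
add_gridline(img, row=44)   # rule below the text
add_printer_noise(img)
img = squeeze_vertically(img, 0.8)
```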
I re-shuffled my `list.train` and `list.eval`, and `make training` seems to be working again. I will close the issue and let you know if great things happen ;)
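For reference, such a re-shuffle can be scripted; a minimal sketch (the file names follow the tesstrain convention, and the seed is arbitrary):

```python
import random
from pathlib import Path

def shuffle_list_file(path, seed=None):
    """Shuffle the order of .lstmf paths in a tesstrain list file in place."""
    p = Path(path)
    lines = p.read_text().splitlines()
    rng = random.Random(seed)
    rng.shuffle(lines)
    p.write_text("\n".join(lines) + "\n")

for name in ("list.train", "list.eval"):
    if Path(name).exists():
        shuffle_list_file(name, seed=42)
```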
I see.
> I occasionally add a line with random but reasonable thickness above and/or below my texts to simulate the texts near gridlines of a table
That's a neat idea. I have not seen this supported by any synthetic data generation tool yet.
> I randomly tilt the text stripe up to 10 degrees, and I randomly scale down the image in the y direction to simulate less-than-ideal camera/scanner placement

Oh, I see. Yes, tilt in the sense of rotation is always useful. If you want perspective (tilted camera angle) though, don't just modify the aspect ratio; directly apply a keystone effect (e.g. with ImageMagick's `convert -distort Arc 10%`).
> I use random numbers to add and/or subtract values to/near gray/black pixels to simulate dirty printers and different lighting conditions.
Good idea. Not sure I have ever seen that either so far.
Perhaps you want to play around with other tools: `ketos linegen` from Kraken.

Thank you for explaining your work. Good luck – hope to hear from you again!
Hi, I have successfully run the training program on a training set with the following `NET_SPEC`:

NET_SPEC := [1,48,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c###]

which is slightly modified from that of an official Tesseract OCR model. Then, when I did the same experiment with the following `NET_SPEC`s, they all resulted in the following Terminal output or similar:
All of the successful and attempted training runs are based on the exact same set of .lstmf, .box, .gt.txt, and .png files.
I suspect the `Compute CTC targets failed for ...` errors come from my misunderstanding of the VGSL syntax (https://tesseract-ocr.github.io/tessdoc/tess4/VGSLSpecs.html). If so, I wish to know the complete set of rules allowed in `NET_SPEC`. Thanks.