[Question] How to generate tif line images from tif pages / How to train with no specified language?

Marco-Parente commented 2 years ago

Hello, good morning :)

I'm looking for some general guidance because I'm a bit lost, so sorry if this is a dumb question.

I want to train a Tesseract model to do well on Brazilian car license-plate characters, so I used a regex text generator to generate 100,000 lines of characters in roughly the format we use here...

Then I downloaded the langdata_lstm files and, inside the eng folder, replaced the contents of eng.training_text with the plate-like text I generated, because the previous content had characters and text formats I won't use (I only need the characters [A-Z] and [0-9]).
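For readers who want to reproduce this step, here is a minimal sketch of such a generator. It assumes two plate layouts (three letters plus four digits, and the Mercosul layout of three letters, a digit, a letter and two digits); the function names and the seed parameter are hypothetical and not taken from the original post.

```python
import random
import string

LETTERS = string.ascii_uppercase
DIGITS = string.digits

# Assumed plate layouts: "L" = letter, "N" = digit.
# LLLNNNN is the older layout, LLLNLNN the Mercosul one.
PATTERNS = ("LLLNNNN", "LLLNLNN")


def random_plate(rng, pattern):
    """Build one plate string by filling each slot of the pattern."""
    return "".join(
        rng.choice(LETTERS) if slot == "L" else rng.choice(DIGITS)
        for slot in pattern
    )


def generate_training_text(n_lines, seed=0):
    """Return n_lines plate-like strings, picking a layout at random per line."""
    rng = random.Random(seed)
    return [random_plate(rng, rng.choice(PATTERNS)) for _ in range(n_lines)]


if __name__ == "__main__":
    # Print a few samples; redirect to a training_text file as needed.
    for plate in generate_training_text(5):
        print(plate)
```

Seeding the generator keeps the training text reproducible between runs, which helps when comparing models trained on the same data.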

After that, I ran the following:

python3 src/training/tesstrain.py --fonts_dir /Documents/dev/tesseract-tutorial/fonts --fontlist 'FE-Font' 'Mandatory' --lang eng --linedata_only --langdata_dir /Documents/dev/tesseract-tutorial/langdata_lstm --tessdata_dir /Documents/dev/tesseract-tutorial/tesseract/tessdata --save_box_tiff --maxpages 200 --output_dir train

I passed eng as the language because one is required, but there is no actual language involved, since it's only license-plate characters... right?

After this command I got tif, lstmf and box files for the fonts I used, but they contain multiple lines and multiple pages (200).

After looking through the docs for a while, I saw that with the script from https://github.com/tesseract-ocr/tesstrain/issues/7 you can turn png pages into one-line tif images with their respective transcriptions... but I didn't see a way to do that with tif images.
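One possible way to split tif pages into lines is to use the box files that --save_box_tiff writes next to them. The sketch below is not the script from the linked issue. It assumes each box line has the form "symbol left bottom right top page" with the origin at the bottom-left corner, and that some Tesseract versions mark line ends with a tab entry. The names Line, group_boxes_into_lines and pil_crop_box are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Line:
    page: int
    text: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_line(raw):
    """Parse '<symbol> <left> <bottom> <right> <top> <page>'."""
    symbol, left, bottom, right, top, page = raw.rstrip("\r\n").rsplit(" ", 5)
    return symbol, int(left), int(bottom), int(right), int(top), int(page)


def group_boxes_into_lines(box_text):
    """Group character boxes into text lines by page and vertical overlap."""
    lines = []
    current = None
    for raw in box_text.splitlines():
        if not raw.strip():
            continue
        symbol, left, bottom, right, top, page = parse_box_line(raw)
        if symbol == "\t":
            # Assumed end-of-line marker: force a new line on the next box.
            current = None
            continue
        overlaps = (
            current is not None
            and current.page == page
            and bottom < current.top
            and top > current.bottom
        )
        if not overlaps:
            current = Line(page, "", left, bottom, right, top)
            lines.append(current)
        current.text += symbol
        current.left = min(current.left, left)
        current.bottom = min(current.bottom, bottom)
        current.right = max(current.right, right)
        current.top = max(current.top, top)
    return lines


def pil_crop_box(line, image_height, margin=2):
    """Convert to a top-left-origin (left, upper, right, lower) crop box."""
    return (
        max(line.left - margin, 0),
        max(image_height - line.top - margin, 0),
        line.right + margin,
        min(image_height - line.bottom + margin, image_height),
    )
```

Each resulting crop box could then be passed to an image library (for example Pillow's Image.crop on the matching page) and saved together with a .gt.txt file holding line.text, which is the pairing the tesstrain Makefile expects.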

So I wanted to ask the following:

Can I generate the tif files without specifying a language?
Can I take the multi-line / multi-page tif files and turn them into the one-line tifs needed for training?
Am I doing this training process the right way or am I complicating things?

Thanks in advance!

Shreeshrii commented 2 years ago

tesstrain.py creates lstmf files, which can be used directly by lstmtraining. However, the tesstrain Makefile does not directly support them.
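As a rough illustration of using the lstmf files directly, the sketch below builds the two commands typically involved in fine-tuning: extracting the LSTM from a base model with combine_tessdata, then continuing training with lstmtraining. The helper name finetune_commands, all paths, the "plates" output prefix and the iteration count are assumptions, and the base model is assumed to be a float (tessdata_best) eng.traineddata, since the integer "fast" models cannot be trained further.

```python
import shlex


def finetune_commands(base_traineddata, train_dir, out_dir,
                      lang="eng", max_iterations=400):
    """Return the argv lists for extracting the LSTM and fine-tuning it."""
    extracted_lstm = f"{out_dir}/{lang}.lstm"
    extract = ["combine_tessdata", "-e", base_traineddata, extracted_lstm]
    train = [
        "lstmtraining",
        "--continue_from", extracted_lstm,
        "--traineddata", base_traineddata,
        # File listing the lstmf files written by tesstrain.py (assumed path).
        "--train_listfile", f"{train_dir}/{lang}.training_files.txt",
        "--model_output", f"{out_dir}/plates",
        "--max_iterations", str(max_iterations),
    ]
    return [extract, train]


if __name__ == "__main__":
    # Hypothetical paths; print the commands instead of running them.
    for cmd in finetune_commands("tessdata_best/eng.traineddata", "train", "output"):
        print(shlex.join(cmd))
```

Printing the commands rather than executing them makes it easy to review the paths first; they could be run with subprocess.run(cmd, check=True) once they look right.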

Please see https://github.com/Shreeshrii/tess5train-fonts/blob/main/license_plate.sh and https://github.com/Shreeshrii/tess5train-fonts/tree/main/data/BrazilPlates, and the plot at https://github.com/Shreeshrii/tess5train-fonts/blob/main/data/BrazilPlates/plots/BrazilPlates-6.png

These show the result of a test training I did by fine-tuning eng.traineddata.

As the plot shows, training until the target error rate is reached leads to overfitting. The best results will probably come from the traineddata files of the 400-700 checkpoints. You can test with real-life images to verify the results.

Also, as @stweil mentioned recently in a related thread, you can fine-tune with 100+ real-life single-line images of license plates and their ground truth using the tesstrain Makefile.
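To compare checkpoints on real images, one simple measure is the character error rate: the edit distance between the OCR output and the ground truth, divided by the ground-truth length. The sketch below is a plain Levenshtein implementation, not the evaluation tooling used for the plot above; the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        cur = [i]
        for j, char_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                       # deletion
                cur[j - 1] + 1,                    # insertion
                prev[j - 1] + (char_a != char_b),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def character_error_rate(pairs):
    """pairs is an iterable of (ocr_text, ground_truth) tuples."""
    pairs = list(pairs)
    errors = sum(edit_distance(ocr, truth) for ocr, truth in pairs)
    total = sum(len(truth) for _, truth in pairs)
    return errors / total if total else 0.0
```

Running each candidate traineddata file over the same set of real plate images and keeping the one with the lowest rate is one way to pick a checkpoint without relying on the training error alone.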

tesseract-ocr / tesstrain

[Question] How to generate tif line images from tif pages / How to train with no specified language? #302