manisandro / gImageReader

A Gtk/Qt front-end to tesseract-ocr.
GNU General Public License v3.0

GUI to assist in fine tuning/teaching Tesseract on scanned images #529

Open AvtechScientific opened 3 years ago

AvtechScientific commented 3 years ago

It would be nice to have GUI elements that assist in fine-tuning/teaching Tesseract on scanned images, similar to what jTessBoxEditor does, as described in this article[^*]. The main need is creating the .tiff and .box files.

[^*]: Not all of the commands listed in the article worked for me. Here are slightly corrected versions:

java -jar jTessBoxEditor.jar

tesseract --psm 6 --oem 3 font_name.font.exp0.tif font_name.font.exp0 makebox

# Create font_properties with one line: <fontname> <italic> <bold> <fixed> <serif> <fraktur>
nano font_properties
font 0 0 0 0 0

# Create a .tr file (training file)
tesseract font_name.font.exp0.tif font_name.font.exp0 nobatch box.train

# Create a unicharset file
unicharset_extractor font_name.font.exp0.box

# Create a shapetable file
shapeclustering -F font_properties -U unicharset -O font_name.unicharset font_name.font.exp0.tr

# Create the pffmtable and inttemp files
mftraining -F font_properties -U unicharset -O font_name.unicharset font_name.font.exp0.tr

# Create a normproto file
cntraining font_name.font.exp0.tr

mv shapetable font_name.shapetable
mv normproto font_name.normproto
mv pffmtable font_name.pffmtable
mv inttemp font_name.inttemp

combine_tessdata font_name.

Now copy font_name.traineddata to the tessdata directory:
sudo cp font_name.traineddata /usr/share/tesseract-ocr/4.00/tessdata/

Now test the new traineddata:
tesseract test_numbers.png stdout -l font_name

AKmatiAK commented 2 years ago

Yes, this is one of the basic features necessary for an OCR program. If it gets added, I can donate to support development. Just make a simple GUI for modifying the Tesseract configuration file, with a short description of each parameter on hover.

manisandro commented 2 years ago

Probably the fastest way to achieve this is for someone to contribute the code via a PR. I won't have the capacity to work on this in the near future.

khashashin commented 1 year ago

I created a simple Python script that extracts the boxes from the HTML file. In gImageReader you should export the edited image as HTML and then use the script to extract the boxes: https://github.com/khashashin/chechen_ocr
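
For anyone wondering what such an extraction could look like, here is a minimal sketch. It is not the script from the linked repository. It assumes the exported HTML is hOCR, with the page size in the `bbox` of an `ocr_page` element and one `ocrx_word` span per word. The names `HocrWordParser` and `hocr_to_box_lines` are made up for illustration. hOCR measures y from the top of the page, while Tesseract box files measure it from the bottom, so the sketch flips the y coordinates. It emits one row per word; box files for the legacy trainer are per character, so the words would still need to be split.

```python
import re
from html.parser import HTMLParser

# hOCR stores coordinates in the title attribute as "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class HocrWordParser(HTMLParser):
    """Hypothetical helper: turn ocrx_word spans into box-style rows."""

    def __init__(self):
        super().__init__()
        self.page = -1           # index of the current ocr_page, 0-based
        self.page_height = None  # needed to flip the y axis
        self.lines = []          # "text left bottom right top page" rows
        self._bbox = None        # bbox of the word span currently open
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if not match:
            return
        box = tuple(int(v) for v in match.groups())
        if "ocr_page" in classes:
            self.page += 1
            self.page_height = box[3]
        elif "ocrx_word" in classes:
            self._bbox = box
            self._text = []

    def handle_data(self, data):
        # Only text inside an open word span is collected.
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or self._bbox is None:
            return
        if self.page_height is None:
            raise ValueError("word found before any ocr_page bbox")
        text = "".join(self._text).strip()
        x0, y0, x1, y1 = self._bbox
        h = self.page_height
        if text:
            # Flip y: hOCR origin is top-left, box files use bottom-left.
            self.lines.append(f"{text} {x0} {h - y1} {x1} {h - y0} {self.page}")
        self._bbox = None


def hocr_to_box_lines(hocr_html):
    """Return one box-style row per word found in the hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr_html)
    parser.close()
    return parser.lines
```

Calling `hocr_to_box_lines(open("export.html", encoding="utf-8").read())` and writing the rows to a `.box` file would be the obvious next step; `export.html` is a placeholder name.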