tesseract-ocr / tessdata_best

Best (most accurate) trained LSTM models.
Apache License 2.0

convert eng training to h5 model #71

Open ehrenmann1977 opened 2 years ago

ehrenmann1977 commented 2 years ago

How can I export a Keras model for the English language? Is it possible to export the corpus so I can do some neural network training with it? I mean something like the MNIST dataset.

stweil commented 1 year ago

Good question. Tesseract uses its own model file format, but it should be possible to convert the included neural network to any other model format that supports the same network specification.

We still need to find someone who wants to implement that conversion (and also the reverse direction).

stefan6419846 commented 1 year ago

Is there any documentation available on the model file format Tesseract uses (*.traineddata file format specification)?

stweil commented 1 year ago

The command-line tool combine_tessdata can list and extract all components of a model file:

% combine_tessdata -d /opt/homebrew/share/tessdata/eng.traineddata 
Version:4.00.00alpha:eng:synth20170629
17:lstm:size=401636, offset=192
18:lstm-punc-dawg:size=4322, offset=401828
19:lstm-word-dawg:size=3694794, offset=406150
20:lstm-number-dawg:size=4738, offset=4100944
21:lstm-unicharset:size=6360, offset=4105682
22:lstm-recoder:size=1012, offset=4112042
23:version:size=30, offset=4113054
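
For anyone who wants to work with that listing programmatically, here is a minimal Python sketch that parses the output shown above into records. The names `Component` and `parse_listing` are made up for illustration and are not part of Tesseract. The line format is inferred only from this one sample, so other versions of the tool may print something different.

```python
import re
from typing import List, NamedTuple


class Component(NamedTuple):
    """One entry from the listing (hypothetical helper type)."""
    index: int
    name: str
    size: int
    offset: int


# Pattern inferred from the sample output, e.g. "17:lstm:size=401636, offset=192".
_ENTRY = re.compile(r"^(\d+):([\w-]+):size=(\d+), offset=(\d+)$")


def parse_listing(text: str) -> List[Component]:
    """Collect all component entries; other lines (e.g. "Version:...") are skipped."""
    components = []
    for line in text.splitlines():
        match = _ENTRY.match(line.strip())
        if match:
            index, name, size, offset = match.groups()
            components.append(Component(int(index), name, int(size), int(offset)))
    return components


# The listing copied from the comment above.
LISTING = """\
Version:4.00.00alpha:eng:synth20170629
17:lstm:size=401636, offset=192
18:lstm-punc-dawg:size=4322, offset=401828
19:lstm-word-dawg:size=3694794, offset=406150
20:lstm-number-dawg:size=4738, offset=4100944
21:lstm-unicharset:size=6360, offset=4105682
22:lstm-recoder:size=1012, offset=4112042
23:version:size=30, offset=4113054
"""

components = parse_listing(LISTING)

# In this sample each component starts exactly where the previous one ends.
for prev, cur in zip(components, components[1:]):
    assert prev.offset + prev.size == cur.offset
```

In this particular listing the components are stored back to back, which suggests (but does not prove) that the offsets are plain byte positions within the file.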

Another tool, dawg2wordlist, can convert the dawg components to plain text files, and the unicharset is already text. That's the easy part.

The interesting part is the lstm component with the neural network. It's not documented, so the program code is the reference for it. Look for DeSerialize in the lstm code.
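
To look at the raw lstm bytes next to the DeSerialize code, something like the following Python sketch could be used. It assumes that the offset and size printed by combine_tessdata -d are absolute byte positions in the .traineddata file, which matches the sample listing but is not documented here. `read_component` is a hypothetical helper, not a Tesseract API, and it only copies bytes; it does not decode the network.

```python
def read_component(model_path: str, offset: int, size: int) -> bytes:
    """Return `size` raw bytes starting at byte `offset` of the model file.

    Assumption: offsets are counted from the start of the file.
    """
    with open(model_path, "rb") as model_file:
        model_file.seek(offset)
        data = model_file.read(size)
    if len(data) != size:
        # The file is shorter than the listing claims; refuse to return a partial blob.
        raise ValueError(
            f"expected {size} bytes at offset {offset}, got {len(data)}"
        )
    return data


# Example call with the lstm entry from the listing above (path is a placeholder):
# lstm_blob = read_component("eng.traineddata", offset=192, size=401636)
```

The resulting bytes are still in Tesseract's own serialization, so converting them to another model format would mean reimplementing what DeSerialize does.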