Open ehrenmann1977 opened 2 years ago
Good question. Tesseract uses its own model file format. But it should be possible to convert the included neural network to any other model format which supports the same network specification.
We still have to find someone who wants to implement that (and also the other direction).
Is there any documentation available on the model file format Tesseract uses (a *.traineddata file format specification)?
There is a command line tool, combine_tessdata, which can list and extract all components from a model file:
% combine_tessdata -d /opt/homebrew/share/tessdata/eng.traineddata
Version:4.00.00alpha:eng:synth20170629
17:lstm:size=401636, offset=192
18:lstm-punc-dawg:size=4322, offset=401828
19:lstm-word-dawg:size=3694794, offset=406150
20:lstm-number-dawg:size=4738, offset=4100944
21:lstm-unicharset:size=6360, offset=4105682
22:lstm-recoder:size=1012, offset=4112042
23:version:size=30, offset=4113054
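To work with that listing programmatically, here is a minimal Python sketch that parses the component lines shown above and checks that the components sit back to back in the file. The regular expression and the helper names (`parse_listing`, `is_contiguous`) are my own assumptions based only on the output format printed here, not on any documented specification.

```python
import re

# Component lines copied from the combine_tessdata -d output above.
LISTING = """\
Version:4.00.00alpha:eng:synth20170629
17:lstm:size=401636, offset=192
18:lstm-punc-dawg:size=4322, offset=401828
19:lstm-word-dawg:size=3694794, offset=406150
20:lstm-number-dawg:size=4738, offset=4100944
21:lstm-unicharset:size=6360, offset=4105682
22:lstm-recoder:size=1012, offset=4112042
23:version:size=30, offset=4113054
"""

# Assumed line shape: <index>:<name>:size=<bytes>, offset=<bytes>
ENTRY_RE = re.compile(r"^(\d+):([\w-]+):size=(\d+), offset=(\d+)$")


def parse_listing(text):
    """Return one dict per component line; other lines are skipped."""
    entries = []
    for line in text.splitlines():
        match = ENTRY_RE.match(line.strip())
        if match:
            index, name, size, offset = match.groups()
            entries.append(
                {"index": int(index), "name": name,
                 "size": int(size), "offset": int(offset)}
            )
    return entries


def is_contiguous(entries):
    """True if each component starts where the previous one ends."""
    return all(
        a["offset"] + a["size"] == b["offset"]
        for a, b in zip(entries, entries[1:])
    )


entries = parse_listing(LISTING)
for entry in entries:
    print(f'{entry["name"]:<18} {entry["size"]:>9} bytes at {entry["offset"]}')
print("contiguous:", is_contiguous(entries))
```

For this listing the lstm component is by far not the largest part: the word dawg takes about 3.7 MB, while the network itself is about 400 kB.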
Another tool, dawg2wordlist, can convert the dawg components to normal text files, and the unicharset is already text. That's the easy part.
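The easy part could be scripted roughly like this. It is only a sketch: it assumes the `-u` (unpack) option of combine_tessdata and the `dawg2wordlist UNICHARSET DAWG WORDLIST` argument order from the tools' man pages, and it assumes that unpacked files are named by appending the component name to the given prefix. The paths and the helper names are hypothetical.

```python
import subprocess


def unpack_command(model_path, prefix):
    """Command that unpacks every component of a model to <prefix><component>."""
    return ["combine_tessdata", "-u", str(model_path), str(prefix)]


def wordlist_command(prefix, dawg_name, wordlist_path):
    """Command that converts one unpacked dawg into a plain text word list."""
    return [
        "dawg2wordlist",
        f"{prefix}lstm-unicharset",  # unicharset unpacked from the same model
        f"{prefix}{dawg_name}",      # e.g. "lstm-word-dawg"
        str(wordlist_path),
    ]


def export_wordlists(model_path, prefix):
    """Run the unpack step, then convert the three lstm dawgs (not called here)."""
    subprocess.run(unpack_command(model_path, prefix), check=True)
    for dawg_name in ("lstm-punc-dawg", "lstm-word-dawg", "lstm-number-dawg"):
        out_path = f"{prefix}{dawg_name}.txt"
        subprocess.run(wordlist_command(prefix, dawg_name, out_path), check=True)


# Hypothetical output prefix; only the command is built, nothing is executed.
print(" ".join(unpack_command("eng.traineddata", "out/eng.")))
```

Calling `export_wordlists("eng.traineddata", "out/eng.")` would run the tools, provided both are installed and the `out` directory exists.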
The interesting part is the lstm component with the neural network. It's not documented, so the program code is the reference for it. Look for DeSerialize in the lstm code.
How can I export a Keras model for the English language? Is it possible to export the corpus so that I can do some neural network training with it? I mean something like the MNIST dataset.