Closed wincentbalin closed 3 years ago
Do you have trained models for Akkadian? If yes: Are they for the legacy OCR engine or for the newer one which uses a neural network, and are they already used somewhere?
Indeed, I have trained models: one for Tesseract 3.04, another for Tesseract 4 (trained with the same recipe that @Shreeshrii used in tesstrain-akk, for 1 million epochs), and a third that is currently training for Tesseract 4 with the pythonised tesstrain toolset for 2 million epochs, which I will evaluate once it finishes.
Having these models trained, I will try to integrate them into the Text Fairy app, either by kindly asking @renard314 or by forking it and adding the model myself. Apart from that, these models are not used anywhere yet.
The latest versions of Text Fairy are no longer open source, but I'll gladly add the model once it's ready.
@renard314, which model does the current version of Text Fairy need, Tesseract 3 or Tesseract 4?
Textfairy is using Tess 5, so I need an LSTM model (preferably integerized like tessdata fast)
Dear @stweil , dear @renard314 ,
as they say in Germany, "Was lange währt, wird endlich gut" ("Good things come to those who wait"). Until recently, I trained and fine-tuned multiple versions of .traineddata files, both for the NN-based and for the legacy Tesseract. All of them were trained with the langdata from this pull request, with 9 different fonts.
I attach a .zip archive containing the files with the best parameters: akk-traineddata.zip. The filenames indicate which engine each file was trained for and whether it is the best or the fast version.
Feel free to test them! Then, if the results are sufficient, I would like to see this pull request merged into this repository.
Thank you for your contribution and your patience.
I applied your changes to https://github.com/tesseract-ocr/langdata_lstm/ which is used for Tesseract 4 and newer.
Maybe you can make a pull request there to add akk.unicharset, which is still missing.
Sorry, I just saw that you also trained a legacy model, so merging your contribution to langdata is also reasonable.
@wincentbalin, can you describe all steps required for the training which you have done, perhaps in the Wiki? That would help with future enhancements, fixes or updates.
@stweil, do you mean the tessdoc repository or another Wiki?
Also, to which repositories should each of the .traineddata files from https://github.com/tesseract-ocr/langdata/pull/150#issuecomment-885940792 go?
I suggest using https://github.com/tesseract-ocr/tesstrain/wiki.
It seems I cannot edit that wiki. Do I need certain permissions to do so?
Sorry, you are right. I fixed the settings, so editing should now work for you.
@stweil, I've recorded the training steps for Akkadian here. Would you like to take a look?
@renard314 Finally! You can add Akkadian language OCR to TextFairy. The URL to the fast LSTM model is https://github.com/tesseract-ocr/tessdata_contrib/raw/main/akk/fast/akk.traineddata .
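For anyone who wants to try the fast LSTM model outside TextFairy, a minimal sketch in Python could look like the following. It assumes the tesseract executable (version 4 or newer) is on the PATH; the local directory name tessdata and the image name tablet.png mentioned below are hypothetical placeholders, and the helper names are made up for illustration. Only the model URL is taken from the comment above.

```python
import subprocess
import urllib.request
from pathlib import Path
from typing import List

# URL of the fast LSTM model, as given in the comment above.
MODEL_URL = (
    "https://github.com/tesseract-ocr/tessdata_contrib"
    "/raw/main/akk/fast/akk.traineddata"
)


def fetch_model(tessdata_dir: Path, url: str = MODEL_URL) -> Path:
    """Download akk.traineddata into tessdata_dir unless it already exists."""
    tessdata_dir.mkdir(parents=True, exist_ok=True)
    target = tessdata_dir / "akk.traineddata"
    if not target.exists():
        urllib.request.urlretrieve(url, str(target))
    return target


def tesseract_command(image: Path, output_base: Path, tessdata_dir: Path) -> List[str]:
    """Build the tesseract command line for Akkadian with the LSTM engine."""
    return [
        "tesseract",
        str(image),
        str(output_base),  # tesseract appends ".txt" to this base name
        "--tessdata-dir", str(tessdata_dir),
        "-l", "akk",
        "--oem", "1",  # 1 = LSTM engine only
    ]


def run_ocr(image: Path, tessdata_dir: Path) -> Path:
    """Fetch the model if needed, run tesseract, and return the text file path."""
    fetch_model(tessdata_dir)
    output_base = image.with_suffix("")
    subprocess.run(tesseract_command(image, output_base, tessdata_dir), check=True)
    return output_base.with_suffix(".txt")
```

Calling run_ocr(Path("tablet.png"), Path("tessdata")) would download the model on first use and write the recognised text next to the image. Keeping the command construction in its own function makes it easy to inspect or adjust the arguments before anything is executed.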
This is Akkadian langdata, with text corpus created from contents of ORACC (http://oracc.museum.upenn.edu/).