tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

Add Akkadian langdata #150

Closed wincentbalin closed 3 years ago

wincentbalin commented 4 years ago

This is Akkadian langdata, with text corpus created from contents of ORACC (http://oracc.museum.upenn.edu/).

stweil commented 3 years ago

Do you have trained models for Akkadian? If yes: Are they for the legacy OCR engine or for the newer one which uses a neural network, and are they already used somewhere?

wincentbalin commented 3 years ago

Indeed, I have trained models: one for Tesseract 3.04, another for Tesseract 4 (trained with the same recipe that @Shreeshrii used in tesstrain-akk, for 1 million epochs), and a third that is currently training for Tesseract 4 with the pythonised tesstrain toolset for 2 million epochs, which I will evaluate once it finishes.

Once these models are trained, I will try to integrate them into the Text Fairy app, either by kindly asking @renard314 or by forking the app and adding the model myself. Apart from that, these models are not used anywhere yet.

renard314 commented 3 years ago

The latest versions of Text Fairy are no longer open source, but I'll gladly add the model once it's ready.

wincentbalin commented 3 years ago

@renard314, which model does the current version of Text Fairy need, Tesseract 3 or Tesseract 4?

renard314 commented 3 years ago

> @renard314, which model does the current version of Text Fairy need, Tesseract 3 or Tesseract 4?

Text Fairy uses Tesseract 5, so I need an LSTM model (preferably integerized, like the models in tessdata_fast).

wincentbalin commented 3 years ago

Dear @stweil , dear @renard314 ,

as they say in Germany, "Was lange währt, wird endlich gut" (roughly: "What takes a long time finally turns out well"). Until recently I trained and fine-tuned multiple versions of .traineddata files, both for the NN-based and for the legacy Tesseract engine. All of them were trained with the langdata from this pull request, using 9 different fonts.

I attach a .zip archive containing the files with the best parameters: akk-traineddata.zip. The filenames indicate which engine each file was trained for and whether it is the best or the fast version.

Feel free to test them! If the results are sufficient, I would like to see this pull request merged into this repository.

stweil commented 3 years ago

Thank you for your contribution and your patience.

I applied your changes to https://github.com/tesseract-ocr/langdata_lstm/ which is used for Tesseract 4 and newer.

Maybe you can make a pull request there to add akk.unicharset which is still missing.

stweil commented 3 years ago

Sorry, I just saw that you also trained a legacy model, so merging your contribution to langdata is also reasonable.

stweil commented 3 years ago

@wincentbalin, can you describe all steps required for the training which you have done, perhaps in the Wiki? That would help with future enhancements, fixes or updates.

wincentbalin commented 3 years ago

> @wincentbalin, can you describe all steps required for the training which you have done, perhaps in the Wiki?

@stweil, do you mean the tessdoc repository or another Wiki?

Also, to which repositories should each of the .traineddata files from https://github.com/tesseract-ocr/langdata/pull/150#issuecomment-885940792 go?

stweil commented 3 years ago

> do you mean the tessdoc repository or another Wiki?

I suggest using https://github.com/tesseract-ocr/tesstrain/wiki.

wincentbalin commented 3 years ago

> > do you mean the tessdoc repository or another Wiki?
>
> I suggest using https://github.com/tesseract-ocr/tesstrain/wiki.

It seems I cannot edit that wiki. Do I need certain permissions to do so?

stweil commented 3 years ago

Sorry, you are right. I fixed the settings, so editing should now work for you.

wincentbalin commented 2 years ago

@stweil, I've recorded the training steps for Akkadian here. Would you like to give it a look?

wincentbalin commented 1 year ago

@renard314 Finally! You can add Akkadian language OCR to TextFairy. The URL to the fast LSTM model is https://github.com/tesseract-ocr/tessdata_contrib/raw/main/akk/fast/akk.traineddata .
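
For anyone who wants to try the model outside Text Fairy, here is a minimal Python sketch of how it could be fetched and passed to the Tesseract command line. It is only an illustration under a few assumptions: the helper names (`model_path`, `fetch_model`, `tesseract_command`), the local `models` directory and the image name `tablet.png` are all hypothetical, and a `tesseract` binary (4 or newer, since this is an LSTM model) is assumed to be on the PATH. The language code `akk` is taken from the file name `akk.traineddata`.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL of the fast LSTM model, as given in the comment above.
MODEL_URL = (
    "https://github.com/tesseract-ocr/tessdata_contrib/"
    "raw/main/akk/fast/akk.traineddata"
)


def model_path(tessdata_dir):
    """Return where akk.traineddata is expected inside tessdata_dir."""
    return Path(tessdata_dir) / "akk.traineddata"


def fetch_model(tessdata_dir):
    """Download the model into tessdata_dir unless it is already there."""
    target = model_path(tessdata_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        urlretrieve(MODEL_URL, str(target))
    return target


def tesseract_command(image, output_base, tessdata_dir):
    """Build the argument list for a tesseract run with the akk model.

    --tessdata-dir points Tesseract at the directory holding the model,
    and -l selects the language by the file's base name.
    """
    return [
        "tesseract",
        str(image),
        str(output_base),
        "--tessdata-dir",
        str(tessdata_dir),
        "-l",
        "akk",
    ]
```

Calling `fetch_model("models")` and then `subprocess.run(tesseract_command("tablet.png", "out", "models"), check=True)` would be expected to write the recognized text to `out.txt`; I have not included any sample output here, since recognition quality depends on the input image.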