Add Old Persian language

tesseract-ocr / tessdata

Trained models with fast variant of the "best" LSTM models + legacy models

Apache License 2.0

Add Old Persian language #184

Closed Melanee-Melanee closed 1 month ago

Melanee-Melanee commented 1 month ago

Dear manager

I am an AI developer and have recently trained a new Tesseract language model for the Old Persian language. My new model (op.traineddata) works properly for Old Persian, and I have published it in my GitHub repository:

https://github.com/Melanee-Melanee/Old-Persian-Cuneiform-OCR

Additionally, it would be my honor to contribute my newly trained language model to your repository so that it is available to other developers. To test my model, you can use these custom Old Persian images:

https://github.com/Melanee-Melanee/Old-Persian-Cuneiform-OCR/tree/master/other/custom%20images
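For readers who want to try the model on those images, here is a minimal sketch of how a Tesseract command line for a custom model could be assembled. It assumes the `tesseract` CLI is installed and that the model file has been downloaded locally; the helper name `build_tesseract_command` and the image name `sample.png` are hypothetical and do not come from the repository. It relies only on the standard `--tessdata-dir` and `-l` options.

```python
from pathlib import Path


def build_tesseract_command(image, traineddata, output_base="stdout"):
    """Build a tesseract CLI call that loads a custom .traineddata file.

    Hypothetical helper for illustration; it only assembles the argument
    list and does not run Tesseract itself.
    """
    model = Path(traineddata)
    if model.suffix != ".traineddata":
        raise ValueError(f"expected a .traineddata file, got {model.name!r}")
    # Tesseract selects a model by its file name without the extension,
    # so myLang.traineddata is loaded with "-l myLang".
    lang = model.stem
    return [
        "tesseract",
        str(image),
        output_base,  # "stdout" prints the recognized text to the terminal
        "--tessdata-dir", str(model.parent),
        "-l", lang,
    ]


# Example usage (assumes tesseract is on PATH and the files exist):
#   import subprocess
#   cmd = build_tesseract_command("sample.png", "tesseract_old_persian/myLang.traineddata")
#   subprocess.run(cmd, check=True)
```

Because the language code is taken from the file name, renaming the model file also changes the value that has to be passed to `-l`.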

Moreover, I have published a new paper about my model:

https://www.researchgate.net/publication/382528886_Translating_Old_Persian_cuneiform_by_artificial_intelligence_AI

I hope my newly uploaded model (op.traineddata) can be merged into your repository.

Sincerely

Melanee

zdenop commented 1 month ago

Non-Google (user-contributed) training data has its own repository: https://github.com/tesseract-ocr/tessdata_contrib.

Please take a look there for inspiration on how your PR should be structured.

stweil commented 1 month ago

@Melanee-Melanee, tessdata_contrib already contains a model for cuneiform (akk.traineddata). But you write in your paper that Tesseract is not the right choice for handwritten scripts, and I agree. You did not mention opr.traineddata in your paper. Where do you describe this model, and where exactly can it be found? I also suggest fixing the typo "tessearct" (for example tessearct_old_persian) in your paper and repository (even in filenames).

Melanee-Melanee commented 1 month ago

Thanks a lot @stweil

Akkadian cuneiform is different from Old Persian cuneiform. Cuneiform inscriptions come in several types, such as Sumerian, Akkadian, Babylonian, Assyrian, Elamite, Hittite, Urartian, and Old Persian, and each of them is a unique language.

The name of my new Tesseract model on my GitHub is myLang.traineddata instead of op.traineddata. Shall I rename it? You can find my model here:

https://github.com/Melanee-Melanee/Old-Persian-Cuneiform-OCR/blob/master/tesseract_old_persian/myLang.traineddata

Besides, I have corrected my spelling error "tessearct". Thank you for letting me know.

Melanee-Melanee commented 1 month ago

Thank you @zdenop

So you mean I must submit my newly trained data to https://github.com/tesseract-ocr/tessdata_contrib ?

Ok, I will.

zdenop commented 1 month ago

So you mean I must submit my newly trained data to https://github.com/tesseract-ocr/tessdata_contrib ?

Yes, please. And please also provide additional information about the model (is it a best, fast, or legacy model?). See how others did it.
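One way to start answering the legacy-versus-LSTM part of that question is to look at which components a .traineddata file contains. The sketch below is an illustration, not something from this thread: it assumes the listing was produced with `combine_tessdata -d <file>` and that each component appears on a line of the form `index:name:size=..., offset=...`; the function names are hypothetical. Component names alone do not distinguish a best model from a fast one, so that part still has to be stated by the author.

```python
import re

# Component names used by the legacy (non-LSTM) character classifier.
LEGACY_PARTS = {"inttemp", "pffmtable", "normproto"}

# Assumed line format of a component entry: "17:lstm:size=401636, offset=192"
ENTRY = re.compile(r"^\s*\d+:([A-Za-z-]+):size=\d+")


def component_names(listing):
    """Collect component names from a traineddata directory listing."""
    names = set()
    for line in listing.splitlines():
        match = ENTRY.match(line)
        if match:
            names.add(match.group(1))
    return names


def describe_model(listing):
    """Report whether the listing shows legacy and/or LSTM components."""
    names = component_names(listing)
    has_lstm = "lstm" in names
    has_legacy = LEGACY_PARTS <= names  # all three legacy parts present
    if has_lstm and has_legacy:
        return "legacy + LSTM"
    if has_lstm:
        return "LSTM only"
    if has_legacy:
        return "legacy only"
    return "unknown"
```

The listing text can be captured from the command's output and passed to `describe_model` as a single string.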

Melanee-Melanee commented 1 month ago

@zdenop @stweil I have submitted my new pull request to https://github.com/tesseract-ocr/tessdata_contrib.

Please check it. Thanks a lot for your collaboration.