tesseract-ocr / tessdata

Trained models with fast variant of the "best" LSTM models + legacy models
Apache License 2.0

Added best traineddatas for 4.00 alpha #62

Open amitdo opened 7 years ago

amitdo commented 7 years ago

https://github.com/tesseract-ocr/tessdata/tree/3a94ddd47be0

@theraysmith, how should we present those 'best' files to our users? https://github.com/tesseract-ocr/tesseract/wiki/Data-Files

Do you plan to push more updates to the best directory and/or to the root dir in the next few weeks?

stweil commented 7 years ago

The new files include two files for German Fraktur: best/Fraktur.traineddata and best/frk.traineddata. According to my first tests, both are better than the old deu_frak.traineddata and much better than the old frk.traineddata. There is no clear winner between the two new files: in some cases -l Fraktur gives better results, in other cases -l frk is better. Even a 3.05-based Fraktur model is still better for some words, but generally the new LSTM-based models win the challenge.

Ray, it would be interesting to know the training differences of the two new Fraktur traineddata files. Did they use different fonts / training material / dictionaries?

amitdo commented 7 years ago

Related comment from Ray: https://github.com/tesseract-ocr/tesseract/issues/995#issuecomment-314609036

2 parallel sets of tessdata: "best" and "fast". "Fast" will exceed the speed of legacy Tesseract in real time, provided you have the required parallelism components, and in total CPU it will be only slightly slower for English. It will be way faster for most non-Latin languages, while being <5% worse than "best". Only "best" will be retrainable, as "fast" will be integer.
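
In practice, selecting between the two sets would look something like the following sketch; the directory names are illustrative, since any location works with --tessdata-dir:

    # pick the float ('best') models for accuracy / retraining
    tesseract page.png out --tessdata-dir /usr/share/tessdata_best -l eng
    # or the integer ('fast') models for speed
    tesseract page.png out --tessdata-dir /usr/share/tessdata_fast -l eng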

amitdo commented 7 years ago

My guess is that the upper case traineddata files are for 'one script multi langs'.

theraysmith commented 7 years ago

I'm currently working on the training documentation, before committing more code, so as not to leave training broken for more than maybe an hour or so. Here's a quick bullet list of what's going on:

  • Initial capitals indicate the one model for all languages in that script, so e.g. Latin is all Latin-based languages except vie, which has its own Vietnamese model. Most of the script models include English training data as well as the script, but not Cyrillic, as that would have a major ambiguity problem. Devanagari is hin+san+mar+nep+eng, and Fraktur is basically a combination of all the Latin-based languages that have an 'old' variant, etc. I would be interested to hear more feedback on the script models, as Stefan already provided for Fraktur (see the sketch after this list).
  • The tessdata directory doesn't have to be called tessdata any more, so I was thinking of a structure that allows maybe best, fast and legacy as separate directories or repos.
  • I noticed git complain about the size of Latin.traineddata (~100MB), but didn't yet follow the pointer to Git LFS.
  • The current code can run the 'best' models and the existing models, but incremental and fine-tuning training will be tied to 'best' with a future commit/push (due to a switch to ADAM and the move of the unicharset/recoder).
  • Fine tuning/incremental training will not be possible from the 'fast' models, as they are 8-bit integer. It will be possible to convert a tuned 'best' to integer to make it faster, but some of the speed in 'fast' will come from the smaller model.
  • It will be possible to add new characters by fine tuning! I got that working yesterday, and just need to finish updating the documentation.
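
As a sketch, invoking one of the capitalized script models works the same way as invoking a language model, assuming Devanagari.traineddata is installed in the data directory (paths illustrative):

    # 'Devanagari' is the combined hin+san+mar+nep+eng model described above
    tesseract hindi_page.png out --tessdata-dir ./tessdata/best -l Devanagari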

Shreeshrii commented 7 years ago

Ray,

Please see Devanagari feedback at https://github.com/tesseract-ocr/tessdata/issues/66 and https://github.com/tesseract-ocr/tessdata/issues/64


amitdo commented 7 years ago

New traineddata files: Arabic.traineddata Armenian.traineddata Bengali.traineddata Canadian_Aboriginal.traineddata Cherokee.traineddata Cyrillic.traineddata Devanagari.traineddata Ethiopic.traineddata Fraktur.traineddata Georgian.traineddata Greek.traineddata Gujarati.traineddata Gurmukhi.traineddata HanS.traineddata HanS_vert.traineddata HanT.traineddata HanT_vert.traineddata Hangul.traineddata Hangul_vert.traineddata Hebrew.traineddata Japanese.traineddata Japanese_vert.traineddata Kannada.traineddata Khmer.traineddata Lao.traineddata Latin.traineddata Malayalam.traineddata Myanmar.traineddata Oriya.traineddata Sinhala.traineddata Syriac.traineddata Tamil.traineddata Telugu.traineddata Thaana.traineddata Thai.traineddata Tibetan.traineddata Vietnamese.traineddata bre.traineddata chi_sim_vert.traineddata chi_tra_vert.traineddata cos.traineddata div.traineddata fao.traineddata fil.traineddata fry.traineddata gla.traineddata hye.traineddata jpn_vert.traineddata kor_vert.traineddata kur_ara.traineddata ltz.traineddata mon.traineddata mri.traineddata oci.traineddata que.traineddata snd.traineddata sun.traineddata tat.traineddata ton.traineddata yor.traineddata

stweil commented 7 years ago

It will be possible to add new characters by fine tuning!

That's great! Then I can add missing characters (like paragraph for Fraktur) myself. Thank you, Ray.
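
A rough sketch of such a fine-tuning run with the lstmtraining tool from the 4.0 training flow; paths, file names, and the iteration count are illustrative, and adding a character like § additionally requires a rebuilt traineddata whose unicharset contains it:

    # extract the LSTM component from the float ('best') model
    combine_tessdata -e tessdata/best/frk.traineddata frk.lstm
    # fine-tune on lines containing the new character;
    # --traineddata holds the extended unicharset, --old_traineddata the original
    lstmtraining --continue_from frk.lstm \
      --old_traineddata tessdata/best/frk.traineddata \
      --traineddata train/frk/frk.traineddata \
      --model_output output/frk_section \
      --train_listfile train/frk.training_files.txt \
      --max_iterations 3600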

stweil commented 7 years ago

Ray, issue #65 lists two regressions for Fraktur (missing §, ß/B confusion in word list).

theraysmith commented 7 years ago

FYI: The wordlists are generated files, so it isn't a good idea to modify them, as the modifications will likely get overwritten in a future training. To help prevent the ß/B confusion, the words that you want to lose from the wordlists need to go in langdata/lang/lang.bad_words.

Since I spotted the edits to the deu/frk wordlists before overwriting them, I will put the deleted words in the bad_words lists, so my next run of training will not contain them. Looks like I also need to add § to the desired_characters.
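
For illustration, the bad_words mechanism is just a plain-text list of words to exclude, one per line; a hypothetical langdata/frk/frk.bad_words could contain entries such as:

    auBer
    eBay
    PCMCIA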

I have not yet committed the new wordlists, desired_characters etc, since I discovered a bug. The RTL languages have their wordlists reversed, which doesn't make sense. They should be plain text readable by someone who knows the language, and the reversal should be done before the words are converted to dawgs. I have the required change in the code already, but haven't yet run the synthetic data generation.
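
The conversion step in question is the wordlist2dawg tool; a minimal sketch, with file names illustrative, of how a plain-text, human-readable wordlist becomes the dawg packed into the traineddata (with any RTL reversal meant to happen inside this step, not in the stored list):

    # pack a readable wordlist into the dawg that goes into the traineddata
    wordlist2dawg ara.wordlist ara.lstm-word-dawg ara.unicharset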


stweil commented 7 years ago

The new files can be installed locally in tessdata/best and used like this: tesseract ... -l best/eng. That way we can preserve the current directory structure (also when fast is added), and there is no need to rename best/eng.traineddata to best_eng.traineddata in local installations.
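
A minimal sketch of that setup, assuming a typical tessdata location (the path is illustrative):

    # keep 'best' models in a subdirectory of the normal tessdata directory
    mkdir -p /usr/local/share/tessdata/best
    cp eng.traineddata /usr/local/share/tessdata/best/
    # select them with a relative name, no renaming needed
    tesseract page.png out -l best/eng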

I assume that older versions of Tesseract also work with hierarchies of languages. That offers new possibilities: the rather lengthy list of languages could be organized in folders, for example for Latin-based languages, Indic languages, etc.

Of course tesseract --list-langs should be improved to search recursively for language files.
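
Until then, something like this one-liner (GNU find, with TESSDATA_PREFIX assumed to point at the tessdata directory) approximates a recursive --list-langs:

    find "$TESSDATA_PREFIX" -name '*.traineddata' -printf '%P\n' | sed 's/\.traineddata$//'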

Shreeshrii commented 7 years ago

used like this: tesseract ... -l best/eng

That is great. I was using --tessdata-dir ../../../tessdata/best, but this is much easier :-)
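
Side by side, the two invocations being compared (file names illustrative; the second form relies on best/ living inside the default tessdata directory):

    tesseract page.png out --tessdata-dir ../../../tessdata/best -l eng
    tesseract page.png out -l best/eng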

Shreeshrii commented 7 years ago

FYI: The wordlists are generated files, so it isn't a good idea to modify them, as the modifications will likely get overwritten in a future training.

@theraysmith

The training wiki changes say that new traineddata can be built by providing wordlists, but here you mention that they are generated.

Can you explain whether user-provided wordlists override the ones in the traineddata, and how that would impact recognition?

I haven't tried training with new code yet.
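
For reference, the new training flow does accept a user wordlist when assembling the starter traineddata; a sketch with combine_lang_model as described in the training wiki (paths illustrative):

    combine_lang_model \
      --input_unicharset train/hin.unicharset \
      --script_dir langdata \
      --words langdata/hin/hin.wordlist \
      --lang hin \
      --output_dir train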

PS: I hope you have seen the language-specific feedback provided under issues in tessdata.

amitdo commented 7 years ago

https://github.com/tesseract-ocr/docs/blob/master/das_tutorial2016/7Building%20a%20Multi-Lingual%20OCR%20Engine.pdf (see page 8, "T-LSTM Training")

amitdo commented 6 years ago

http://usir.salford.ac.uk/44370/1/PID4978585.pdf ICDAR2017 Competition on Recognition of Early Indian Printed Documents – REID2017

Shreeshrii commented 6 years ago

@theraysmith commented on Aug 3, 2017

I have the required change in the code already, but haven't yet run the synthetic data generation.

I will put the deleted words in the bad_words lists, so my next run of training will not contain them.

@theraysmith @jbreiden Can you confirm that the traineddata files in the GitHub repo are the result of this improved training?

stweil commented 6 years ago

They aren't, because they were added in July 2017 – that is, before that comment was made.

Shreeshrii commented 6 years ago

What about tessdata_fast?

Referenced commit: "Initial import to github (on behalf of Ray)", committed by Jeff Breidenbach on Sep 15, 2017

stweil commented 6 years ago

tessdata_fast changed the LSTM model, but not the word list and other components. I just looked for B/ß confusions. While deu.traineddata looks good (no B/ß confusions), frk.traineddata contains lots of them, for example auBer instead of außer. frk.traineddata also contains lots of words which typically are not printed in Fraktur. Neither eBay nor PCMCIA are words which I would expect in old books or newspapers.
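
Those entries can be inspected by unpacking the traineddata and turning the packed dawg back into readable words; a sketch with the stock tools (output names follow combine_tessdata's conventions):

    # unpack all components of the model
    combine_tessdata -u frk.traineddata frk.
    # convert the word dawg back into a plain wordlist
    dawg2wordlist frk.lstm-unicharset frk.lstm-word-dawg frk.wordlist
    grep auB frk.wordlist    # surfaces entries like 'auBer'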

ghost commented 6 years ago

@theraysmith, can you update langdata/ara?

kmprerna commented 5 years ago

New traineddata files: Arabic.traineddata … yor.traineddata (full list in amitdo's comment above)

Where can we download these traineddata files for better accuracy?

Shreeshrii commented 5 years ago

https://github.com/tesseract-ocr/tessdata_best

https://github.com/tesseract-ocr/tessdata_best/tree/master/script

kmprerna commented 5 years ago

When I use this traineddata on a Hindi text image, it takes a long time to extract the text and does not give a 100% accurate result. How can I reduce the response time?