tesseract-ocr / tessdoc

Tesseract documentation
https://tesseract-ocr.github.io/tessdoc/

Clarify language support quality status #83

Open eyalroz opened 2 years ago

eyalroz commented 2 years ago

The README.md says Tesseract "supports over 100 languages out of the box". But which languages? And how good is the out-of-the-box support known to be for each of them?

It would be helpful if a separate file (or wiki page) detailed this information, to the extent possible.

stweil commented 2 years ago

See https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html. All work on Tesseract is currently done by volunteers, so you are invited to find the answers to your questions and document them.
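For a specific install, the set of available languages can also be checked locally. The following is a minimal sketch, not an official tool: it assumes a `tesseract` binary on the PATH and relies on the `--list-langs` command-line option. The helper names `parse_list_langs` and `installed_languages` are made up for illustration, and the header format being skipped is an assumption about the CLI's text output. It only reports which traineddata files are installed, not how well each language is recognized.

```python
import shutil
import subprocess


def parse_list_langs(output: str) -> list[str]:
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line (assumed to start with "List of").
        if not line or line.startswith("List of"):
            continue
        langs.append(line)
    return sorted(langs)


def installed_languages(binary: str = "tesseract") -> list[str]:
    """Return the languages the local binary reports, or [] if it is not found."""
    if shutil.which(binary) is None:
        return []
    result = subprocess.run(
        [binary, "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Assumption: the list is on stdout; fall back to stderr if stdout is empty.
    text = result.stdout if result.stdout.strip() else result.stderr
    return parse_list_langs(text)


if __name__ == "__main__":
    for code in installed_languages():
        print(code)
```

The parsing is kept separate from the subprocess call so it can be tried on saved output without Tesseract installed.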

eyalroz commented 2 years ago

@stweil : Can you linkify the "100 languages" sentence in the README.md to point to that page?

tooomm commented 1 year ago

@eyalroz I went ahead and proposed the change in the tesseract repo: https://github.com/tesseract-ocr/tesseract/pull/4027

I also think it would be very helpful, even though the list itself has no information on languages in v5 yet.

amitdo commented 1 year ago

> Even though the list itself has no information on languages in v5 yet.

There was no update for v5. All the v4 data files should work with Tesseract 5.x.

tooomm commented 1 year ago

> There was no update for v5. All the v4 data files should work with Tesseract 5.x.

That's at least not obvious from the table.

The information can be found in other parts of the docs, true. Users can easily miss it though.

> Language model traineddata files same as listed above for version 4.0.0 can be used with Tesseract 5.x.x.

amitdo commented 1 year ago

https://github.com/tesseract-ocr/docs/blob/main/das_tutorial2016/7Building%20a%20Multi-Lingual%20OCR%20Engine.pdf

https://arxiv.org/pdf/2202.13274.pdf