tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

Add grc subdirectory with component files #78

Closed: Arithmeticus closed this issue 6 years ago

Arithmeticus commented 7 years ago

There is a grc.traineddata file, but no corresponding /grc subdirectory with the build files. Could that be supplied? Or is there a safe way to split a .traineddata file into its constituent parts?

Shreeshrii commented 7 years ago

See https://ancientgreekocr.org/

Arithmeticus commented 7 years ago

I began my quest from that site. But the @nickjwhite git repos lack the requisite Tesseract files that are in all the other langdata subdirectories. Even if the files are somehow at ancientgreekocr.org, the tesseract langdata repo should have a /grc subdirectory populated with the build files.

Shreeshrii commented 7 years ago

See Pull Request by Nick White - https://github.com/tesseract-ocr/langdata/pull/19

and

https://groups.google.com/forum/#!topic/tesseract-dev/Iqsa7y2g3sk

nickjwhite commented 7 years ago

Thanks for answering this @Shreeshrii.

@Arithmeticus, note that the files in the langdata repository are designed to be used as input to tesstrain.sh from tesseract's training/ directory, which is why some of the files you may be expecting, such as .inttemp, aren't present. The same is true of all of the langdata directories.
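
To illustrate how a langdata directory feeds into tesstrain.sh, here is a minimal sketch that only assembles the command line and does not run anything. The helper name `build_tesstrain_command`, the script location, and the directory names are assumptions made for this example; check the `--lang`, `--langdata_dir`, `--tessdata_dir` and `--output_dir` options against the tesstrain.sh shipped with your Tesseract version before relying on them.

```python
import shlex


def build_tesstrain_command(lang, langdata_dir, tessdata_dir, output_dir,
                            script="training/tesstrain.sh"):
    """Return the argument list for a tesstrain.sh run (nothing is executed).

    All paths are placeholders; the script location is an assumption and
    depends on where your Tesseract source checkout lives.
    """
    return [
        script,
        "--lang", lang,                   # language code, e.g. "grc"
        "--langdata_dir", langdata_dir,   # checkout of the langdata repository
        "--tessdata_dir", tessdata_dir,   # existing tessdata directory
        "--output_dir", output_dir,       # where the generated files are written
    ]


if __name__ == "__main__":
    cmd = build_tesstrain_command("grc", "langdata", "tessdata", "grc_out")
    # Print the command in a shell-safe form so it can be reviewed first.
    print(shlex.join(cmd))
```

The printed command can be inspected and then passed to `subprocess.run` once the paths match a real checkout.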

theraysmith commented 6 years ago

There will be grc source files in the next release of langdata. It will be missing desired_characters and forbidden_characters unless you would like to contribute some...

Shreeshrii commented 6 years ago

https://github.com/tesseract-ocr/langdata/tree/master/grc

PR by @nickjwhite has been merged.

Shreeshrii commented 6 years ago

@zdenop This can be closed.