tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

Why does eng.training_text have only 72 lines? #94

Closed xiaomaxiao closed 6 years ago

xiaomaxiao commented 6 years ago

That's not enough for training the 4.0 LSTM model.

Shreeshrii commented 6 years ago

Langdata has not been updated for 4.0.

You can use the current files for fine-tuning, but not for training from scratch.
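For readers who want to try the fine-tuning route, here is a minimal sketch (not from the original reply) that builds the two command lines involved: extracting the LSTM component from an existing traineddata file with `combine_tessdata -e`, then continuing training with `lstmtraining --continue_from`. The file names, the `out` directory, the `finetuned` output prefix, and the iteration count are placeholder assumptions. The training list file must already exist, produced by your own data generation step. Fine-tuning is normally done from the float models in tessdata_best.

```python
import subprocess
from pathlib import Path


def finetune_commands(base_traineddata, work_dir, list_file, max_iterations=400):
    """Return the command lines for a fine-tuning run as argument lists.

    All paths are hypothetical placeholders; nothing is executed here.
    """
    work = Path(work_dir)
    lang = Path(base_traineddata).stem  # "eng" for "eng.traineddata"
    lstm = work / f"{lang}.lstm"

    # Step 1: extract the LSTM model from the existing traineddata file.
    extract = ["combine_tessdata", "-e", str(base_traineddata), str(lstm)]

    # Step 2: continue training from that model on the new training data.
    train = [
        "lstmtraining",
        "--model_output", str(work / "finetuned"),
        "--continue_from", str(lstm),
        "--traineddata", str(base_traineddata),
        "--train_listfile", str(list_file),
        "--max_iterations", str(max_iterations),
    ]
    return [extract, train]


def run(commands):
    """Execute the commands in order; requires the Tesseract training tools."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them, so this works as a dry run.
    for cmd in finetune_commands("eng.traineddata", "out", "eng.training_files.txt"):
        print(" ".join(cmd))
```

Call `run(finetune_commands(...))` only once the Tesseract 4 training tools are installed and the listed files exist.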

xiaomaxiao commented 6 years ago

@Shreeshrii Thanks.

Layneww commented 6 years ago

@Shreeshrii Hi! I'm trying to replicate the LSTM model in TensorFlow, but the problem is the language data. Is there any update on the langdata for 4.0, or can I generate the same training data myself based on the current information? Thanks!

Shreeshrii commented 6 years ago

@jbreiden is the right person to ask regarding updated langdata. As far as I know it has not been updated for 4.0.

There is no way to get the complete info from the existing files. You can unpack the traineddata file from tessdata_fast, but that will only give you a wordlist, not the training text. And you won't know which fonts were used.
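As an illustration of the unpacking step mentioned above, here is a minimal sketch (not part of the original reply) that builds the two commands: `combine_tessdata -u` to split the traineddata file into its components, and `dawg2wordlist` to convert the LSTM word DAWG back into a plain wordlist. The `unpacked` directory and the output file name are placeholder assumptions, and the component names assume an LSTM-only file such as those in tessdata_fast.

```python
from pathlib import Path


def unpack_wordlist_commands(traineddata, work_dir):
    """Return command lines that recover the wordlist from a traineddata file.

    Paths are hypothetical placeholders; nothing is executed here.
    """
    lang = Path(traineddata).stem                # "eng" for "eng.traineddata"
    prefix = str(Path(work_dir) / lang) + "."    # e.g. "unpacked/eng."

    # Step 1: unpack every component, written as <prefix><component-name>.
    unpack = ["combine_tessdata", "-u", str(traineddata), prefix]

    # Step 2: turn the word DAWG into text, using the matching unicharset.
    wordlist = [
        "dawg2wordlist",
        prefix + "lstm-unicharset",
        prefix + "lstm-word-dawg",
        prefix + "wordlist.txt",
    ]
    return [unpack, wordlist]


if __name__ == "__main__":
    # Print the commands as a dry run; pass each list to subprocess.run to execute.
    for cmd in unpack_wordlist_commands("eng.traineddata", "unpacked"):
        print(" ".join(cmd))
```

The result is only a list of dictionary words. It carries no sentences, frequencies, or font information, which is why it cannot stand in for the original training text.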