xiaomaxiao closed this issue 7 years ago
Langdata has not been updated for 4.0.
You can use the current files for fine-tuning, but not for training from scratch.
On 06-Oct-2017 11:14 AM, "xiaomaxiao" notifications@github.com wrote:
it's not enough for training lstm 4.0
@Shreeshrii thanks.
@Shreeshrii Hi! I'm trying to replicate the LSTM model in TensorFlow, but I'm stuck on the language data. Is there any update on the langdata for 4.0, or can I generate the same training data myself from what is currently available? Thanks!
@jbreiden is the right person to ask regarding updated langdata. As far as I know it has not been updated for 4.0.
There is no way to recover the complete information from the existing files. You can unpack the traineddata file from tessdata_fast, but that will only give you a wordlist, not the training text. You also won't know which fonts were used.
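In case it helps, here is a rough sketch of the unpacking step, wrapped in Python. It assumes the Tesseract training tools `combine_tessdata` and `dawg2wordlist` are installed and on your PATH, and that the model contains the `lstm-unicharset` and `lstm-word-dawg` components (that is my assumption for a tessdata_fast model, so check what the unpack step actually writes out). The function names and the `wordlist.txt` output name are just placeholders I made up.

```python
import subprocess
from pathlib import Path


def build_unpack_commands(traineddata, out_dir):
    """Build the two command lines: unpack the model, then dump the word DAWG as text."""
    traineddata = Path(traineddata)
    out_dir = Path(out_dir)
    lang = traineddata.stem  # e.g. "eng" for eng.traineddata
    # combine_tessdata appends each component name to this prefix
    prefix = str(out_dir / f"{lang}.")
    unpack = ["combine_tessdata", "-u", str(traineddata), prefix]
    # Assumed component names; verify against the files the unpack step produces
    wordlist = [
        "dawg2wordlist",
        f"{prefix}lstm-unicharset",
        f"{prefix}lstm-word-dawg",
        str(out_dir / f"{lang}.wordlist.txt"),
    ]
    return [unpack, wordlist]


def unpack_wordlist(traineddata, out_dir):
    """Run both commands; raises CalledProcessError if either tool fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_unpack_commands(traineddata, out_dir):
        subprocess.run(cmd, check=True)
```

Calling `unpack_wordlist("eng.traineddata", "unpacked")` should leave the extracted components and a plain-text wordlist in `unpacked/`. As noted above, this gets you the wordlist only, not the training text or the fonts.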