tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Self training data of Chinese language with jTessBoxEditorFX #3077

Open FounderBox opened 3 years ago

FounderBox commented 3 years ago

Environment
Tesseract Version: [image]
Commit Number: None
Platform: Windows 10, x64

Current Behavior: I want to recognize the text in the picture below. [image]

But the result contained two wrong Chinese words, so I used jTessBoxEditorFX to correct them, as shown below. [image]

Then I generated a new mylang.traineddata file and added it to my tessdata directory. [image]

If I use only mylang as the language, it works fine: the two wrong words are fixed. [image] [image]

But if I use multiple languages, chi_sim+mylang, the errors come back. [image] [image]

And if I reverse the order, mylang+chi_sim, the whole result is wrong. [image] [image]
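For anyone trying to reproduce this, the three runs above could be scripted roughly as follows. This is only a sketch: `sample.png` is a placeholder image name, `build_command` and `run_all` are hypothetical helper names, and it assumes the `tesseract` executable is on the PATH with both `chi_sim.traineddata` and `mylang.traineddata` in the tessdata directory.

```python
import shutil
import subprocess
from pathlib import Path

# The three language settings compared in this report.
LANG_CONFIGS = ["mylang", "chi_sim+mylang", "mylang+chi_sim"]


def build_command(image_path, langs, tessdata_dir=None):
    """Build a tesseract CLI call that prints the recognized text to stdout."""
    cmd = ["tesseract", image_path, "stdout", "-l", langs]
    if tessdata_dir is not None:
        # Optional: point tesseract at a specific tessdata directory.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def run_all(image_path, tessdata_dir=None):
    """Run each language setting and return {langs: recognized text}."""
    results = {}
    for langs in LANG_CONFIGS:
        proc = subprocess.run(
            build_command(image_path, langs, tessdata_dir),
            capture_output=True, text=True, encoding="utf-8", check=False,
        )
        results[langs] = proc.stdout.strip()
    return results


if __name__ == "__main__":
    # "sample.png" is a placeholder, not a file from this report.
    if shutil.which("tesseract") and Path("sample.png").exists():
        for langs, text in run_all("sample.png").items():
            print(f"--- {langs} ---\n{text}")
```

Printing the three outputs side by side makes it easy to see which words change between the single-language run and the two combined runs.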

Expected Behavior: As you can see, the two words are fixed only when I use mylang alone as the language. If I use multiple languages, the errors come back.

Is there a way to use my own trained traineddata file as a supplement to the original chi_sim.traineddata, so that I can fix all the words that chi_sim.traineddata fails to recognize? Thanks a lot! :)

FounderBox commented 3 years ago

Can anyone help? A million thanks!

FounderBox commented 3 years ago

@Shreeshrii Could you please lend a hand? Thanks.