tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

Add Filipino lang #84

Open JohnHenryGaspay opened 7 years ago

JohnHenryGaspay commented 7 years ago

It would also be good if you could support the Filipino language, spoken here in the Philippines.

JohnHenryGaspay commented 7 years ago

@theraysmith I've noticed that Filipino is not in the list of language data.

amitdo commented 7 years ago

https://github.com/tesseract-ocr/tessdata/raw/master/best/fil.traineddata

theraysmith commented 7 years ago

My training text corpus does not distinguish between fil and tgl, although they appear as distinct codes in ISO 639-2/T. For some reason that I can't remember now, the language code switched from tgl to fil in the "best" models that I pushed recently.

Does the fil language do what you want? If not, please try to explain why. You could also try Latin, which attempts to cover all Latin-based languages.

JohnHenryGaspay commented 7 years ago

@amitdo I've tried adding it to the language folders, but when I select fil as the language the app always shuts down.

JohnHenryGaspay commented 7 years ago

@theraysmith Yes, our national language here in the Philippines is Filipino (fil), and Tagalog (tgl) is the old name for it. I've tried Latin, but it's not working.

Shreeshrii commented 7 years ago

I tested just now with both best/fil and tgl (4.00.00alpha traineddata files), and both work with tesseract built from the latest GitHub code.

 tesseract fil-test.png fil-test-best-fil --oem 1 --psm 6 -l best/fil --tessdata-dir ../

 tesseract fil-test.png fil-test-tgl --oem 1 --psm 6 -l tgl --tessdata-dir ../

Files attached. To me, best/fil seems more accurate. I took a snapshot from the tgl Wikipedia page.

fil-test-tgl.txt fil-test-best-fil.txt fil-test

amitdo commented 7 years ago

> I've tried adding it to the language folders but when selecting fil as language the app always shut down.

You should try running Tesseract from the command line.
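
As a rough sketch of what that check might look like: the script below first confirms that the traineddata file is where Tesseract will look for it, then lists the languages Tesseract can see, and only then runs OCR. The `./tessdata` directory, the `lang_file_present` helper, and the `fil-test-out` output name are assumptions made for illustration; the image name and the `--oem 1 --psm 6` options are borrowed from the commands earlier in this thread.

```shell
#!/bin/sh
# Hypothetical locations; adjust to wherever fil.traineddata was saved.
TESSDATA_DIR="${TESSDATA_DIR:-./tessdata}"
IMAGE="${IMAGE:-fil-test.png}"
LANG_CODE="${LANG_CODE:-fil}"

# Helper (illustrative name): succeeds if DIR/LANG.traineddata exists.
lang_file_present() {
  [ -f "$1/$2.traineddata" ]
}

if ! lang_file_present "$TESSDATA_DIR" "$LANG_CODE"; then
  echo "missing: $TESSDATA_DIR/$LANG_CODE.traineddata"
elif ! command -v tesseract >/dev/null 2>&1; then
  echo "tesseract not found on PATH"
else
  # Show which languages Tesseract finds in that directory.
  tesseract --list-langs --tessdata-dir "$TESSDATA_DIR"
  # Run OCR; the recognized text is written to fil-test-out.txt.
  tesseract "$IMAGE" fil-test-out --oem 1 --psm 6 \
    -l "$LANG_CODE" --tessdata-dir "$TESSDATA_DIR"
fi
```

If `fil` does not show up in the `--list-langs` output, the problem is probably the file location rather than the model itself, and any error Tesseract prints on the command line should be more informative than an app that just closes.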