Special requirements for Hindi and Arabic OCR

HarshitD commented 6 years ago

Summary: I am new to Tesseract and Android Studio. I am trying to build an Android app for OCR using tess-two. With help from online resources I got it working for many languages, but not for Hindi. For Hindi, the app crashes as soon as it is opened.

Expected result: Hindi should work along with all the other languages.

Actual result: The app crashes when I add the hin.traineddata file and change the language to Hindi.

Tess-two version: tess-two:5.4.1

Android version: 7.1.2

Phone/device model: Xiaomi Redmi 4

Phone/device architecture (armeabi, armeabi-v7a, x86, mips, arm64-v8a, x86_64, mips64):

Link to training data used: https://github.com/tesseract-ocr/tessdata/tree/3.04.00

Link to image used as input: test24_hin

rmtheis commented 6 years ago

Hmm, can you try again using tess-two version 8.0.0? Hindi is working OK for me in both Tesseract and Cube modes on version 8.0.0.

HarshitD commented 6 years ago

Thanks for the reply. I tried version 8.0.0 but still get the same issue. In the app's build.gradle file, I changed the dependency to compile 'com.rmtheis:tess-two:8.0.0'. I am using the code from https://github.com/imperialsoup/SimpleTesseractExample directly. Does this code need any modification to work for Hindi? I have the hin.traineddata file along with all the .cube files under the app > assets > tessdata folder.
Could you describe how to make it work for Hindi? Thanks in advance!

rmtheis commented 6 years ago

What's the error message that's printed to the device log when your app crashes?

HarshitD commented 6 years ago

Here is the error summary displayed on my Android phone: Screenshot 1, Screenshot 2. From what I have found, I think the problem is that for Hindi I also have to use the .cube files, because Tesseract 3 requires them and tess-two is built on Tesseract 3. I cannot figure out how to use these .cube files. Simply putting them in the same folder as the hin.traineddata file doesn't work.

Thanks for your help.

rmtheis commented 6 years ago

I can't reproduce the error that you're seeing. Make sure you're using the correct training data file, from the 3.04.00 tag of the tessdata project.

I get the following result for your input image when using the default settings (OEM_TESSERACT_ONLY and PageSegMode.PSM_SINGLE_BLOCK):

राहुल ने तंज कसते हुए कहा कि कि स'घ का उद्देश्य महिलाओं कं! असशक्त करना है. आरएसएस मैं महिलाओं की कोई जगह नहीं है. यथा कांई जानता हैं कि कोई महिला २55 से संबंधित हो और नेतृत्व कर रही हो माल अगर साप महात्मा गांधी की तस्वीर देखेंगे तो उनके दाई और बाई और महिलाओं कं! पाएंगे, मार आप मोहन भागवत की तस्वीर देखेंगे तो या तो दो अकेले होंगे या फिर पुरुषों से घिरे होंगे

राहुल गांधी ने कहा कि अगा हम अंदर की रस्ता में आते है तो हम जीएत्तटी की संरचना में बदलाव लाएंगै और इसे काफी सरल बनाएंगे. उन्होंने कहा कि कांम्रेरर में सबसे अह्म रूप से इस बात का संतुलन रखा गया है कि महिला और पुरुषों की संख्या मैं ज्यादा अंतर नहीं अम मैं मेघालय में पाती की महिलाओं की आमंहिरत करना चाहूगा कि दो पार्टी मैं शामिल हाँ त्ताब्सि हमारे षाटींमें अधिक से अधिक महिलाएं चुनी जा सकें और उन्हें नौका मिलरस्के.

HarshitD commented 6 years ago

Thanks for your reply. However, I still could not resolve the error. I have tried the training data file from here. That page also says: "For Arabic and Hindi you need both the traineddata file and the cube data files." Searching online, I found many people with the same problem, where the app crashes for Hindi and Arabic, but I found no answer anywhere. The closest suggestion was to put the cube data files in the same folder as the training data file, but that doesn't help either. Could you please tell me how you got it running for Hindi?

Thanks a lot for your help.

rmtheis commented 6 years ago

Yes, you need to install hin.* from https://github.com/tesseract-ocr/tessdata/tree/3.04.00

Thanks for reporting this issue. I've created a task (#240) for myself to improve the training data checking for Arabic and Hindi so developers get a clear error message rather than a crash when using the wrong training data files.

HarshitD commented 6 years ago

Thanks for your reply. I installed all the hin.* files from the link you provided, but the app still crashes. Could you tell me how you made it work for Hindi, or share the relevant code?

Thanks for your help.

HarshitD commented 6 years ago

The problem is solved. Thanks for your help. The problem was in the TessBaseAPI.init() call. As I am new to this, I didn't understand it earlier. After setting the engine mode to OEM_TESSERACT_ONLY, it worked.

Thanks a lot for your help.
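
For reference, the fix described above amounts to passing the engine mode explicitly when initializing the API. The sketch below is a minimal illustration, not the code from this thread. OcrApi is a hypothetical stand-in for the three-argument init method of tess-two's TessBaseAPI, so the example compiles without the Android library. The value 0 for OEM_TESSERACT_ONLY is an assumption based on the Tesseract 3 engine-mode enum.

```java
/** Hypothetical stand-in for TessBaseAPI.init(datapath, language, ocrEngineMode). */
interface OcrApi {
    boolean init(String datapath, String language, int ocrEngineMode);
}

final class HindiInit {
    // Assumed to mirror TessBaseAPI.OEM_TESSERACT_ONLY; in a real app use that constant directly.
    static final int OEM_TESSERACT_ONLY = 0;

    /**
     * Initializes OCR for Hindi with the Tesseract-only engine, which per this
     * thread needs only hin.traineddata and no Cube data files.
     * dataPath is assumed to be the directory that contains the tessdata folder.
     */
    static boolean initHindi(OcrApi api, String dataPath) {
        return api.init(dataPath, "hin", OEM_TESSERACT_ONLY);
    }
}
```

In an app, the equivalent call would be along the lines of baseApi.init(dataPath, "hin", TessBaseAPI.OEM_TESSERACT_ONLY), where baseApi is a TessBaseAPI instance.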

rmtheis commented 6 years ago

Glad you were able to solve the problem!

rmtheis commented 6 years ago

Thanks for looking into this issue. After taking a second look at this, I want to make a note here for reference.

Special requirements for Hindi and Arabic OCR

Arabic and Hindi OCR require the installation of all Cube data files when using OEM_DEFAULT.

Hindi OCR also works using OEM_TESSERACT_ONLY when the hin.traineddata file is installed. Hindi also works using OEM_CUBE_ONLY or OEM_TESSERACT_CUBE_COMBINED when the Cube data files are additionally installed.
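
The rules in this note can be turned into a pre-flight check that runs before init is called. The following is a rough sketch under stated assumptions, not part of tess-two: the class and method names are hypothetical, and Cube data files are detected only by the "hin.cube." file-name prefix. That heuristic confirms at least one Cube file is present but does not verify that the full set is installed.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

final class HindiModeCheck {

    /** True if at least one "<lang>.cube.*" file name is present (does not verify the full set). */
    static boolean hasCubeFiles(Collection<String> fileNames, String lang) {
        for (String name : fileNames) {
            if (name.startsWith(lang + ".cube.")) {
                return true;
            }
        }
        return false;
    }

    /** Engine-mode names that the note above says should work for Hindi with these files. */
    static List<String> usableHindiModes(Collection<String> fileNames) {
        List<String> modes = new ArrayList<>();
        if (!fileNames.contains("hin.traineddata")) {
            return modes; // no mode is listed without the base training data
        }
        modes.add("OEM_TESSERACT_ONLY");
        if (hasCubeFiles(fileNames, "hin")) {
            modes.add("OEM_CUBE_ONLY");
            modes.add("OEM_TESSERACT_CUBE_COMBINED");
            modes.add("OEM_DEFAULT");
        }
        return modes;
    }

    /** Convenience overload that lists an on-disk tessdata directory. */
    static List<String> usableHindiModes(File tessdataDir) {
        String[] names = tessdataDir.list(); // null if the directory is missing or unreadable
        return usableHindiModes(names == null ? Collections.<String>emptyList() : Arrays.asList(names));
    }
}
```

An app could call usableHindiModes(new File(dataPath, "tessdata")) and show a readable message when the mode it intends to use is not in the returned list, rather than letting init fail.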

singhmeenu commented 5 years ago

I am trying to build an Android app for Hindi OCR using tess-two. It runs for many languages except Hindi. For Hindi, the app crashes whenever I try to scan any Hindi text. I tried OEM_TESSERACT_ONLY, OEM_TESSERACT_CUBE_COMBINED, and OEM_CUBE_ONLY, as well as PSM_SINGLE_BLOCK, but the app still does not work. Please suggest a solution.

Crash:
java.lang.IllegalArgumentException: Cube data files not found. See https://github.com/rmtheis/tess-two/issues/239
    at com.googlecode.tesseract.android.TessBaseAPI.init(TessBaseAPI.java:347)
    at com.googlecode.tesseract.android.TessBaseAPI.init(TessBaseAPI.java:303)
    at com.ashomok.tesseractsample.MainActivity.extractText(MainActivity.java:352)

DorisGM commented 5 years ago

I included ara.cube.* and used OEM_TESSERACT_ONLY, but the app still crashes.

I also tried OEM_CUBE_ONLY.

rmtheis / tess-two

Special requirements for Hindi and Arabic OCR #239
