nguyenq / tess4j

Java JNA wrapper for Tesseract OCR API
Apache License 2.0

Unable to set non-English datapath in Tess4J #252

Closed: sleepybear1113 closed this issue 6 months ago

sleepybear1113 commented 11 months ago

I would like to report an issue with setting a datapath that contains non-English characters in Tess4J. The library currently fails to load language data when the datapath contains Chinese characters, which limits its usability for users whose paths are not English-only.

Example: the language file is at D:/测试路径/eng.traineddata. When I set the datapath to D:/测试路径, it prints:

Error opening data file D:/测试路径/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!

java.lang.Error: Invalid memory access
    at net.sourceforge.tess4j.TessAPI1.TessBaseAPIGetUTF8Text(Native Method)
    at net.sourceforge.tess4j.Tesseract1.getOCRText(Tesseract1.java:512)
    at net.sourceforge.tess4j.Tesseract1.doOCR(Tesseract1.java:318)
    at net.sourceforge.tess4j.Tesseract1.doOCR(Tesseract1.java:291)
    at net.sourceforge.tess4j.Tesseract1.doOCR(Tesseract1.java:272)
    at net.sourceforge.tess4j.Tesseract1.doOCR(Tesseract1.java:256)

When I use the path D:/test instead, it works.

Is it possible to modify the library's code to support non-English paths, such as setting a datapath with Chinese characters? This would greatly enhance the flexibility and usability of Tess4J for a wider range of users.

[The text above was translated from the original Chinese using GPT.]
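Since the report says that an ASCII-only path such as D:/test works, one possible workaround is to copy the language files into an ASCII-only directory before handing the path to Tess4J. The following is only a sketch under that assumption and has not been verified against Tesseract itself; the class AsciiTessdata and its methods are hypothetical helpers, not part of Tess4J.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Tess4J.
final class AsciiTessdata {

    /** True if every character of the given path string is 7-bit ASCII. */
    static boolean isAscii(String path) {
        // Encoders are stateful, so a fresh one is created for each call.
        return StandardCharsets.US_ASCII.newEncoder().canEncode(path);
    }

    /**
     * Copies all *.traineddata files from sourceDir into a newly created
     * directory under asciiParent and returns the new directory.
     */
    static Path copyTrainedData(Path sourceDir, Path asciiParent) throws IOException {
        Path target = Files.createTempDirectory(asciiParent, "tessdata");
        try (DirectoryStream<Path> files =
                 Files.newDirectoryStream(sourceDir, "*.traineddata")) {
            for (Path file : files) {
                Files.copy(file, target.resolve(file.getFileName().toString()));
            }
        }
        return target;
    }

    /**
     * Returns dataDir (made absolute) if its path is ASCII-only. Otherwise
     * copies the language files under asciiParent and returns the copy.
     * Assumption: the caller chooses an asciiParent whose path is ASCII-only.
     */
    static Path asciiDatapath(Path dataDir, Path asciiParent) throws IOException {
        Path absolute = dataDir.toAbsolutePath();
        if (isAscii(absolute.toString())) {
            return absolute;
        }
        if (!isAscii(asciiParent.toAbsolutePath().toString())) {
            throw new IOException("Fallback directory is not ASCII-only: " + asciiParent);
        }
        return copyTrainedData(absolute, asciiParent);
    }
}
```

The returned directory could then be passed to Tess4J's setDatapath as a string. Copying costs disk space and startup time, so this is a stopgap rather than a fix for the underlying limitation.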

nguyenq commented 7 months ago

@sleepybear1113

For our test cases, we set the TESSDATA_PREFIX environment variable to each of D:\Test\tessdata-á, D:\Test\tessdata-â, and D:\Test\tessdata-ấ, and ran the tesseract --list-langs command for each. It worked in the first two cases, which use extended ASCII characters, but not in the last, which contains a Unicode character outside that range. The Tesseract engine apparently does not support Unicode characters in the tessdata path.
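The distinction between the three test paths can be checked mechanically. The following is a minimal Java sketch, not part of Tess4J or Tesseract; the class TessdataPathCheck and its categories are hypothetical. It uses ISO-8859-1 as a stand-in for "extended ASCII", which is an assumption, since the code page of the test machine is not stated here.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tess4J.
final class TessdataPathCheck {

    /** Coarse classification of the characters that appear in a path. */
    enum Kind { ASCII, LATIN_1, OTHER_UNICODE }

    static Kind classify(String path) {
        // Encoders are stateful, so a fresh one is created for each check.
        if (StandardCharsets.US_ASCII.newEncoder().canEncode(path)) {
            return Kind.ASCII;
        }
        // Assumption: ISO-8859-1 approximates the "extended ASCII" range.
        if (StandardCharsets.ISO_8859_1.newEncoder().canEncode(path)) {
            return Kind.LATIN_1;
        }
        return Kind.OTHER_UNICODE;
    }

    public static void main(String[] args) {
        // Non-ASCII letters are written as Unicode escapes so that the
        // source compiles the same way under any source file encoding.
        String[] samples = {
            "D:\\test",
            "D:\\Test\\tessdata-\u00e1",      // a with acute
            "D:\\Test\\tessdata-\u00e2",      // a with circumflex
            "D:\\Test\\tessdata-\u1ea5",      // a with circumflex and acute
            "D:/\u6d4b\u8bd5\u8def\u5f84"     // the Chinese path from the report
        };
        for (String sample : samples) {
            System.out.println(classify(sample) + "  " + sample);
        }
    }
}
```

A path classified as OTHER_UNICODE matches the failing cases in this thread (the third test path and the Chinese datapath), so a check like this could be used to warn early instead of waiting for the native call to fail.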

nguyenq commented 7 months ago

Duplicate of Issue #190