Open dgq420377903 opened 6 years ago
That's more on the Java side than Tess4J. It's suggested that you check for the file existence before attempting to do OCR on the image file. If Java does not support the current file name, you may have to use a different file naming that Java supports.
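A minimal sketch of the suggested pre-check, using only the standard `java.nio.file` API. The class and method names are hypothetical, and the check only rules out a missing or unreadable file; it does not test whether the native library can open the path.

```java
import java.nio.file.Files;
import java.nio.file.Path;

final class OcrInputCheck {
    /** Returns true only if the path is an existing, readable regular file. */
    static boolean isReadableFile(Path image) {
        return Files.isRegularFile(image) && Files.isReadable(image);
    }
}
```

If this returns true and OCR still fails, the problem is unlikely to be a missing file.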
thx
This is not a Java issue. I have the same problem here with German umlauts (äüö) in paths. The files definitely exist, and no other part of the software has a problem with them.
It seems more like an encoding problem somewhere along the way in JNA, when non-ASCII filenames are converted from java.lang.String to char* for TessBaseAPIProcessPages. After some googling I have already tried setting the "jna.encoding" property, without success.
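To illustrate why an encoding mismatch is plausible (this is not a confirmed diagnosis), the sketch below prints the bytes a non-ASCII character produces in two charsets. The helper name `hex` is hypothetical. If the Java side and the native side assume different charsets, the native code would see a different filename than the one on disk.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class FilenameBytes {
    /** Formats the bytes of name in the given charset as space-separated hex. */
    static String hex(String name, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : name.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String umlaut = "\u00e4"; // the character ä
        System.out.println(hex(umlaut, StandardCharsets.UTF_8));      // C3 A4
        System.out.println(hex(umlaut, StandardCharsets.ISO_8859_1)); // E4
    }
}
```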
Platform: Windows 7
Java Version: 1.8.0_171-b11
tess4j Version: 4.0.3-SNAPSHOT (also tested in 2.0.1 with Tesseract 3.05.01)
tesseract Version: 4.0.0-beta.1.20180608
I'll try to provide a testcase for you to reproduce.
I added unit tests to help you reproduce the error at https://github.com/maherm/tess4j
@maherm I confirm your findings. TessBaseAPIProcessPages returns immediately when processing a non-ASCII filename. It's something inside JNA.
An interim workaround I can see is to rename the file to an ASCII name (utilizing File.createTempFile?) and rename it back, which is a bit of a hassle.
Or use the TessBaseAPIProcessPage method if you really need the TessResultRenderer API.
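The temp-file workaround could be sketched as follows. This variant copies the input to an ASCII-named temporary file instead of renaming it, so the original is never touched. `AsciiTempCopy`, `PathAction`, and `withAsciiCopy` are hypothetical names; the callback stands in for whatever OCR call you actually make.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class AsciiTempCopy {
    /** Hypothetical callback standing in for the actual OCR call. */
    interface PathAction<T> {
        T apply(Path asciiPath) throws Exception;
    }

    /**
     * Copies original to a temp file whose name is ASCII-only, runs the
     * action on the copy, and always deletes the copy afterwards.
     */
    static <T> T withAsciiCopy(Path original, String suffix, PathAction<T> action)
            throws Exception {
        // The generated file name is ASCII as long as prefix and suffix are.
        Path temp = Files.createTempFile("ocr-input-", suffix);
        try {
            Files.copy(original, temp, StandardCopyOption.REPLACE_EXISTING);
            return action.apply(temp);
        } finally {
            Files.deleteIfExists(temp);
        }
    }
}
```

One caveat: the temp directory itself may contain non-ASCII characters (for example, a Windows user name with an umlaut), in which case a different target directory would be needed.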
@nguyenq Thanks for having a look at this.
> An interim workaround I can see is to rename the file to an ASCII name (utilizing File.createTempFile?) and rename it back, which is a bit of a hassle.
That is kind of what I do at the moment: I make sure that no path containing non-ASCII characters is ever passed to tess4j. It's a rather ugly workaround, but it does the trick for now.
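A guard of this kind might look like the sketch below. The class and method names are hypothetical; it uses the standard `CharsetEncoder.canEncode` check to decide whether a path string is pure ASCII before it is handed to tess4j.

```java
import java.nio.charset.StandardCharsets;

final class AsciiPathGuard {
    /** Returns true if every character of the path is encodable as US-ASCII. */
    static boolean isAscii(String path) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(path);
    }

    public static void main(String[] args) {
        System.out.println(isAscii("C:\\scans\\page1.png"));       // true
        System.out.println(isAscii("C:\\B\u00fccher\\seite.png")); // false
    }
}
```

Paths that fail the check can then be routed through a rename or temp-copy step before OCR.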
I propose reopening this issue until it is fixed.
I run tess4j on Windows 10 and get the following error:
TesseractException: Error during processing error page.
Tesseract.createDocuments (Tesseract.java:565)
I think the cause is a Chinese file name, but I do not know how to solve it.