win10 chinese filename. TesseractException: Error during processing page.

nguyenq / tess4j

Java JNA wrapper for Tesseract OCR API

Apache License 2.0


win10 chinese filename. TesseractException: Error during processing page. #75

Open dgq420377903 opened 6 years ago

dgq420377903 commented 6 years ago

I am running tess4j on Windows 10 and get "TesseractException: Error during processing page." at Tesseract.createDocuments (Tesseract.java:565). I think the cause is the Chinese file name, but I do not know how to solve it.

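    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import net.sourceforge.tess4j.ITesseract;
    import net.sourceforge.tess4j.ITesseract.RenderedFormat;
    import net.sourceforge.tess4j.Tesseract;
    import net.sourceforge.tess4j.TesseractException;

    // Input image and output base name; both contain Chinese characters.
    File imageFile1 = new File(config.getOcrSrcDir(), "2014-中文名-13783.jpg");
    File pdfFile1 = new File(config.getOcrDesDir(), "2014-中文名-13783");
    ITesseract tess = new Tesseract();
    tess.setLanguage("chi_sim");
    try {
      List<RenderedFormat> formats = new ArrayList<RenderedFormat>();
      formats.add(RenderedFormat.PDF);
      String[] images = new String[] {imageFile1.getAbsolutePath()};
      String[] pdfs = new String[] {pdfFile1.getAbsolutePath()};
      // The exception is thrown from this call.
      tess.createDocuments(images, pdfs, formats);
    } catch (TesseractException e) {
      e.printStackTrace();
    }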

nguyenq commented 6 years ago

That's more on the Java side than Tess4J. I suggest checking that the file exists before attempting OCR on the image file. If Java does not support the current file name, you may have to use a naming scheme that Java supports.
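A minimal sketch of such a check, using only the standard java.nio.file API. The class name InputCheck and the default file name input.jpg are hypothetical and chosen for illustration only. If this check passes and OCR still fails, the problem is unlikely to be on the Java side.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class InputCheck {
    /** Returns true only if the JVM sees a readable regular file at this path. */
    static boolean isReadableFile(Path path) {
        return Files.isRegularFile(path) && Files.isReadable(path);
    }

    public static void main(String[] args) {
        // The path comes from the command line; "input.jpg" is a placeholder default.
        Path image = Paths.get(args.length > 0 ? args[0] : "input.jpg");
        if (isReadableFile(image)) {
            System.out.println("Java can read " + image + "; OCR can proceed.");
        } else {
            System.out.println("Java cannot read " + image + "; fix the path first.");
        }
    }
}
```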

dgq420377903 commented 6 years ago

thx

maherm commented 6 years ago

This is not a Java issue. I have the same problem here with German umlauts (äüö) in paths. The files definitely exist, and no other part of the software has a problem with them.
It seems more like an encoding problem somewhere along the way in JNA, when non-ASCII filenames are converted from java.lang.String to char* for TessBaseAPIProcessPages. After some googling I tried setting the "jna.encoding" property, without success.
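To illustrate why the String to char* conversion matters, here is a small sketch that prints the bytes one file name produces under two encodings. It does not call JNA or Tesseract. The class name FilenameBytes is hypothetical, and ISO-8859-1 is used only as a stand-in for a single-byte Windows code page, which is an assumption. The source uses Unicode escapes so that it compiles regardless of the compiler's source encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class FilenameBytes {
    /** Renders the bytes a native char* would hold for the given charset, as hex. */
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The three umlauts (a, u, o with diaeresis) written as Unicode escapes.
        String name = "\u00e4\u00fc\u00f6.png";
        System.out.println("UTF-8:      " + hex(name, StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1: " + hex(name, StandardCharsets.ISO_8859_1));
    }
}
```

Each umlaut takes two bytes in UTF-8 and one byte in ISO-8859-1, so if the native side decodes the bytes with a different encoding than the one used to produce them, it ends up looking for a file name that does not exist.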

Platform: Windows 7
Java version: 1.8.0_171-b11
tess4j version: 4.0.3-SNAPSHOT (also tested in 2.0.1 with Tesseract 3.05.01)
Tesseract version: 4.0.0-beta.1.20180608

I'll try to provide a test case for you to reproduce.

maherm commented 6 years ago

I added unit tests to help you reproduce the error at https://github.com/maherm/tess4j

nguyenq commented 6 years ago

@maherm I confirm your findings. TessBaseAPIProcessPages returns immediately when given a non-ASCII filename. It's something inside JNA.

An interim workaround I can see is to rename the file to an ASCII name (using File.createTempFile?) and rename it back afterwards, which is a bit of a hassle.

Or use the TessBaseAPIProcessPage method if you really need the TessResultRenderer API.
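A rough sketch of the first workaround, using only the standard library. It copies the image to a temporary file instead of renaming it, so the original stays untouched; that is a variation on the suggestion, not the exact approach described. The class name AsciiTempCopy and the prefix "ocr-input-" are hypothetical. Note that only the generated file name is guaranteed to be ASCII: the temp directory itself may still contain non-ASCII characters (for example a Windows user name), and the same restriction may apply to the output path.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

class AsciiTempCopy {
    /** Copies the source to a temp file whose generated name is ASCII-only. */
    static Path copyToAsciiTemp(Path source, String suffix) throws IOException {
        Path temp = Files.createTempFile("ocr-input-", suffix);
        Files.copy(source, temp, StandardCopyOption.REPLACE_EXISTING);
        return temp;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: AsciiTempCopy <image>");
            return;
        }
        Path temp = copyToAsciiTemp(Paths.get(args[0]), ".jpg");
        try {
            // Pass temp.toString() to the OCR call here instead of the original path.
            System.out.println("OCR input: " + temp);
        } finally {
            Files.deleteIfExists(temp); // always clean up the copy
        }
    }
}
```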

maherm commented 6 years ago

@nguyenq Thanks for having a look at this.

> An interim workaround I can see is to rename the file to an ASCII name (using File.createTempFile?) and rename it back afterwards, which is a bit of a hassle.

That is roughly what I do at the moment: I make sure no path containing non-ASCII characters is ever passed to tess4j. It's a rather ugly workaround, but it does the trick for now.
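Such a guard might look like the sketch below, which rejects any path string that cannot be encoded as US-ASCII. The class name AsciiPathGuard and the example paths are hypothetical; the non-ASCII example uses Unicode escapes for the umlauts so the source itself stays ASCII.

```java
import java.nio.charset.StandardCharsets;

class AsciiPathGuard {
    /** Returns true if every character of the path is plain ASCII. */
    static boolean isAscii(String path) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(path);
    }

    public static void main(String[] args) {
        System.out.println(isAscii("C:/scans/page-01.png"));             // prints true
        System.out.println(isAscii("C:/scans/\u00e4\u00fc\u00f6.png"));  // prints false
    }
}
```

A caller would run this on both the input and the output path and fall back to an ASCII-named temporary file whenever it returns false.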

I propose reopening this issue until it is fixed.

nguyenq commented 3 years ago

https://github.com/DanBloomberg/leptonica/issues/537