nguyenq / tess4j

Java JNA wrapper for Tesseract OCR API
Apache License 2.0
1.58k stars 372 forks source link

Getting error Invalid memory access when use Tess4j in JAVA project Netbeans #232

Closed HerickRaposo closed 1 year ago

HerickRaposo commented 2 years ago

I'm developing a routine in my JAVA Web project (Using netbeans 13) where I extract the texts from a pdf. If it doesn't find a certain term, it converts the pdf to image and tries to extract the text with OCR tesseract. After several attempts I always got the same error regardless of the configuration I did. Following error:

Error opening data file tessdata/por.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'por'
Tesseract couldn't load any languages!

Caused by: java.lang.Error: Invalid memory access

I'm having difficulties configuring the library in my development environment because, unlike the tutorials where SpringBoot and Eclipse are used, I use Netbeans 13 and I have the following project structure:

enter image description here

First I added the dependency in pom.xml:

<!-- https://mvnrepository.com/artifact/net.sourceforge.tess4j/tess4j -->
        <dependency>
            <groupId>net.sourceforge.tess4j</groupId>
            <artifactId>tess4j</artifactId>
            <version>5.2.0</version>
        </dependency>

After that as indicated in some tutorials, I went to my tessdata folder which is in dependencies/Tes4j and copied the tessdata folder to the web-inf folder (I also tried to the resources folder).

Later I tried to configure the environment variable TESSDATA_PREFIX. I couldn't find information if this variable was to be defined in the windows system variables or if there is some other place in netbeans to define this. In my project structure I tried three definitions but none worked:

  1. C:\Programacao\myProjectName\Tess4J
  2. C:\Programacao\myProjectName\src\main\webapp\WEB-INF
  3. C:\Programacao\myProjectName\src\main\webapp\resources

In the code of my method I tried to set the datapath passing only the name tessdata, data and I also tried to point out the paths above. Follow created method:

 public String extractText(Anexos anexo) throws Exception {
        File file = new File(anexo.getCaminho());
        PDDocument doc = PDDocument.load(file);

        System.out.println("=================================> Extraindo com pdfBox <=========================");
        PDFTextStripper estripador = new PDFTextStripper();
        estripador.setSortByPosition(false); 
        String pdfTexto = estripador.getText(doc);
        String line = "";

        line = pdfTexto.toLowerCase().replaceAll(AplicacaoBean.CARACTERES_ESPECIAIS_REGEX, "")
                .replaceAll("\\s", " ");

        if (line.contains("sped")) {
            return line;
        } else {

            PDFRenderer pdfRenderer = new PDFRenderer(doc);
            StringBuilder out = new StringBuilder();

            Tesseract tesseract = new Tesseract();
            tesseract.setLanguage("por");
            tesseract.setOcrEngineMode(1);

            Path dataDirectory = Paths.get("tessdata");
            tesseract.setDatapath(dataDirectory.toString());

            for (int page = 0; page < doc.getNumberOfPages(); page++) {
                BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(page, 300, ImageType.RGB);

                // Create a temp image file
                File tempFile = new File(file.getPath().replace(File.separator + anexo.getAnexo(), "") + File.separator + "tempfile_" + anexo.getAnexo().replaceAll("\\..*", "") + "_" + page + ".png");
                ImageIO.write(bufferedImage, "png", tempFile);
                String result = tesseract.doOCR(tempFile);
                out.append(result);

                // Delete temp file
                tempFile.delete();
            }

            line = out.toString().toLowerCase().replaceAll(AplicacaoBean.CARACTERES_ESPECIAIS_REGEX, "")
                    .replaceAll("\\s", " ");
        }

        return line;
    }

My tessdata folder:

enter image description here

My development environment:

Netbeans:13

JDK 15

tess4j:5.2.0

My doubts:

So I would like to know if this environment variable () is configured in windows variables, in netbeans or in some internal place in my code? Also, did I skip any steps? Do I need to download anything else? Please help me, I don't know what else to do!

nguyenq commented 2 years ago

You would set the environment variable TESSDATA_PREFIX in Windows.

https://superuser.com/questions/949560/how-do-i-set-system-environment-variables-in-windows-10

Does your example work with eng pack? I suggest you get it to work with a console application first before moving on to a server application.

nguyenq commented 1 year ago

No response from OP.