nguyenq / tess4j

Java JNA wrapper for Tesseract OCR API
Apache License 2.0

Get words and confidences #213

Open peterkronenberg opened 3 years ago

peterkronenberg commented 3 years ago

I found this repo, at https://github.com/nguyenq/tess4j/tree/master/src/test/java/net/sourceforge/tess4j, which is different from the tess4j 4.5.4 distribution. How is this code different?

The code in TessApiTest has some good examples of getting the confidence values, but I can't figure out how ProgressMonitor is used. Since I didn't need a monitor, I tried to eliminate its usage, but then I get errors about memory leaks. It needs to be passed to the call to TessBaseAPIRecognize. This ProgressMonitor class doesn't exist at all in the tess4j distribution, although the call to TessBaseAPIRecognize does require an ETEXT_DESC argument. Can you explain more?

nguyenq commented 3 years ago

The distribution is based on the tess4j-4 branch (https://github.com/nguyenq/tess4j/tree/tess4j-4). The master branch contains development code for the latest Tesseract 5.x version.

ProgressMonitor is a client class designed to poll the engine for progress status; however, it seems to no longer work. If you want the feature, you'd need to use the recently added TessMonitor API methods. For examples, please consult the Tesseract documentation or its unit tests.

Below is an example of calling the TessBaseAPIAllWordConfidences method. The caller must delete the returned array after use, which I have not been able to do.

    /**
     * Test of TessBaseAPIAllWordConfidences method, of class TessAPI.
     *
     * @throws java.lang.Exception
     */
    @Test
    public void testTessBaseAPIAllWordConfidences() throws Exception {
        logger.info("TessBaseAPIAllWordConfidences");
        File tiff = new File(this.testResourcesDataPath, "eurotext.tif");
        Pix pix = Leptonica1.pixRead(tiff.getPath());
        TessAPI1.TessBaseAPIInit3(handle, datapath, language);
        TessAPI1.TessBaseAPISetImage2(handle, pix);
        IntByReference wordConfidences = TessAPI1.TessBaseAPIAllWordConfidences(handle);
        Pointer confs = wordConfidences.getPointer();
        int i = 0;
        int word = 0;
        while (true) {
            // Pointer.getInt takes a byte offset, so element i is at i * 4 bytes
            int conf = confs.getInt((long) i * Integer.BYTES);
            if (conf == -1) {
                break; // array terminated by -1
            }
            i++;
            if (conf < 0 || conf > 100) {
                continue; // skip invalid confidence value
            }
            word++;
            logger.info("Word Confidence " + word + ": " + conf);
        }

//        IntBuffer ib = IntBuffer.wrap(confs.getIntArray(0, i));
//        TessAPI1.TessDeleteIntArray(ib);

        //release Pix resource
        PointerByReference pRef = new PointerByReference();
        pRef.setValue(pix.getPointer());
        Leptonica1.pixDestroy(pRef);

        assertTrue(i > 0);
    }
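
JNA's Pointer.getInt takes a byte offset, so element n of a native int array sits at offset n * 4. Here is a minimal sketch of scanning a -1-terminated confidence array that way. It uses java.nio.ByteBuffer as a stand-in for the native memory so it runs without Tesseract; the class name, helper names and sample values are made up for illustration and are not part of tess4j.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

class ConfidenceScan {
    /** Collects values from a -1-terminated int array, skipping entries outside 0..100. */
    static List<Integer> readConfidences(ByteBuffer nativeLike) {
        List<Integer> result = new ArrayList<>();
        for (int index = 0; ; index++) {
            // Absolute reads take a BYTE offset, so element `index` lives at index * 4.
            int conf = nativeLike.getInt(index * Integer.BYTES);
            if (conf == -1) {
                break; // terminator reached
            }
            if (conf >= 0 && conf <= 100) {
                result.add(conf);
            }
        }
        return result;
    }

    /** Builds a stand-in for the native array (hypothetical sample values). */
    static ByteBuffer sample(int... values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES)
                                   .order(ByteOrder.nativeOrder());
        for (int v : values) {
            buf.putInt(v);
        }
        return buf;
    }
}
```

With a real Pointer the loop has the same shape; only the read call changes to confs.getInt with the same byte offset.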

peterkronenberg commented 3 years ago

Thank you.

What do you mean when you say you haven't been able to free the array? I see the code you have commented. Do you mean it's not working?

What would be the best way to have a reusable instance that I can feed multiple files in succession? If I initialize TessAPI1 just once with TessAPI1.TessBaseAPIInit3(handle, datapath, language);

could I then reuse that instance to process multiple files like this?

// Process 1st file
File tiff = new File(this.testResourcesDataPath, "file1.tif");
Pix pix = Leptonica1.pixRead(tiff.getPath());
TessAPI1.TessBaseAPISetImage2(handle, pix);
.
.
.
// close resource
PointerByReference pRef = new PointerByReference();
pRef.setValue(pix.getPointer());
Leptonica1.pixDestroy(pRef);

// Process 2nd file (variables are reassigned rather than redeclared)
tiff = new File(this.testResourcesDataPath, "file2.tif");
pix = Leptonica1.pixRead(tiff.getPath());
TessAPI1.TessBaseAPISetImage2(handle, pix);
.
.
.
// close resource
pRef = new PointerByReference();
pRef.setValue(pix.getPointer());
Leptonica1.pixDestroy(pRef);

// when I'm all done, are there any other resources that need to be closed/released?

peterkronenberg commented 3 years ago

I just realized this only returns the confidences and not the words. How do I get the words? TessBaseAPIGetUTF8Text only returns a single word. Is there a way to get the words and confidences at once?

nguyenq commented 3 years ago

Right, I haven't been able to free the array.

Your approach for multiple images looks all right, but beware of memory leaks from the Tesseract library. You may want to start a new instance after a certain number of images.
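
The "new instance after N images" idea could be sketched roughly as below. OcrEngine and RecyclingOcr are hypothetical names, not tess4j classes, and the recycling threshold is something you would have to tune yourself. In a real implementation the factory would presumably create and initialize a handle (TessBaseAPICreate followed by TessBaseAPIInit3) and close() would release it (TessBaseAPIDelete).

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for an OCR engine that holds native resources. */
interface OcrEngine extends AutoCloseable {
    String recognize(String imagePath);

    @Override
    void close();
}

/** Reuses one engine for up to maxUses calls, then closes it and creates a fresh one. */
class RecyclingOcr implements AutoCloseable {
    private final Supplier<OcrEngine> factory;
    private final int maxUses;
    private OcrEngine current;
    private int uses;

    RecyclingOcr(Supplier<OcrEngine> factory, int maxUses) {
        if (maxUses < 1) {
            throw new IllegalArgumentException("maxUses must be >= 1");
        }
        this.factory = factory;
        this.maxUses = maxUses;
    }

    String recognize(String imagePath) {
        if (current == null || uses >= maxUses) {
            close();                 // release the old engine, if any
            current = factory.get(); // start a fresh one
        }
        uses++;
        return current.recognize(imagePath);
    }

    @Override
    public void close() {
        if (current != null) {
            current.close();
            current = null;
        }
        uses = 0;
    }
}
```

The caller still has to call close() once at the very end so the last engine is released.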

You need to read the documentation better. There's a Tesseract.getWords method that can get both the text and its confidence value.

peterkronenberg commented 3 years ago

OK, I see now. It wasn't immediately obvious that getWords() also includes the confidence. The Javadoc is fine to some extent: it documents the methods, but it's not always clear how to use them. For example, I'm not sure what the pageIteratorLevel is. Do I just start off with 0? In my particular use case, each file just has a few words, on a single page. And there is no real documentation about how all the other structures work, such as the various iterators.
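
For reference, Tesseract's PageIteratorLevel enum defines five granularities: RIL_BLOCK = 0, RIL_PARA = 1, RIL_TEXTLINE = 2, RIL_WORD = 3 and RIL_SYMBOL = 4. So 0 asks for whole blocks, and word-level results would use 3. As far as I know, tess4j exposes these as constants in TessPageIteratorLevel. The enum below only mirrors those values for illustration; IteratorLevel and its constant names are made up here and are not tess4j API.

```java
/** Mirrors the values of Tesseract's PageIteratorLevel; the Java names are illustrative. */
enum IteratorLevel {
    BLOCK(0),      // RIL_BLOCK: block of text, image or separator line
    PARAGRAPH(1),  // RIL_PARA: paragraph within a block
    TEXT_LINE(2),  // RIL_TEXTLINE: line within a paragraph
    WORD(3),       // RIL_WORD: word within a line
    SYMBOL(4);     // RIL_SYMBOL: symbol/character within a word

    final int value;

    IteratorLevel(int value) {
        this.value = value;
    }

    /** Looks up the level for a raw integer, as passed to the native API. */
    static IteratorLevel fromValue(int value) {
        for (IteratorLevel level : values()) {
            if (level.value == value) {
                return level;
            }
        }
        throw new IllegalArgumentException("Unknown level: " + value);
    }
}
```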

Also, getWords() initializes TessAPI and disposes of it on every call. It seems like it would be more performant to write my own version that copies the code in init(), calls it once, and then copies the code in getWords(). It would be nice if the code already contained the building blocks I need so that I didn't have to replicate the code myself. I would have liked to be able to just create an instance of Tesseract, call tess.init(), and then call another version of tess.getWords() that doesn't do the setup and teardown, leaving that aspect to the caller.

nguyenq commented 3 years ago

The building blocks are in the TessAPI class, which mirrors the C API of the Tesseract native library. The provided unit tests and the Tesseract class already demonstrate typical usages of the API. It's unrealistic to expect all possible use cases to be documented. If you want to go into more depth, you need to consult Tesseract's native code and documentation.

You can either implement your custom class or extend the existing ones.

Good luck.

peterkronenberg commented 3 years ago

I guess I'm talking about higher-level building blocks. I can extend Tesseract and call init() since it's protected, but getWords() is doing too much, so I'd have to implement my own to separate the init and destroy from the core functionality of getting the words. This is true for most of the methods that call init(). It would be more useful if getWords() was implemented like this:

init()
_getWords()
destroy()

Where _getWords() has the core functionality. This would allow someone to call getWords() and get the same behavior as today, but it would also allow someone to call _getWords() directly and handle the initialization and cleanup themselves.
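
To make the proposal concrete, here is a rough sketch of that split. WordReader and getWordsCore are hypothetical names (getWordsCore plays the role of _getWords()), the engine work is replaced by a placeholder, and the log list exists only to show the call order; none of this is actual tess4j code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical wrapper showing the init / core / destroy split. */
class WordReader {
    final List<String> log = new ArrayList<>(); // records lifecycle calls for illustration
    private boolean initialized;

    void init() {
        initialized = true;
        log.add("init");
    }

    void destroy() {
        initialized = false;
        log.add("destroy");
    }

    /** Core step only: the caller must have called init() beforehand. */
    List<String> getWordsCore(String image) {
        if (!initialized) {
            throw new IllegalStateException("call init() first");
        }
        log.add("words:" + image);
        return Collections.singletonList("word-from-" + image); // placeholder result
    }

    /** Convenience method with the current behavior: init, core step, destroy. */
    List<String> getWords(String image) {
        init();
        try {
            return getWordsCore(image);
        } finally {
            destroy(); // always release, even if the core step throws
        }
    }
}
```

A batch caller would then call init() once, getWordsCore() per file, and destroy() at the end, while existing callers of getWords() would see no change.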

I appreciate your help and all the work you have put into this.