nguyenq / tess4j

Java JNA wrapper for Tesseract OCR API
Apache License 2.0

doOCR() vs. createDocuments() / createDocumentsWithResults() #241

Closed vinodmap closed 1 year ago

vinodmap commented 1 year ago

Hello, this is a great tool, thank you. We are using version 5.1.0. Why doesn't the text generated by doOCR() match the text generated by createDocuments() / createDocumentsWithResults()? We pass the same TIFF file to both functions and the resulting text differs. Essentially, we want to know which of these we should use. If they produced the same results, we could confidently choose either, but since they don't, we need to decide which one is better. We don't understand why they don't match exactly.

For example, a snippet of our code for createDocumentsWithResults():

List<RenderedFormat> listRF = new ArrayList<RenderedFormat>();
listRF.add(RenderedFormat.PDF);
listRF.add(RenderedFormat.TEXT);
aryStrFileInputPaths[0] = "<SOME_PATH>\\tiff1.tiff";
List<OCRResult> listOCRResult = tessInst.createDocumentsWithResults(aryStrFileInputPaths, aryStrFileOutputPaths, listRF, 0);
String strResultCreateDocs = listOCRResult.toString();

Note, aryStrFileInputPaths is a String Array with paths of the tiffs, and aryStrFileOutputPaths is a String Array of pdf filenames to be generated.

An example of our code for doOCR() is:

String strImagePagePathAndFileName = "<SOME_PATH>\\tiff1.tiff";
File fileObjImage = new File(strImagePagePathAndFileName);
String strResultDoOCR = tessInst.doOCR(fileObjImage);

The one expected difference is that strResultCreateDocs includes extra text such as "[Average Text Confidence: 82% Words:" and "[Confidence: 82.831604 Bounding box: 100 313 2298 2906]". We understand these differences. Aside from them, we would expect the text to match exactly, but in many cases there are differences in the text throughout the page.

A previous response in another issue (https://github.com/nguyenq/tess4j/issues/154#issuecomment-503820107) states: "New APIs similar to createDocuments have been added to support buffered images as input."

But that does not answer our question. Are createDocumentsWithResults() and doOCR() using different engines to generate the OCR'd text, and if not, why do they generate different results? If there is a way to get them to generate exactly the same text, any info on how to do that would be most appreciated. Thank you.

gratefulcreative commented 1 year ago

Great question - I'm dealing with the same issue : 0

nguyenq commented 1 year ago

They invoke different API methods provided by the Tesseract OCR engine. doOCR calls GetxxxText, which returns plain text strings; the createDocuments methods use ResultRenderer, which produces output files. They follow different execution paths in the Tesseract code and are thus likely to produce different output.

https://github.com/tesseract-ocr/tesseract/blob/main/include/tesseract/capi.h
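
Because the two paths may legitimately differ, it can help to pin down where the outputs diverge before choosing one. The following is a minimal sketch that uses only the Java standard library and is not part of Tess4J. The class and method names (OcrTextDiff, normalizedLines, firstDifference) and the sample strings are hypothetical. It ignores whitespace-only differences, on the assumption that the two paths may lay out spaces and blank lines differently, and reports the first line where the recognized text itself differs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for comparing two OCR outputs; not a Tess4J class.
final class OcrTextDiff {

    /** Splits text into lines, collapses whitespace runs, and drops blank lines. */
    static List<String> normalizedLines(String text) {
        List<String> lines = new ArrayList<>();
        for (String raw : text.split("\\R")) {
            String line = raw.trim().replaceAll("\\s+", " ");
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        return lines;
    }

    /** Returns the index of the first differing normalized line, or -1 if none. */
    static int firstDifference(String a, String b) {
        List<String> la = normalizedLines(a);
        List<String> lb = normalizedLines(b);
        int common = Math.min(la.size(), lb.size());
        for (int i = 0; i < common; i++) {
            if (!la.get(i).equals(lb.get(i))) {
                return i;
            }
        }
        // One text is a prefix of the other, or both are identical.
        return la.size() == lb.size() ? -1 : common;
    }

    public static void main(String[] args) {
        // Made-up sample strings standing in for the two OCR results.
        String fromDoOcr = "Invoice  No. 42\n\nTotal: 100\n";
        String fromRenderer = "Invoice No. 42\nTotal: 1OO\n";
        System.out.println(firstDifference(fromDoOcr, fromRenderer));
    }
}
```

In practice the two arguments would be the string returned by doOCR() and the text produced through the createDocuments path, with the confidence and bounding-box annotations stripped first.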

vinodmap commented 1 year ago

Thanks. Any idea why the Tesseract OCR engine provides these two different execution paths with different results? We are not sure why there are two output options. Is one better than the other?

nguyenq commented 1 year ago

GetxxxText has been provided since the beginning of the Tesseract library. The ResultRenderer API was added in recent years and is the interface for rendering Tesseract results into a document, such as text, hOCR, or PDF; it is the only method that can produce PDF documents.
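
Since ResultRenderer writes files, a comparison against doOCR() may be cleaner if it reads the rendered text file directly instead of using listOCRResult.toString(), which mixes in confidence and bounding-box annotations. The sketch below uses only the Java standard library. It assumes the TEXT renderer writes its output to the output base name plus a ".txt" suffix, which is worth checking against your own output directory. RenderedTextReader and readRenderedText are hypothetical names, not Tess4J APIs.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; not a Tess4J class.
final class RenderedTextReader {

    /**
     * Reads "<outputBase>.txt" as UTF-8.
     * The ".txt" suffix is an assumption about how the TEXT renderer names its file.
     */
    static String readRenderedText(String outputBase) throws IOException {
        Path txt = Paths.get(outputBase + ".txt");
        return new String(Files.readAllBytes(txt), StandardCharsets.UTF_8);
    }
}
```

The returned string could then be compared with the result of doOCR() for the same input image.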