nguyenq / tess4j

Java JNA wrapper for Tesseract OCR API
Apache License 2.0

The recognition results of tesseract-ocr and tess4j are not the same #261

Closed. ciyushan closed this issue 6 months ago.

ciyushan commented 6 months ago

Environment: Windows 11, JDK 17

cmd:

tesseract -v

out:

tesseract v5.3.0.20221222
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX512BW
 Found AVX512F
 Found AVX512VNNI
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found libarchive 3.5.0 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5 libzstd/1.4.5
 Found libcurl/7.77.0-DEV Schannel zlib/1.2.11 zstd/1.4.5 libidn2/2.0.4 nghttp2/1.31.0

tess4j.version: 5.10.0

I trained a new Chinese language model (traineddata). With it, recognition with tesseract-ocr on the command line is exactly as expected, but the result is not as good in tess4j.

The java code is as follows:

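import java.io.File;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class Tess4jDemo {
    public static void main(String[] args) {
        // Backslashes in Windows paths must be escaped in Java string literals.
        File imageFile = new File("E:\\tess4j\\src\\main\\resources\\1.png");
        ITesseract instance = new Tesseract();
        // Directory that holds the trained language data.
        instance.setDatapath("E:\\tess4j\\src\\main\\resources");
        instance.setLanguage("jslang");
        try {
            long startTime = System.currentTimeMillis();
            String result = instance.doOCR(imageFile);
            System.out.println("Result:\n" + result);
            long endTime = System.currentTimeMillis();
            System.out.println("Time is:" + (endTime - startTime) + " ms");
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}

out:

那时我惟一的希望, 就在这雷峰塔的倒掉。 后来我长大了, 到杭州, 看见这 破破烂烂的塔, 心里就不舒服。 后来我看看书, 说杭州人又叫这塔作 “保叔 塔” , 其实应该写作 “保 (左人右叔) 塔” , 是钱王的儿子造的。 那么儿 里 面当然没有白蛇娘娘了, 然而我心里仍然不舒服, 仍然希望他倒掉。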

现在, 他居然倒掉了, 则普天之下的人民, 其欣喜为何如? 这是有事实可证

cmd:

tesseract 1.png result -l jslang

out:

那时我惟一的希望,就在这雷峰塔的倒掉。后来我长大了,到杭州,看见这 破破烂烂的塔,心里就不舒服。后来我看看书,说杭州人又叫这塔作“保叔 塔” , 其实应该写作 “保 (左人右叔) 塔” , 是钱王的儿子造的。 那么 , 里 面当然没有白蛇娘娘了,然而我心里仍然不舒服,仍然希望他倒掉。

现在, 他居然倒掉了, 则普天之下的人民, 其欣喜为何如? 这是有事实可证

The image and the language data are the same, but the results differ. Is there a version conflict between tess4j and tesseract? Or does tess4j need extra parameters to be set? How can I make tess4j recognize text the same way as tesseract? I can provide any relevant information; please help me solve this problem.

nguyenq commented 6 months ago

Can you try in VietOCR3, a GUI which uses Tess4J library?

nguyenq commented 6 months ago

Duplicate of https://github.com/nguyenq/tess4j/issues/259

ciyushan commented 6 months ago

"Can you try in VietOCR3, a GUI which uses Tess4J library?"

Thanks for the reply. Following your suggestion, I got the result I needed through the API using VietOCR3 version 6.12.0. Thank you very much!

ciyushan commented 6 months ago

Duplicate of #259

The problem is not solved; I was wrong before. I have now tried debugging with VietOCR3 and found no exceptions, but the command-line result and the result from code are still not the same.

tessdata file: jslang.traineddata
img: 1.png
Attachments: 1, tessdata.zip

ciyushan commented 6 months ago

Duplicate of #259

I suspect the difference lies in how calls from code and calls from the command line handle images, as I've found a similar issue in another OCR framework: https://github.com/hiroi-sora/Umi-OCR/issues/272
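One way to explore the image-handling suspicion is to look at what Java sees when it decodes the file. The sketch below uses only the JDK's `javax.imageio` and assumes (this is not confirmed anywhere in this thread) that the Java path decodes the image before handing pixels to the OCR engine. `ImageProbe` and `describe` are hypothetical names used for illustration.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

final class ImageProbe {

    /** Summarizes the decoded image: size, BufferedImage type code, alpha, bit depth. */
    static String describe(BufferedImage img) {
        return img.getWidth() + "x" + img.getHeight()
                + " type=" + img.getType()
                + " alpha=" + img.getColorModel().hasAlpha()
                + " bitsPerPixel=" + img.getColorModel().getPixelSize();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java ImageProbe <image file>");
            return;
        }
        // ImageIO.read returns null when no registered reader can decode the file.
        BufferedImage img = ImageIO.read(new File(args[0]));
        if (img == null) {
            System.err.println("No ImageIO reader could decode " + args[0]);
            return;
        }
        System.out.println(describe(img));
    }
}
```

If the report shows an alpha channel or an unexpected bit depth, flattening the image to plain grayscale or RGB before OCR would be a reasonable experiment, though it may not be the cause here.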

nguyenq commented 6 months ago

@ciyushan Did you setPageSegMode to 3 as the other poster did to fix their issue?
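To rule out differing defaults, the page segmentation mode can be pinned on both sides: `setPageSegMode(3)` in Tess4J, as asked above, and `--psm 3` on the command line. The sketch below only builds the command-line half so it can be launched from the same Java program. `TesseractCli` and `command` are hypothetical helper names; the file names are the ones from this issue.

```java
import java.util.ArrayList;
import java.util.List;

final class TesseractCli {

    /** Builds a tesseract invocation with an explicit page segmentation mode. */
    static List<String> command(String image, String outputBase, String lang, int psm) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(image);
        cmd.add(outputBase);
        cmd.add("-l");
        cmd.add(lang);
        cmd.add("--psm");
        cmd.add(Integer.toString(psm));
        return cmd;
    }

    public static void main(String[] args) {
        // PSM 3 is fully automatic page segmentation without orientation/script detection.
        List<String> cmd = command("1.png", "result", "jslang", 3);
        System.out.println(String.join(" ", cmd));
        // To actually run it (requires tesseract on the PATH):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

With the mode fixed explicitly in both places, any remaining difference is less likely to come from segmentation settings.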

Please attach your test image. Also, can you highlight the discrepancy? To us, the two results look very similar.
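Since the two outputs differ mostly in spacing and punctuation width, a small helper can locate the first real discrepancy. This is a minimal sketch using only the JDK: it folds full-width forms with NFKC, drops whitespace, and reports the first index where the normalized strings differ. `OcrDiff` and its methods are hypothetical names, and the sample strings are short excerpts of the two results posted above.

```java
import java.text.Normalizer;

final class OcrDiff {

    /** Folds full-width forms to their ASCII equivalents and removes all whitespace. */
    static String normalize(String text) {
        String folded = Normalizer.normalize(text, Normalizer.Form.NFKC);
        StringBuilder sb = new StringBuilder(folded.length());
        for (int i = 0; i < folded.length(); i++) {
            char c = folded.charAt(i);
            if (!Character.isWhitespace(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Returns the first differing index in the normalized strings, or -1 if they match. */
    static int firstDifference(String a, String b) {
        String x = normalize(a);
        String y = normalize(b);
        int limit = Math.min(x.length(), y.length());
        for (int i = 0; i < limit; i++) {
            if (x.charAt(i) != y.charAt(i)) {
                return i;
            }
        }
        // Equal up to the shorter length: they differ only if one is longer.
        return x.length() == y.length() ? -1 : limit;
    }

    public static void main(String[] args) {
        String fromTess4j = "那么儿 里 面当然没有白蛇娘娘了";
        String fromCli = "那么 , 里 面当然没有白蛇娘娘了";
        int at = firstDifference(fromTess4j, fromCli);
        System.out.println(at < 0
                ? "No difference beyond spacing and character width"
                : "First difference at normalized index " + at);
    }
}
```

On these excerpts the helper points at the third character, where the Tess4J result has "儿" and the command-line result has a comma.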