The recognition results of tesseract-ocr and tess4j are not the same

ciyushan commented 6 months ago

Environment: Windows 11, JDK 17

cmd: tesseract -v

out:

tesseract v5.3.0.20221222
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX512BW
Found AVX512F
Found AVX512VNNI
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found libarchive 3.5.0 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5 libzstd/1.4.5
Found libcurl/7.77.0-DEV Schannel zlib/1.2.11 zstd/1.4.5 libidn2/2.0.4 nghttp2/1.31.0

tess4j.version: 5.10.0

I trained a new Chinese language model (traineddata). With it, tesseract-ocr on the command line recognizes the image exactly as expected, but the result from tess4j is not as good.

The Java code is as follows:

import java.io.File;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

// Backslashes in Windows paths must be escaped inside Java string literals.
File imageFile = new File("E:\\tess4j\\src\\main\\resources\\1.png");
ITesseract instance = new Tesseract();
// Directory that contains jslang.traineddata
instance.setDatapath("E:\\tess4j\\src\\main\\resources");
instance.setLanguage("jslang");
try {
    long startTime = System.currentTimeMillis();
    String result = instance.doOCR(imageFile);
    System.out.println("Result:\n" + result);
    long endTime = System.currentTimeMillis();
    System.out.println("Time is: " + (endTime - startTime) + " ms");
} catch (TesseractException e) {
    System.err.println(e.getMessage());
}

out:

那时我惟一的希望，就在这雷峰塔的倒掉。后来我长大了，到杭州，看见这破破烂烂的塔，心里就不舒服。后来我看看书，说杭州人又叫这塔作 “保叔塔” ，其实应该写作 “保（左人右叔）塔” ，是钱王的儿子造的。那么儿里面当然没有白蛇娘娘了，然而我心里仍然不舒服，仍然希望他倒掉。

现在，他居然倒掉了，则普天之下的人民，其欣喜为何如? 这是有事实可证

cmd：

tesseract 1.png result -l jslang

out：

那时我惟一的希望，就在这雷峰塔的倒掉。后来我长大了，到杭州，看见这破破烂烂的塔，心里就不舒服。后来我看看书，说杭州人又叫这塔作“保叔塔” ，其实应该写作 “保（左人右叔）塔” ，是钱王的儿子造的。那么，里面当然没有白蛇娘娘了，然而我心里仍然不舒服，仍然希望他倒掉。

现在，他居然倒掉了，则普天之下的人民，其欣喜为何如? 这是有事实可证

The image and the language data are the same, but the results differ. Is there a conflict between the tess4j and tesseract versions, or does tess4j need extra parameters to be set? How can I make tess4j produce the same result as the tesseract command line? Could you provide any relevant information to help me solve the problem?
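One thing that may be worth ruling out first: the command line typically resolves jslang.traineddata from its own tessdata directory (or TESSDATA_PREFIX), while the Java code reads it from the path passed to setDatapath, so the two runs are not guaranteed to load the same file. Below is a minimal sketch, using only the JDK, that fingerprints two files with SHA-256. The class name FileFingerprint and its methods are made up for this illustration, and the two paths are whatever you pass as arguments.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Tess4J or Tesseract.
class FileFingerprint {

    /** Returns the SHA-256 digest of the file's bytes as lowercase hex. */
    static String sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] hash = digest.digest(Files.readAllBytes(file));
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** True when both files have byte-for-byte identical content. */
    static boolean sameContent(Path a, Path b) throws IOException, NoSuchAlgorithmException {
        return sha256(a).equals(sha256(b));
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java FileFingerprint <file used by CLI> <file used by Tess4J>");
            return;
        }
        Path cliFile = Paths.get(args[0]);
        Path javaFile = Paths.get(args[1]);
        System.out.println(sha256(cliFile) + "  " + cliFile);
        System.out.println(sha256(javaFile) + "  " + javaFile);
        System.out.println(sameContent(cliFile, javaFile) ? "identical" : "DIFFERENT");
    }
}
```

The same check can be run on the two copies of 1.png if the image is read from different locations.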

nguyenq commented 6 months ago

Can you try in VietOCR3, a GUI which uses Tess4J library?

nguyenq commented 6 months ago

Duplicate of https://github.com/nguyenq/tess4j/issues/259

ciyushan commented 6 months ago

Can you try in VietOCR3, a GUI which uses Tess4J library?

Thanks for the reply. Following your suggestion, I got the result I needed through the API using VietOCR3 version 6.12.0. Thank you very much!

ciyushan commented 6 months ago

Duplicate of #259

The problem is not solved; I was wrong before. I have now tried debugging with VietOCR3 and found no exceptions, but the command-line result and the result from code are still not the same.

tessdata file: jslang.traineddata
img: 1.png
tessdata.zip

ciyushan commented 6 months ago

Duplicate of #259

I suspect the difference lies in how calls from code and calls from the command line handle images, as I have found a similar issue in another OCR framework: https://github.com/hiroi-sora/Umi-OCR/issues/272

nguyenq commented 6 months ago

@ciyushan Did you setPageSegMode to 3 as the other poster did to fix their issue?
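For a like-for-like comparison, the command line can be given the page segmentation mode explicitly through its --psm option, with the Java side set to the same value via instance.setPageSegMode(3). The sketch below covers only the command-line half: it builds the tesseract command and can optionally run it from Java with the JDK's ProcessBuilder. The class TesseractCli and its methods are hypothetical helpers, and running it assumes tesseract is on the PATH and can locate jslang.traineddata.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for comparing CLI output with Tess4J output.
class TesseractCli {

    /** Builds a command that prints the recognized text to stdout using an explicit --psm value. */
    static List<String> buildCommand(String imagePath, String language, int psm) {
        List<String> command = new ArrayList<>();
        command.add("tesseract");
        command.add(imagePath);
        command.add("stdout");          // write the text to standard output instead of a file
        command.add("-l");
        command.add(language);
        command.add("--psm");
        command.add(Integer.toString(psm));
        return command;
    }

    /** Runs the command and returns its standard output decoded as UTF-8. */
    static String run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Keep warnings on the console so they do not mix into the recognized text.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();
        String text = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        List<String> command = buildCommand("1.png", "jslang", 3);
        System.out.println(String.join(" ", command));
        // Pass --run to actually invoke tesseract (requires a local installation).
        if (args.length > 0 && args[0].equals("--run")) {
            System.out.print(run(command));
        }
    }
}
```

The string returned by run can then be compared directly with the result of instance.doOCR(imageFile) obtained with the same mode.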

Please attach your test image. And can you highlight the discrepancy/difference as we see the two results look very similar?
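Since the two outputs differ by only a few characters, a small program can point at the first mismatch rather than relying on reading both texts side by side. This is a minimal sketch using only the JDK; the class OcrDiff and its methods are made up for this illustration, and the two sample strings are short excerpts from the outputs posted above.

```java
// Hypothetical helper for locating where two OCR results diverge.
class OcrDiff {

    /** Returns the index of the first differing char, or -1 if the strings are equal. */
    static int firstDifference(String a, String b) {
        int shared = Math.min(a.length(), b.length());
        for (int i = 0; i < shared; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return i;
            }
        }
        // One string is a prefix of the other, or they are identical.
        return a.length() == b.length() ? -1 : shared;
    }

    /** Returns up to `radius` chars on each side of `index`, clamped to the string bounds. */
    static String context(String s, int index, int radius) {
        int from = Math.max(0, index - radius);
        int to = Math.min(s.length(), index + radius + 1);
        return s.substring(from, to);
    }

    public static void main(String[] args) {
        String tess4j = "是钱王的儿子造的。那么儿里面当然没有白蛇娘娘了";
        String cli = "是钱王的儿子造的。那么，里面当然没有白蛇娘娘了";
        int index = firstDifference(tess4j, cli);
        if (index < 0) {
            System.out.println("No difference");
        } else {
            System.out.println("First difference at index " + index);
            System.out.println("tess4j: " + context(tess4j, index, 2));
            System.out.println("cli:    " + context(cli, index, 2));
        }
    }
}
```

To find every mismatch rather than the first, the same comparison can be repeated on the remainder of both strings after each reported index.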

The recognition results of tesseract-ocr and tess4j are not the same #261