tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

extra spaces in Chinese OCR with pdf.js #3904

Closed eyagarci closed 2 years ago

eyagarci commented 2 years ago

Hello, I ran image recognition on Chinese-language documents. I found that the resulting text has different spacing between its characters in pdf.js. I used preserve_interword_spaces=1 to remove the extra spaces, but I didn't see any difference.
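For readers unfamiliar with the setting, the sketch below shows one way it could be passed to the Tesseract CLI through `-c`. The file names and the `chi_tra` language code are assumptions for illustration, not details from this report, and `build_tesseract_command` is a hypothetical helper.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang="chi_tra"):
    """Build a Tesseract CLI call that writes a searchable PDF.

    `lang` is an assumed value; the report does not say which
    language model was used.
    """
    return [
        "tesseract", image_path, output_base,
        "-l", lang,
        "-c", "preserve_interword_spaces=1",  # the variable named in the report
        "pdf",  # config file that selects PDF output
    ]


if __name__ == "__main__":
    # Hypothetical file names; requires tesseract on PATH.
    subprocess.run(build_tesseract_command("page.png", "page"), check=True)
```
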

I ran some tests with other viewers, such as Adobe Acrobat Reader and Chrome, and found that the results differ. Do you have any idea how to solve this problem with pdf.js?


Pdf.js:

  1. 每 日 測試樣品 , 務 必 先做儀器 日校正, 並於每 第 一次 測試樣品 , 需 先做儀器 週校正

Adobe Acrobat Reader:

  1. 每日測試樣品, 務必先做儀器日校正, 並於每第一次測試樣品, 需先做儀器週校正

Chrome:

  1. 每日 測試樣品, 務必先做儀器 日校正, 並於每第 一次測試樣品, 需 先做儀器週校正

Environment:

Tesseract: 4.0.0, Windows 10 (64-bit)

amitdo commented 2 years ago

This is a duplicate of other past reports from other users. See label:non spaced words

The issue is twofold: 1) Tesseract produces spaces between Chinese glyphs. 2) Different PDF viewers can render the same file differently.

Currently, there is no solution to this issue.
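That statement concerns the PDF text layer and how viewers render it. For plain text that has already been extracted, a post-processing pass can drop the whitespace between adjacent CJK characters. The sketch below is only an illustration of that idea, not something Tesseract or pdf.js provides; `strip_cjk_spaces` is a hypothetical helper, and the Unicode ranges it uses are an assumption about what should count as a CJK character.

```python
import re

# Assumed character set: CJK ideographs, CJK punctuation, fullwidth forms.
_CJK = r"\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\u3001-\u303f\uff01-\uff5e"

# Spaces or tabs that have a CJK character on both sides.
_GAP = re.compile(rf"(?<=[{_CJK}])[ \t]+(?=[{_CJK}])")


def strip_cjk_spaces(text: str) -> str:
    """Remove spaces between CJK characters; leave other spacing alone."""
    return _GAP.sub("", text)


if __name__ == "__main__":
    print(strip_cjk_spaces("每 日 測試樣品 務 必"))
```

Spaces next to Latin letters, digits, or ASCII punctuation are kept, so mixed-language lines are not collapsed.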