ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

extra space in the result pdf when the input pdf is in Chinese #715

Open Eyxxxxx opened 3 years ago

Eyxxxxx commented 3 years ago

Hi. First, sorry for my poor English.

Description: I recently upgraded my Tesseract engine from v4.0.0.20181030 to v5.0.0-alpha.20201127, and two things changed. First, when I OCR a PDF containing only English text, there is now a space between every word, which is good; with v4.0 I did not get those spaces. That is, I used to get text like "thereisnospacebetweenwords " and now I get "there is no space between words ". Second, with the v5.0 engine the output is wrong when the input PDF is in Chinese: there is an extra space between every character. The result now looks like 每 个 字 之 间 都 有 多 余 的 空 格 。 ("There are extra spaces between every character.") FYI, I did not get those extra spaces when using ocrmypdf to OCR Chinese PDFs with Tesseract v4.0.

To Reproduce: My Tesseract engines, downloaded from https://digi.bib.uni-mannheim.de/tesseract/, are:
tesseract-ocr-w64-setup-v4.0.0.20181030.exe
tesseract-ocr-w64-setup-v5.0.0-alpha.20201127.exe

I just ran ocrmypdf input_name.pdf OCR-output_name.pdf -l chi_sim on the command line.

Expected behavior: I would like to keep the spaces in English text while omitting the extra spaces in Chinese text.

System (please complete the following information):

Additional context: I tested the Tesseract v5.0 engine directly, and the output text is fine when I use the parameter --psm 6. But the parameter does not seem to work as well through ocrmypdf. (It does help a little in ocrmypdf: the text layer changed from 每 个 字 都 有 多 余 的 空 格 to 每 个 字 都有 多余 的 空格.)

FYI: The solution for the extra space in CJK in tesseract https://github.com/tesseract-ocr/tesseract/issues/991

Please let me know if you need any further information. Thanks!

jbarlow83 commented 3 years ago

The equivalent to --psm 6 in ocrmypdf is --tesseract-psm 6.

For the WinError, try running with the argument --verbose 2. That should allow us to see what is happening immediately before this exception to resolve that issue.

You can also try running ocrmypdf --sidecar output.txt. If there are extra spaces in the sidecar file, then the problem lies with tesseract.
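The sidecar check suggested above can be made less subjective with a small script. The following is a minimal sketch, not part of ocrmypdf: the function name `count_cjk_gaps` and the chosen Unicode ranges are assumptions made for illustration, and the ranges cover only the common CJK ideographs, CJK punctuation, and full-width forms.

```python
import re

# Assumed character ranges: CJK punctuation, CJK Unified Ideographs,
# and half-width/full-width forms. Rarer CJK blocks are not covered.
_CJK = r"\u3001-\u303f\u4e00-\u9fff\uff00-\uffef"

# A run of spaces or tabs sitting directly between two CJK characters.
_CJK_GAP = re.compile(rf"(?<=[{_CJK}])[ \t]+(?=[{_CJK}])")


def count_cjk_gaps(text: str) -> int:
    """Return how many whitespace runs sit between two CJK characters."""
    return len(_CJK_GAP.findall(text))
```

Running this on the contents of the sidecar file and on text copied from the output PDF would show whether the extra spaces are already present in Tesseract's text output or only appear in the PDF text layer.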

Eyxxxxx commented 3 years ago

Thanks for your reply!

The details are in Reply.pdf, since a lot of output appears when I use the argument --verbose 2. The zip contains my test material. My Tesseract engine is tesseract-ocr-w64-setup-v5.0.0-alpha.20201127. I hope this information helps you reproduce my problem.

Please let me know if you need any further information.  Thanks for your time again!!!

jbarlow83 commented 3 years ago

Unfortunately your attachment did not come through. I don't think Github will post email attachments. I believe you have to use the web interface to provide attached files.

Eyxxxxx commented 3 years ago

Reply.pdf test.zip Sorry, I'm new to github.

jbarlow83 commented 3 years ago

The parameter is actually --tesseract-pagesegmode, not --tesseract-psm.

If you create a pure image version of the file, Tesseract also inserts spaces when it should not. I cannot resolve the issue, because I rely on Tesseract to properly insert spaces.

For example, using the following file: input

And Tesseract

tesseract -l chi_sim input.png output pdf

Will give you a file with similar issues.

Please report the issue to github.com/tesseract-ocr/tesseract.

woaidianqian commented 3 years ago

--oem 1 --psm 6 -l chi_sim -c preserve_interword_spaces=1

The parameter preserve_interword_spaces=1 can fix this problem.

pdfocr.ocr(inputpath,'ocr-'+filename,language=language0,tesseract_oem=1,tesseract_pagesegmode=6)

SimonZh1234 commented 2 years ago

I have encountered the same problem. My ocrmypdf version is 9.6.0+dfsg on Ubuntu. I ran ocrmypdf -l chi_sim --sidecar test.txt test.pdf test.pdf.pdf as suggested; the text in test.txt is correct, but unexpected spaces appear in test.pdf.pdf. test.zip

jbarlow83 commented 2 years ago

I can't support or fix versions as old as 9.6.0.

SimonZh1234 commented 2 years ago

@jbarlow83 I have just installed the newest version (13.4.5) of ocrmypdf via pip on Ubuntu, but the problem persists: ocrmypdf -l chi_sim --sidecar test.txt test.pdf test.pdf.pdf gives a correct test.txt, but test.pdf.pdf contains extra spaces. test.zip

Kder commented 2 years ago

I encountered the same issue. The --sidecar txt was correct, but the output PDF contained extra spaces. My environment is Windows 11, ocrmypdf version 13.4.7, tesseract v5.1.0.20220510.

jbarlow83 commented 2 years ago

Extra spaces in words are usually a PDF viewer issue. This is partly because PDF viewers have to decide where word breaks are, and sometimes they don't do this well. Try a different PDF viewer. In particular, check Adobe Reader.

SimonZh1234 commented 2 years ago

@jbarlow83 Thanks for your advice, but I have tried Adobe Reader, Foxit Reader, Xodo and evince with this example, and NONE of them can copy the text WITHOUT spaces. Would you be able to try the zip file I uploaded?

cliveparkinson commented 2 years ago

I am experiencing the same issue of additional spaces in chi_sim text on a Mac running version 13.7.0.

边疆既是一个地域概念,也是一个政治概念。就地域层面而 言,是指国家毗连边界线、与内地 〈内陆、内海) 相对而言的区 域。一般而言,历史上中国的边疆是在秦统一中原、其重心部分 形成之后确立的,有着两千多年的历史沿革。相应地,中国的边 疆研究也有着悠久的历史和优良的传统,并与国家和边疆的安危 息息相关。

边疆 既是 一 个 地 域 概念 , 也 是 一 个 政治 概念 。 就 地 域 层 面 而 言 , 是 指 国家 毗连 边界 线 、 与 内 地 〈 内 陆 、 内 海 ) 相对 而 言 的 区 域 。 一 般 而 言 , 历 史上 中 国 的 边疆 是 在 秦 统 一 中 原 、 其 重心 部 分 形成 之 后 确立 的 , 有 着 两 千 多 年 的 历史 沿革 。 相 应 地 , 中 国 的 边 疆 研 究 也 有 着 悠久 的 历史 和 优良 的 传统 , 并 与 国家 和 边疆 的 安危 息息相关 。

I guess the issue has something to do with tokenization, as the characters connected without spaces are valid tokens.

liblaf commented 1 year ago

Any workaround to get rid of spaces? 👀
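For text that has already been copied or extracted from the PDF, one possible post-processing workaround is to strip whitespace that sits between two CJK characters while leaving spaces between Latin words alone. This is only a sketch under stated assumptions: `strip_cjk_gaps` is a hypothetical helper, the Unicode ranges are an assumed approximation of "CJK", and it does not change the text layer inside the PDF itself. Spaces next to ASCII punctuation (for example around an ASCII comma) are deliberately left untouched.

```python
import re

# Assumed character ranges: CJK punctuation, CJK Unified Ideographs,
# and half-width/full-width forms. Rarer CJK blocks are not covered.
_CJK = r"\u3001-\u303f\u4e00-\u9fff\uff00-\uffef"

# Spaces or tabs with a CJK character immediately on both sides.
_CJK_GAP = re.compile(rf"(?<=[{_CJK}])[ \t]+(?=[{_CJK}])")


def strip_cjk_gaps(text: str) -> str:
    """Remove whitespace between adjacent CJK characters only."""
    return _CJK_GAP.sub("", text)


if __name__ == "__main__":
    sample = "there is no space 每 个 字 之 间 都 有 多 余 的 空 格 。"
    print(strip_cjk_gaps(sample))
```

Because the pattern requires a CJK character on both sides of the gap, the spaces inside "there is no space" and the space between the English and Chinese parts are preserved.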

ZetaLin commented 1 year ago

I wrote an article in Chinese describing almost all of the possible solutions, though none of them solves the problem completely. Readers who do not know Chinese can use translation software to read it. Link: https://www.cnblogs.com/issacnew/p/17468697.html

jbarlow83 commented 1 year ago

The gist of the article above is that creating a tesseract config file with the contents preserve_interword_spaces 1 will improve output in some situations.
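As a rough illustration of the config-file approach, the sketch below writes a one-line Tesseract config file and builds an ocrmypdf command line that points at it. The helper names and the file name are hypothetical, and passing the file through the `--tesseract-config` option is an assumption about how one would wire it up; the command is only constructed here, not executed.

```python
from pathlib import Path


def write_tesseract_config(path):
    """Write a Tesseract config file that sets preserve_interword_spaces."""
    path = Path(path)
    # Tesseract config files use one "name value" pair per line.
    path.write_text("preserve_interword_spaces 1\n", encoding="utf-8")
    return path


def build_ocrmypdf_command(input_pdf, output_pdf, config_path, language="chi_sim"):
    """Build (but do not run) an ocrmypdf command using the config file."""
    return [
        "ocrmypdf",
        "-l", language,
        "--tesseract-config", str(config_path),  # assumed wiring, see lead-in
        str(input_pdf),
        str(output_pdf),
    ]
```

The resulting list could be handed to `subprocess.run`. As noted in this thread, this setting improves output only in some situations.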

@ZetaLin Please understand that the issue is due to Tesseract producing PDFs that some PDF readers do not interpret correctly, and no one has a solution at this time.

ZetaLin commented 1 year ago

Yes, I tested tesseract v5.3.1.20230401 like this:
tesseract input.png out -l chi_sim --oem 1 --psm 6 -c preserve_interword_spaces=1 pdf

I get the same result as with ocrmypdf: the output txt has no spaces, but the text copied from the PDF still has spaces.

So this problem comes from Tesseract, NOT ocrmypdf. More users need to know this.

ZetaLin commented 1 year ago

Currently, it seems that the only way to make ocrmypdf produce a PDF whose copied text has no spaces, and it is not a particularly good one, is to use OEM 0 (which uses the non-LSTM model and does not recognize text as well).

ocrmypdf -l chi_sim --tesseract-oem 0 input.pdf output.pdf

With this method, text copied directly from the PDF has no spaces, but some of the copied text is not recognized correctly.

This person's test confirmed my claim: https://github.com/tesseract-ocr/tesseract/issues/2814#issuecomment-622958621

hhiyorimi commented 11 months ago

Is there some way to solve it?

jbarlow83 commented 9 months ago

1191 input requested