simon987 / sist2

Lightning-fast file system indexer and search tool
GNU General Public License v3.0

Extra spaces in OCR results for Chinese text #443

Closed. ffchung closed this issue 9 months ago

ffchung commented 9 months ago

Which SIST2 component is your Feature Request related to?

Scan, with image OCR and ebook OCR enabled

Is your feature request related to a problem? Please describe.

Reference: https://github.com/tesseract-ocr/tesseract/issues/991

I need to pass the setting preserve_interword_spaces=1 to tesseract.
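For illustration, here is a minimal sketch of how that setting can be passed on the Tesseract command line through the `-c VAR=VALUE` option. The helper name `build_tesseract_cmd`, the `page.png` path, and the `chi_sim` language model are assumptions made for the example. This is not how sist2 invokes Tesseract internally.

```python
from typing import List


def build_tesseract_cmd(image_path: str, lang: str = "chi_sim") -> List[str]:
    """Build a Tesseract CLI argument list that sets preserve_interword_spaces=1.

    Hypothetical helper for illustration; it only assembles the arguments
    and does not run Tesseract.
    """
    return [
        "tesseract",
        image_path,  # input image (path supplied by the caller)
        "stdout",    # write the recognised text to standard output
        "-l", lang,  # language model; chi_sim is an assumption
        "-c", "preserve_interword_spaces=1",  # the setting requested in this issue
    ]


if __name__ == "__main__":
    # Print the command so it can be copied into a shell.
    print(" ".join(build_tesseract_cmd("page.png")))
```

The resulting list can be handed to `subprocess.run` on a machine where Tesseract and the chosen language data are installed.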

What would you like to see happen?

Chinese OCR output without the extra spaces between characters.

Additional context

simon987 commented 9 months ago

Thanks,

Fixed in 2936240


Before:

伦敦 楼 房 发 生火 灾 中 使 馆 关 注 : 暂 无 中 国 公民 受伤

After:

伦敦楼房发生火灾中使馆关注 : 暂无中国公民受伤
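For text that was already indexed without the setting, one possible workaround is to strip the gaps afterwards. The sketch below is not the fix from the commit above. It is a hypothetical post-processing helper (`strip_cjk_gaps`) that removes whitespace only where it sits directly between two characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF), so spacing around punctuation and Latin text is left alone.

```python
import re

# Whitespace directly between two CJK Unified Ideographs (U+4E00..U+9FFF).
# Other CJK ranges (extensions, kana, hangul) are deliberately not covered.
_CJK_GAP = re.compile(r"(?<=[\u4e00-\u9fff])\s+(?=[\u4e00-\u9fff])")


def strip_cjk_gaps(text: str) -> str:
    """Remove whitespace that separates two adjacent CJK ideographs."""
    return _CJK_GAP.sub("", text)


if __name__ == "__main__":
    before = "伦敦 楼 房 发 生火 灾 中 使 馆 关 注 : 暂 无 中 国 公民 受伤"
    print(strip_cjk_gaps(before))
```

Run on the "Before" line from this comment, the helper yields the "After" line, with the spaces around the colon preserved.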