hiroi-sora / Umi-OCR

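OCR software, free and offline. Open-source, free offline OCR software. Supports screenshots and batch image import, PDF document recognition, excluding watermarks and headers/footers, and scanning/generating QR codes. Built-in multi-language libraries.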
MIT License
23.04k stars, 2.35k forks

Traditional Chinese recognition anomaly: the exported PDF looks perfect, but the data contains many extra garbled characters. #516

Closed. maxin9966 closed this issue 1 month ago

maxin9966 commented 1 month ago

Issues

Umi-OCR version

2.1.1

Windows version

win11

OCR plugins used

No response

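Reproduction steps

PDF conversion; Traditional Chinese; multi-column, line breaks by natural paragraph; full-page forced OCR

Problem description:

Recognition only succeeds with [full-page forced OCR]; every other mode exports blank output. With [full-page forced OCR] the following problem appears: the exported PDF looks perfect, but in the txt or json raw data, most sentences end with some garbled characters. Details are shown in the screenshot below.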

Problem screenshots or related files (optional)

The PDF exported from recognition displays perfectly.

image

But in the raw data, every sentence has a few extra characters.

{ "code": 100, "data": [{ "box": [ [61.58888888888889, 90.59814814814816], [155.3111111111111, 90.15185185185186], [155.3111111111111, 101.75555555555556], [61.58888888888889, 102.20185185185186] ], "score": 0.8367864489555359, "text": "檢視當時的想法國", "from": "ocr", "end": "\n" }, { "box": [ [84.79629629629629, 121.83888888888889], [418.17962962962963, 121.83888888888889], [418.17962962962963, 131.65740740740742], [84.79629629629629, 131.65740740740742] ], "score": 0.8287469744682312, "text": "發覺情緒後先停下來·檢視自己當時的想法為何·求思考這樣的想法對間砲砲", "from": "ocr", "end": "\n" }, { "box": [ [62.92777777777778, 143.7074074074074], [404.34444444444443, 143.7074074074074], [404.34444444444443, 152.63333333333333], [62.92777777777778, 152.63333333333333] ], "score": 0.8929890394210815, "text": "題是否有幫助·提醒自己·是否願意讓不佳的情緒影響對孩子的問題處理:", "from": "ocr", "end": "" }, { "box": [ [62.035185185185185, 174.50185185185185], [191.4611111111111, 173.60925925925926], [191.4611111111111, 182.9814814814815], [62.035185185185185, 183.8740740740741] ], "score": 0.8536674976348877, "text": "修正不適應的歸因想法祐", "from": "ocr", "end": "\n" }, { "box": [ [83.9037037037037, 205.7425925925926], [415.94814814814816, 205.7425925925926], [415.94814814814816, 214.22222222222223], [83.9037037037037, 214.22222222222223] ], "score": 0.4056791663169861, "text": "·多主裂器晶1已月鍵·孚詳節口舉5具早彈(T呈雲·彩歌詳電器旱送為", "from": "ocr", "end": "\n" }, { "box": [ [62.92777777777778, 225.82592592592593], [215.5611111111111, 225.82592592592593], [215.5611111111111, 235.64444444444445], [62.92777777777778, 235.64444444444445] ], "score": 0.8782615661621094, "text": "再重新面對孩子的問題立做處理·", "from": "ocr", "end": " " }, { "box": [ [62.92777777777778, 252.6037037037037], [221.80925925925925, 252.6037037037037], [221.80925925925925, 268.6703703703704], [62.92777777777778, 268.6703703703704] ], "score": 0.8285078406333923, "text": "大、如何用適應的歸因想法國國", "from": "ocr", "end": "\n" }, { "box": [ [102.64814814814815, 290.5388888888889], [182.53518518518518, 290.5388888888889], 
[182.53518518518518, 299.9111111111111], [102.64814814814815, 299.9111111111111] ], "score": 0.8204410076141357, "text": "不適應的歸因想法砲", "from": "ocr", "end": "\n" }, { "box": [ [65.60555555555555, 306.1592592592593], [172.2703703703704, 306.1592592592593], [172.2703703703704, 315.5314814814815], [65.60555555555555, 315.5314814814815] ], "score": 0.869182825088501, "text": "這個孩子怎麼這麼不乖?", "from": "ocr", "end": "" }, { "box": [ [65.60555555555555, 321.77962962962965], [152.63333333333333, 321.77962962962965], [152.63333333333333, 331.15185185185186], [65.60555555555555, 331.15185185185186] ], "score": 0.9081498980522156, "text": "他根本就是故意的!國", "from": "ocr", "end": "\n" }, { "box": [ [65.60555555555555, 337.4], [182.53518518518518, 337.4], [182.53518518518518, 346.77222222222224], [65.60555555555555, 346.77222222222224] ], "score": 0.7975088357925415, "text": "我對這個孩子實在沒撤了!發〇", "from": "ocr", "end": "\n" }, { "box": [ [65.60555555555555, 359.7148148148148], [216.9, 359.7148148148148], [216.9, 369.087037037037], [65.60555555555555, 369.087037037037] ], "score": 0.876326858997345, "text": "除了吃藥·應該沒有其他的法子了!發〇", "from": "ocr", "end": "\n" }, { "box": [ [65.60555555555555, 388.27777777777777], [181.1962962962963, 388.27777777777777], [181.1962962962963, 397.65], [65.60555555555555, 397.65] ], "score": 0.8257303237915039, "text": "這個孩子是有門缺陷(的·", "from": "ocr", "end": " " }, { "box": [ [65.60555555555555, 410.5925925925926], [171.82407407407408, 410.5925925925926], [171.82407407407408, 420.4111111111111], [65.60555555555555, 420.4111111111111] ], "score": 0.8381812572479248, "text": "這個孩子什麼都做不好·嶋", "from": "ocr", "end": "" }, { "box": [ [65.60555555555555, 426.212962962963], [162.00555555555556, 426.212962962963], [162.00555555555556, 436.0314814814815], [65.60555555555555, 436.0314814814815] ], "score": 0.9026196002960205, "text": "我真是個失敗的父母!砲", "from": "ocr", "end": "\n" }, { "box": [ [64.71296296296296, 442.72592592592594], [191.4611111111111, 442.72592592592594], 
[191.4611111111111, 452.09814814814814], [64.71296296296296, 452.09814814814814] ], "score": 0.9279829263687134, "text": "這個孩子會這樣都是我的錯!砲", "from": "ocr", "end": "" }, { "box": [ [61.58888888888889, 467.27222222222224], [146.8314814814815, 470.3962962962963], [146.38518518518518, 489.587037037037], [60.696296296296296, 486.01666666666665] ], "score": 0.6885399222373962, "text": "大大作業練習區", "from": "ocr", "end": "\n" }, { "box": [ [286.52222222222224, 290.5388888888889], [355.69814814814816, 290.5388888888889], [355.69814814814816, 299.9111111111111], [286.52222222222224, 299.9111111111111] ], "score": 0.7950518131256104, "text": "適應的歸囚想法發", "from": "ocr", "end": "\n" }, { "box": [ [227.1648148148148, 306.1592592592593], [382.47592592592594, 306.1592592592593], [382.47592592592594, 315.0851851851852], [227.1648148148148, 315.0851851851852] ], "score": 0.9082364439964294, "text": "很多事情不是這個孩子能約控制的+", "from": "ocr", "end": "\n" }, { "box": [ [227.1648148148148, 321.77962962962965], [413.27037037037036, 321.77962962962965], [413.27037037037036, 330.7055555555556], [227.1648148148148, 330.7055555555556] ], "score": 0.8847109079360962, "text": "他其實也不是故意·這些都是立狀造成的·國", "from": "ocr", "end": "" }, { "box": [ [227.61111111111111, 337.8462962962963], [413.27037037037036, 337.8462962962963], [413.27037037037036, 346.77222222222224], [227.61111111111111, 346.77222222222224] ], "score": 0.9650014638900757, "text": "應該有其他的方法來解決·我該再試看看·", "from": "ocr", "end": " " }, { "box": [ [226.2722222222222, 353.02037037037036], [415.94814814814816, 352.1277777777778], [415.94814814814816, 361.9462962962963], [226.2722222222222, 362.3925925925926] ], "score": 0.9278918504714966, "text": "吃藥尺是治療計畫的一個部分·而非下答", "from": "ocr", "end": "" }, { "box": [ [227.1648148148148, 366.4092592592593], [253.9425925925926, 366.4092592592593], [253.9425925925926, 376.22777777777776], [227.1648148148148, 376.22777777777776] ], "score": 0.22002696990966797, "text": "業北·", "from": "ocr", "end": "\n" }, { "box": [ 
[227.1648148148148, 382.02962962962965], [414.60925925925926, 382.02962962962965], [414.60925925925926, 390.9555555555556], [227.1648148148148, 390.9555555555556] ], "score": 0.8889887928962708, "text": "我該接受孩子真實的樣子·其實他也有很多園業", "from": "ocr", "end": "" }, { "box": [ [227.1648148148148, 394.97222222222223], [264.2074074074074, 394.97222222222223], [264.2074074074074, 404.7907407407408], [227.1648148148148, 404.7907407407408] ], "score": 0.8324491381645203, "text": "優點的·國", "from": "ocr", "end": "\n" }, { "box": [ [227.1648148148148, 410.5925925925926], [413.27037037037036, 410.5925925925926], [413.27037037037036, 420.4111111111111], [227.1648148148148, 420.4111111111111] ], "score": 0.9247992634773254, "text": "我應該著重孩子的優點·列尺看他的缺點、", "from": "ocr", "end": "\n" }, { "box": [ [227.1648148148148, 426.212962962963], [393.6333333333333, 426.212962962963], [393.6333333333333, 436.0314814814815], [227.1648148148148, 436.0314814814815] ], "score": 0.9061799645423889, "text": "這個孩子比起其他孩子是更具挑戰的“國", "from": "ocr", "end": "" }, { "box": [ [226.2722222222222, 441.8333333333333], [342.7555555555556, 441.8333333333333], [342.7555555555556, 450.75925925925924], [226.2722222222222, 450.75925925925924] ], "score": 0.9286754131317139, "text": "誰都不知道孩子會出問題、", "from": "ocr", "end": "\n" }, { "box": [ [83.9037037037037, 503.8685185185185], [418.6259259259259, 503.8685185185185], [418.6259259259259, 513.2407407407408], [83.9037037037037, 513.2407407407408] ], "score": 0.9499539136886597, "text": "現在我們已經知道了感受與想法間的關聯性·在這一個星期中·我們可以佔", "from": "ocr", "end": "\n" }, { "box": [ [62.92777777777778, 524.8444444444444], [416.8407407407407, 524.8444444444444], [416.8407407407407, 533.7703703703704], [62.92777777777778, 533.7703703703704] ], "score": 0.9447243809700012, "text": "試著當面對孩子的間題情境而引發負面情緒時·去辦識當時的想法及其合理性·", "from": "ocr", "end": " " }, { "box": [ [62.92777777777778, 545.3740740740741], [415.94814814814816, 545.3740740740741], [415.94814814814816, 554.7462962962964], [62.92777777777778, 
554.7462962962964] ], "score": 0.9341756701469421, "text": "芷思考和取代以較合宜的想法·之後再感受看看是否能讓自已的情緒較為緩和·", "from": "ocr", "end": " " }, { "box": [ [63.37407407407407, 565.4574074074075], [194.13888888888889, 565.4574074074075], [194.13888888888889, 575.2759259259259], [63.37407407407407, 575.2759259259259] ], "score": 0.9160681962966919, "text": "更能理性的面對孩子的間題!", "from": "ocr", "end": "\n" }, { "box": [ [37.93518518518518, 617.674074074074], [293.662962962963, 617.674074074074], [293.662962962963, 626.6], [37.93518518518518, 626.6] ], "score": 0.7932961583137512, "text": "oi2JADHD兒童認知行為親子團體治療·父母手冊·(精簡版)臨", "from": "ocr", "end": "\n" }], "time": 0.9359145164489746, "timestamp": 1716194101.0420148, "page": 14, "fileName": "14", "path": "C:/Users/ma/Desktop/txt/ADHD兒童認知行為親子團體治療:父母手冊(精簡版).pdf" }
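For readers who want to inspect raw data like the above, here is a minimal Python sketch that joins the blocks into plain text and sets aside low-confidence ones. The field names `text`, `score`, and `end` come from the JSON above; the function name `join_blocks` and the 0.5 threshold are assumptions made for illustration and are not part of Umi-OCR. The sample reuses three blocks from the data with the `box` field omitted.

```python
def join_blocks(result, min_score=0.5):
    """Join OCR blocks into one string, using each block's "end" as separator.

    Blocks whose score is below min_score (an assumed threshold) are not
    joined; their text is returned separately so it can be reviewed.
    """
    kept, flagged = [], []
    for block in result.get("data", []):
        if block["score"] < min_score:
            flagged.append(block["text"])
            continue
        kept.append(block["text"] + block.get("end", ""))
    return "".join(kept), flagged


# Three blocks copied from the raw data above ("box" omitted for brevity).
sample = {
    "code": 100,
    "data": [
        {"score": 0.9278918504714966, "text": "吃藥尺是治療計畫的一個部分·而非下答", "end": ""},
        {"score": 0.22002696990966797, "text": "業北·", "end": "\n"},
        {"score": 0.8324491381645203, "text": "優點的·國", "end": "\n"},
    ],
}

text, flagged = join_blocks(sample)
print(text)
print("flagged for review:", flagged)
```

A score filter like this only catches lines that are low-confidence as a whole. It does not remove stray characters appended to an otherwise high-scoring line, such as the trailing "國" in the third block.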

maxin9966 commented 1 month ago

Oh, I see now. It turns out this is a multi-layer PDF; what I was looking at is the original image overlaid on top.

maxin9966 commented 1 month ago

Is there a good solution for this recognition problem?

hiroi-sora commented 1 month ago

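You can use the ignore-region feature (click the file name in the table to open its settings, then right-click and drag to create a selection) to mark everything outside the main content as ignore regions. Repeated headers and footers can be marked out as well. This reduces interference from unrelated text in the recognition results.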

image
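The idea behind an ignore region can also be illustrated as post-processing on exported JSON. The following Python sketch is not Umi-OCR's implementation, and the application's own rules (for example, how partial overlaps are treated) may differ. The helper names and the footer rectangle `(0, 600, 500, 700)` are hypothetical; the two sample blocks are taken from the raw data earlier in this thread, with coordinates rounded.

```python
def box_bounds(box):
    """Return (x0, y0, x1, y1) for a four-point box [[x, y], ...]."""
    xs = [p[0] for p in box]
    ys = [p[1] for p in box]
    return min(xs), min(ys), max(xs), max(ys)


def inside(box, region):
    """True if the box lies entirely within region = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box_bounds(box)
    rx0, ry0, rx1, ry1 = region
    return rx0 <= x0 and ry0 <= y0 and x1 <= rx1 and y1 <= ry1


def drop_ignored(blocks, ignore_regions):
    """Keep only blocks that are not fully contained in any ignore region."""
    return [
        b for b in blocks
        if not any(inside(b["box"], r) for r in ignore_regions)
    ]


# Two blocks from the page above: a body line and the page footer.
blocks = [
    {"box": [[63.4, 565.5], [194.1, 565.5], [194.1, 575.3], [63.4, 575.3]],
     "text": "更能理性的面對孩子的間題!"},
    {"box": [[37.9, 617.7], [293.7, 617.7], [293.7, 626.6], [37.9, 626.6]],
     "text": "oi2JADHD兒童認知行為親子團體治療·父母手冊·(精簡版)臨"},
]

footer_region = (0, 600, 500, 700)  # hypothetical band covering the footer
kept = drop_ignored(blocks, [footer_region])
print([b["text"] for b in kept])
```

Requiring full containment is a conservative choice: a body line that only partly overlaps the region is kept rather than silently dropped.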