hiroi-sora / Umi-OCR

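Free, open-source, offline OCR software. Supports screenshots and batch image import, PDF document recognition, exclusion of watermarks and headers/footers, and scanning/generating QR codes. Multi-language libraries are built in.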
MIT License
27.12k stars 2.72k forks

Error when using Pix2Text 1.0 #625

Closed qwer666qwer closed 2 months ago

qwer666qwer commented 2 months ago

Issues

Umi-OCR version

Tested on both 2.1.3 and 2.1.2

Windows version

win11

OCR plugins used

Pix2Text

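Reproduction steps

With the settings shown in the screenshot, OCR of this book hangs partway through, and there is still no response after a long wait. The error shown in the CLI is: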

[Error] 异步运行发生错误: Traceback (most recent call last):
  File "Umi-OCR_Paddle_v2.1.3\UmiOCR-data\py_src\utils\thread_pool.py", line 22, in run
    self._taskFunc(*self._args, **self._kwargs)
  File "Umi-OCR_Paddle_v2.1.3\UmiOCR-data\py_src\mission\mission.py", line 238, in _taskRun
    res = self.msnTask(msnInfo, msn)
  File "Umi-OCR_Paddle_v2.1.3\UmiOCR-data\py_src\mission\mission_doc.py", line 262, in msnTask
    tbs = tbpu.run(tbs)
  File "Umi-OCR_Paddle_v2.1.3\UmiOCR-data\py_src\ocr\tbpu\parser_multi_para.py", line 30, in run
    self.pp.run(tbs)  # 预测结尾分隔符
  File "Umi-OCR_Paddle_v2.1.3\UmiOCR-data\py_src\ocr\tbpu\parser_tools\paragraph_parse.py", line 61, in run
    units = self._get_units(text_blocks, self.get_info)
  File "Umi-OCR_Paddle_v2.1.3\UmiOCR-data\py_src\ocr\tbpu\parser_tools\paragraph_parse.py", line 72, in _get_units
    units.append((bbox, (text[0], text[-1]), tb))
IndexError: string index out of range

Problem screenshots or related files (optional)

测度论与概率论基础 (程士宏编著) (Z-Library).pdf

hiroi-sora commented 2 months ago

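Thanks for reporting this. The exception is caused by non-standard output items from P2T. You can fix the bug by patching the code manually:

Open UmiOCR-data\py_src\ocr\tbpu\parser_tools\line_preprocessing.py

After line 85, where the linePreprocessing function begins, add one line of code: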

def linePreprocessing(textBlocks):
    textBlocks = [i for i in textBlocks if i.get("text", False)]

As shown in the screenshot:

image

This fix will be included in the next release.
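
For anyone wondering why that one line helps, here is a minimal standalone sketch. It is not the actual Umi-OCR code. It assumes text blocks are dicts with a "text" key (inferred from the i.get("text", False) call in the patch), and the helper names and sample data are made up for illustration. The traceback fails on text[0] and text[-1], which raise IndexError when text is an empty string, so dropping empty blocks up front avoids the crash.

```python
# Illustrative sketch only: helper names and sample blocks are hypothetical,
# not taken from the Umi-OCR source.

def first_last_chars(text_blocks):
    """Mimic the failing step in the traceback: read text[0] and text[-1]."""
    return [(tb["text"][0], tb["text"][-1]) for tb in text_blocks]


def drop_empty_blocks(text_blocks):
    """Same filter as the patch: keep only blocks whose text is non-empty."""
    return [tb for tb in text_blocks if tb.get("text", False)]


blocks = [
    {"text": "Measure theory"},
    {"text": ""},     # empty string: indexing it raises IndexError
    {"box": []},      # hypothetical block with no "text" key at all
    {"text": "Probability"},
]

try:
    first_last_chars(blocks)
except IndexError as err:
    print("unfiltered:", err)  # unfiltered: string index out of range

print(first_last_chars(drop_empty_blocks(blocks)))  # [('M', 'y'), ('P', 'y')]
```

Using .get("text", False) rather than i["text"] means a block that lacks the key entirely is also dropped instead of raising KeyError.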

qwer666qwer commented 2 months ago

Thanks a lot for the fix!