opendatalab / MinerU

A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and e-book extraction.
https://opendatalab.com/OpenSourceTools
GNU Affero General Public License v3.0
11.38k stars 853 forks

Bug when running the new version: IndexError: index 10 is out of bounds for axis 0 with size 10 #627

Open Maple0709 opened 1 day ago

Maple0709 commented 1 day ago

Description of the bug

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/data/MinerU/app.py", line 61, in file_extract
    pipe.pipe_analyze(pdf_bytes, pdf_type)
  File "/data/MinerU/magic_pdf/pipe/UNIPipe.py", line 69, in pipe_analyze
    self.model_list = doc_analyze(pdf_bytes, self.ocr_custom_model, ocr=True,isimage=False,
  File "/data/MinerU/magic_pdf/model/doc_analyze_by_custom_model.py", line 136, in doc_analyze
    result = custom_model(img)
  File "/data/MinerU/magic_pdf/model/pdf_extract_kit.py", line 351, in __call__
    ocr_res = self.ocr_model.ocr(new_image, mfd_res=adjusted_mfdetrec_res)[0]
  File "/data/MinerU/magic_pdf/model/pek_sub_modules/self_modify.py", line 290, in ocr
    dt_boxes, rec_res, _ = self.__call__(img, cls, mfd_res=mfd_res)
  File "/data/MinerU/magic_pdf/model/pek_sub_modules/self_modify.py", line 371, in __call__
    rec_res, elapse = self.text_recognizer(img_crop_list)
  File "/opt/mineru_venv/lib/python3.10/site-packages/paddleocr/tools/infer/predict_rec.py", line 630, in __call__
    rec_res[indices[beg_img_no + rno]] = rec_result[rno]
IndexError: index 10 is out of bounds for axis 0 with size 10

How to reproduce the bug

In the new version, running the application with multiple threads raises IndexError: index 10 is out of bounds for axis 0 with size 10.

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.8.x

Device mode

cuda

myhloli commented 1 day ago

Does the PDF file that raises the error also trigger this problem when run in a single thread?
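
One way to check this is to run the same inputs once serially and once from a thread pool, then compare which inputs fail. The following is only a minimal sketch: `run_serial`, `run_threaded`, and `fake_extract` are hypothetical names, and `fake_extract` is a stand-in for whatever function actually runs the extraction (for example the `file_extract` target seen in the traceback). Passing a `threading.Lock` serializes the calls, which can help tell a thread-safety problem apart from a problem with a specific file.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_serial(func, items):
    """Call func on each item one at a time; return (results, errors)."""
    results, errors = [], []
    for item in items:
        try:
            results.append(func(item))
        except Exception as exc:  # collect failures instead of stopping
            errors.append((item, exc))
    return results, errors


def run_threaded(func, items, workers=4, lock=None):
    """Call func on items from a thread pool; return (results, errors).

    If lock is given, every call to func is made while holding it,
    so only one call runs at a time.
    """
    def guarded(item):
        if lock is None:
            return func(item)
        with lock:
            return func(item)

    results, errors = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [(item, pool.submit(guarded, item)) for item in items]
        for item, future in futures:
            try:
                results.append(future.result())
            except Exception as exc:
                errors.append((item, exc))
    return results, errors


def fake_extract(path):
    """Hypothetical stand-in for the real extraction call."""
    return path.upper()


if __name__ == "__main__":
    files = ["a.pdf", "b.pdf"]
    print(run_serial(fake_extract, files))
    print(run_threaded(fake_extract, files, workers=2))
    print(run_threaded(fake_extract, files, workers=2, lock=threading.Lock()))
```

If the error appears only in the unlocked threaded run, the failing file itself is probably not the cause.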

myhloli commented 1 day ago

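Recent versions changed some of the code in magic_pdf/model/pdf_extract_kit.py and magic_pdf/model/pek_sub_modules/self_modify.py. Looking at your traceback, the line numbers do not match the latest version. Please try updating to the latest version and testing again.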
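
To confirm which version is actually installed in the environment before retesting, a small sketch using the standard library's `importlib.metadata` may help. The distribution name `magic-pdf` is an assumption taken from the `magic-pdf --version` command in the issue template, and `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of dist_name, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "magic-pdf" is assumed to be the installed distribution name.
    print(installed_version("magic-pdf"))
```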