infiniflow / ragflow

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
https://ragflow.io
Apache License 2.0
18.18k stars 1.84k forks

[Bug]: pdf parser IndexError: string index out of range #2559

Open wangguo1230 opened 3 days ago

wangguo1230 commented 3 days ago

Is there an existing issue for the same bug?

  • [x] I have checked the existing issues.

Branch name

main

Commit ID

0cb588f7

Other environment information

OS type: Windows 11

Actual behavior

An error occurred while parsing the PDF:

Traceback (most recent call last):
  File "D:\pythonprojects\ragflow\deepdoc\parser\pdf_parser.py", line 1175, in <module>
    test = ragflow("C:\Users\wangg\Desktop\3.pdf")
  File "D:\pythonprojects\ragflow\deepdoc\parser\pdf_parser.py", line 1031, in __call__
    self._concat_downward()
  File "D:\pythonprojects\ragflow\deepdoc\parser\pdf_parser.py", line 516, in _concat_downward
    dfs(boxes[0], 1)
  File "D:\pythonprojects\ragflow\deepdoc\parser\pdf_parser.py", line 507, in dfs
    fea = self._updown_concat_features(up, down)
  File "D:\pythonprojects\ragflow\deepdoc\parser\pdf_parser.py", line 113, in _updown_concat_features
    up["text"][-1] + down["text"][0]) else "") \
IndexError: string index out of range
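
The last frame indexes `up["text"][-1]` and `down["text"][0]`. In Python, indexing an empty string raises exactly this `IndexError`, so one plausible cause is a text box whose `"text"` is empty. The sketch below reproduces the pattern in isolation and shows one possible guard. The function names `unguarded` and `guarded` and the sample `boxes` are hypothetical illustrations, not RAGFlow code, and whether an empty box is really what 3.pdf triggers is an assumption.

```python
def unguarded(up: dict, down: dict) -> str:
    # Same indexing pattern as the failing expression in the traceback.
    return up["text"][-1] + down["text"][0]


def guarded(up: dict, down: dict) -> str:
    # Hypothetical guard: return "" when either box has no text
    # instead of indexing into an empty string.
    up_text = up.get("text") or ""
    down_text = down.get("text") or ""
    if not up_text or not down_text:
        return ""
    return up_text[-1] + down_text[0]


# Hypothetical sample boxes; the second one has empty text.
boxes = [{"text": "end of line"}, {"text": ""}]

try:
    unguarded(boxes[0], boxes[1])
    raised = False
except IndexError:
    raised = True

print("unguarded raised IndexError:", raised)
print("guarded result:", repr(guarded(boxes[0], boxes[1])))
```

A guard like this only avoids the crash. Whether empty boxes should be filtered out earlier in the parser is a separate design question for the maintainers.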

Expected behavior

No response

Steps to reproduce

```python
# Run from the bottom of deepdoc/parser/pdf_parser.py, per the traceback above.
if __name__ == "__main__":
    ragflow = RAGFlowPdfParser()
    test = ragflow("3.pdf")
```

Additional information

[Uploading 3.pdf…]()
Feiue commented 2 days ago

PDF link is incorrect.

wangguo1230 commented 1 day ago

Sorry, the file is this one: 6.pdf @Feiue