infiniflow / ragflow

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
https://ragflow.io
Apache License 2.0

[Bug]: Unable to parse ppt and pptx files (not missing *.so) #1210

Open fire717 opened 3 months ago

fire717 commented 3 months ago

Is there an existing issue for the same bug?

Branch name

main

Commit ID

8d667d5abdbd185e60140b22150851aa8e6a3ebe

Other environment information

No response

Actual behavior

Started the server from Docker, uploaded a .ppt file, then tried to parse it and got an error.

.pdf parses fine.
.ppt fails with: [ERROR]Internal server error: File is not a zip file
.pptx fails with: [ERROR]Internal server error: unsupported operand type(s) for //: NoneType and int
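The two messages point at different causes. A .pptx file is a ZIP container, while a legacy .ppt file is an OLE2 compound document, so a ZIP-based reader would be expected to reject it with "File is not a zip file". The sketch below is only an illustration of how the two could be told apart from the first bytes of an upload; `sniff_presentation_format` is a hypothetical helper, not part of RAGFlow.

```python
import io
import zipfile

# Signature at the start of every OLE2 compound document (legacy .ppt, .doc, .xls).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_presentation_format(data: bytes) -> str:
    """Hypothetical helper: classify an uploaded file by its content, not its extension."""
    if data[:8] == OLE2_MAGIC:
        return "ppt-legacy-ole2"
    # zipfile.is_zipfile accepts a file-like object, so no temp file is needed.
    if zipfile.is_zipfile(io.BytesIO(data)):
        return "zip-container"
    return "unknown"
```

A result of "zip-container" only says the bytes form a ZIP archive; it does not prove the archive is a valid .pptx.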

Expected behavior

No response

Steps to reproduce

Clone the main branch.
Then start the Docker containers and upload a .ppt file.

Additional information

No response

guoyuhao2330 commented 3 months ago

Can't reproduce.

stephenlzc commented 2 months ago

Same problem.

I set up RAGFlow following the method in the README, with no changes.

When I upload a .pptx file, parsing fails with an error.

Here is the log from the server:

Traceback (most recent call last):
  File "/ragflow/rag/svr/task_executor.py", line 146, in build
    cks = chunker.chunk(row["name"], binary=binary, from_page=row["from_page"],
  File "/ragflow/rag/app/presentation.py", line 105, in chunk
    for pn, (txt, img) in enumerate(ppt_parser(
  File "/ragflow/rag/app/presentation.py", line 27, in __call__
    txts = super().__call__(fnm, from_page, to_page)
  File "/ragflow/deepdoc/parser/ppt_parser.py", line 54, in __call__
    for shape in sorted(
  File "/ragflow/deepdoc/parser/ppt_parser.py", line 55, in <lambda>
    slide.shapes, key=lambda x: (x.top // 10, x.left)):
TypeError: unsupported operand type(s) for //: 'NoneType' and 'int'
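The traceback shows the sort key evaluating `x.top // 10` on a shape whose `top` is `None`, which presumably happens for shapes that carry no explicit position of their own. One possible workaround, shown here as a sketch and not as the project's actual fix, is a sort key that substitutes 0 for a missing coordinate. `shape_sort_key` is a hypothetical name, and the shapes are plain stand-in objects, not real python-pptx shapes.

```python
from types import SimpleNamespace


def shape_sort_key(shape):
    """Hypothetical None-tolerant version of the key used in the traceback."""
    # Treat a missing coordinate as 0 so // and tuple comparison never see None.
    top = shape.top if shape.top is not None else 0
    left = shape.left if shape.left is not None else 0
    return (top // 10, left)


# Stand-ins for slide shapes; the second one has no position, as in the failing file.
shapes = [
    SimpleNamespace(name="body", top=500, left=100),
    SimpleNamespace(name="no-position", top=None, left=None),
    SimpleNamespace(name="title", top=50, left=100),
]

ordered = [s.name for s in sorted(shapes, key=shape_sort_key)]
```

Defaulting to 0 sorts position-less shapes first; whether that reading order is acceptable, or whether such shapes should inherit a position from elsewhere, is a design choice this sketch does not settle.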

My machine has an i7-12600, 64 GB RAM, and a 4060 Ti 16 GB. Ubuntu Server and Docker are installed correctly, and the other containers are all running well.