infiniflow / ragflow

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding.
https://ragflow.io
Apache License 2.0

[Bug]: Error when parsing .DOCX files when chunking method is set to Laws #3091

Open rplescia opened 1 month ago

rplescia commented 1 month ago

Is there an existing issue for the same bug?

Branch name

main

Commit ID

na

Other environment information

No response

Actual behavior

When the chunking method is set to "Laws", I cannot parse an MS Word (DOCX) document. I have tried different embedding models and chunking parameters, but it still fails. When the document is converted to PDF format, it parses fine.

Expected behavior

No response

Steps to reproduce

Set up a new knowledge base with the default chunking method set to Laws.
Upload a DOCX file and start parsing.

Additional information

No response

KevinHuSh commented 1 month ago

Could you attach the file so I can debug it?

rplescia commented 1 month ago

Unfortunately, I cannot send the exact document because it is confidential. I will see if I can find a sample document that exhibits the same behaviour. The type of document I'm using is a facility agreement, like this https://assets.publishing.service.gov.uk/media/5a7f05b0e5274a2e8ab49acc/facility-agreement.pdf or this https://www.sec.gov/Archives/edgar/data/1415016/000119312514260282/d699526dex99b25.htm

rplescia commented 3 weeks ago

@KevinHuSh If it is any help, the same error occurs when chunking the documents using the 'One' and 'Manual' methods. I still haven't figured out what about the document is causing the issue, but I have managed to find the piece of code that produces the error. My guess is that one of the parameters is being passed a string where an int value is expected.

try:
    cks = chunker.chunk(row["name"], binary=binary, from_page=row["from_page"],
                        to_page=row["to_page"], lang=row["language"], callback=callback,
                        kb_id=row["kb_id"], parser_config=row["parser_config"],
                        tenant_id=row["tenant_id"])
    cron_logger.info(
        "Chunking({}) {}/{}".format(timer() - st, row["location"], row["name"]))
except Exception as e:
    callback(-1, "Internal server error while chunking: %s" % str(e).replace("'", ""))
    cron_logger.error(
        "Chunking {}/{}: {}".format(row["location"], row["name"], str(e)))
    traceback.print_exc()
    return
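
One way to test the string-versus-int guess would be to record the type of each value just before it is handed to chunker.chunk, and to keep the full stack trace rather than only str(e). The following is a minimal sketch, assuming the task row behaves like a plain dict with the keys used in the snippet above. The helper names (describe_arg_types, non_int_page_keys, full_traceback) are hypothetical and not part of RAGFlow.

```python
import traceback

# Keys taken from the chunker.chunk call in the snippet above.
CHUNK_ARG_KEYS = ("from_page", "to_page", "language", "kb_id", "tenant_id")


def describe_arg_types(row, keys=CHUNK_ARG_KEYS):
    """Map each key to the type name of its value in the row (hypothetical helper)."""
    return {k: type(row.get(k)).__name__ for k in keys}


def non_int_page_keys(row):
    """Return the page-range keys whose values are not plain ints."""
    return [k for k in ("from_page", "to_page") if type(row.get(k)) is not int]


def full_traceback(exc):
    """Render the complete stack trace of an exception as one string."""
    return "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))


if __name__ == "__main__":
    # Made-up row for illustration only; real rows come from the task queue.
    sample_row = {"from_page": 0, "to_page": "100", "language": "English",
                  "kb_id": "kb-1", "tenant_id": "t-1"}
    print(describe_arg_types(sample_row))
    print("Suspicious page keys:", non_int_page_keys(sample_row))
```

Logging the result of describe_arg_types(row) before the try block, and full_traceback(e) inside the except block, would show whether a str really arrives where an int is expected and on which line the failure is raised.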