Open rplescia opened 1 month ago
Could you attach the file so I can debug it?
Unfortunately, I cannot send the exact document because it is confidential. I will see if I can find a sample document that exhibits the same behaviour. The type of document I'm using is a facility agreement, like this one: https://assets.publishing.service.gov.uk/media/5a7f05b0e5274a2e8ab49acc/facility-agreement.pdf or this one: https://www.sec.gov/Archives/edgar/data/1415016/000119312514260282/d699526dex99b25.htm
@KevinHuSh If it is any help, the same error occurs when chunking the documents using the 'One' and 'Manual' methods. I still haven't figured out what about the document is causing the issue, but I have managed to find the piece of code that produces the error. My guess is that one of the parameters is being passed a string where an int value is expected.
    try:
        cks = chunker.chunk(row["name"], binary=binary, from_page=row["from_page"],
                            to_page=row["to_page"], lang=row["language"], callback=callback,
                            kb_id=row["kb_id"], parser_config=row["parser_config"],
                            tenant_id=row["tenant_id"])
        cron_logger.info(
            "Chunking({}) {}/{}".format(timer() - st, row["location"], row["name"]))
    except Exception as e:
        callback(-1, "Internal server error while chunking: %s" % str(e).replace("'", ""))
        cron_logger.error(
            "Chunking {}/{}: {}".format(row["location"], row["name"], str(e)))
        traceback.print_exc()
        return
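One way to test the string-versus-int guess is to normalise the page bounds before they reach the chunker. The sketch below is only an illustration of that idea: `coerce_page_range` is a hypothetical helper, not part of the project, and the default bounds are assumptions. Only the `from_page` and `to_page` keys come from the snippet above.

```python
def coerce_page_range(row, default_from=0, default_to=100000):
    """Return (from_page, to_page) as ints.

    Hypothetical helper: the defaults are assumptions chosen for this
    sketch, not values taken from the project.
    """
    def to_int(key, default):
        value = row.get(key)
        # Treat a missing or empty value as "use the default".
        if value is None or value == "":
            return default
        try:
            # int() accepts ints and numeric strings such as "12".
            return int(value)
        except (TypeError, ValueError):
            # Name the offending field so the log points at the real cause.
            raise ValueError(
                "{} must be an integer, got {!r} ({})".format(
                    key, value, type(value).__name__))

    return to_int("from_page", default_from), to_int("to_page", default_to)


# Example: string values, as they might arrive from a task row.
from_page, to_page = coerce_page_range({"from_page": "0", "to_page": "12"})
```

If the error disappears once the coerced values are passed to the chunker, the guess is probably right. If it persists, the mismatch is more likely inside the DOCX parsing path itself.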
Is there an existing issue for the same bug?
Branch name
main
Commit ID
na
Other environment information
No response
Actual behavior
When the chunking method is set to "Laws," I cannot parse an MS Word (DOCX) document. I have tried different embedding models and chunking parameters, but it still fails. When the same document is converted to PDF format, it parses fine.
Expected behavior
No response
Steps to reproduce
Additional information
No response