harshankur / officeParser

A Node.js library to parse text out of any office file. Currently supports docx, pptx, xlsx, odt, odp, and ods.
MIT License

Your file officeParserTemp/tempfiles/x.docx seems to be corrupted. #28

Closed: p-payet closed this issue 6 months ago

p-payet commented 6 months ago

Hello,

I encounter this error when I try to extract the text from a docx document:

[OfficeParser]: Your file officeParserTemp/tempfiles/x.docx seems to be corrupted. If you are sure it is fine, please create a ticket in Issues on github with the file to reproduce error.

Here is the document:

test.docx

I have no problem opening the document in Word as well as in Libre Office.

Thanks in advance, regards.

harshankur commented 6 months ago

Hi @OGPayet, I checked your file, and it is a bit unusual. All docx specifications say that the main content file is word/document.xml, but your file has word/document2.xml instead. I also noticed that after I make any small change, such as adding a letter somewhere in the document, and save it in MS Word, the file is written with the expected word/document.xml rather than word/document2.xml. So the original file is certainly not in the correct format. However, since Word and Libre Office both open it, I will add a regex that checks for any document(number).xml instead of only the single file document.xml. That should help you out.

In the meantime, can you check again by saving the file yourself once and then trying to parse it using officeParser?
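
For illustration, the regex check described above could look roughly like the sketch below. This is not officeParser's actual code: the names `MAIN_DOCUMENT_PATTERN` and `findMainDocumentEntries` are hypothetical, and the input is assumed to be a plain array of entry paths already read from the docx archive.

```javascript
// Hypothetical sketch, not officeParser's real implementation.
// Matches word/document.xml as well as numbered variants such as
// word/document2.xml, but not related parts like word/_rels/document2.xml.rels.
const MAIN_DOCUMENT_PATTERN = /^word\/document\d*\.xml$/;

// Returns every archive entry whose path looks like a main document part.
// `entryNames` is assumed to be an array of paths inside the docx zip.
function findMainDocumentEntries(entryNames) {
  return entryNames.filter((name) => MAIN_DOCUMENT_PATTERN.test(name));
}

// Usage with entry names similar to those in the reported file.
const entries = [
  "[Content_Types].xml",
  "word/document2.xml",
  "word/styles.xml",
  "word/_rels/document2.xml.rels",
];
console.log(findMainDocumentEntries(entries)); // [ 'word/document2.xml' ]
```

Anchoring the pattern with `^` and `$` keeps it from picking up relationship files or other parts whose names merely contain "document".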