Closed osafaimal closed 4 months ago
Hey @osafaimal, great to see you back! Hope you're doing well.
Based on the error message you're receiving, it seems like the issue might be related to the handling of the .docx file format during the file reading process. .docx files are essentially ZIP archives containing XML and other files that make up the document structure. The error message indicates that the process attempting to open the .docx file does not recognize it as a valid ZIP archive.
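Because a .docx file is a ZIP archive, the failing check can be reproduced with the Python standard library alone, without LlamaIndex or docx2txt. The following is a minimal sketch under stated assumptions: the helper name `looks_like_docx` is hypothetical, and the test for a `word/document.xml` entry reflects the layout Word normally produces rather than a guarantee of the format.

```python
import zipfile
from pathlib import Path


def looks_like_docx(path: Path) -> bool:
    """Return True if `path` is a ZIP archive that contains word/document.xml.

    Hypothetical helper for troubleshooting; not part of LlamaIndex or docx2txt.
    """
    # A file that fails this check is what produces "File is not a zip file".
    if not zipfile.is_zipfile(path):
        return False
    # Assumption: a Word document stores its main body at word/document.xml.
    with zipfile.ZipFile(path) as archive:
        return "word/document.xml" in archive.namelist()
```

Calling `looks_like_docx(Path("data/report.docx"))` (the path is only an example) on one of the failing files shows whether the problem lies in the file itself or in the reader.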
Here are a few steps you can take to troubleshoot this issue:
Verify .docx File Integrity: Ensure that the .docx files you are trying to load are not corrupted and can be opened with standard document viewing software like Microsoft Word or LibreOffice Writer.
Check DocxReader Implementation: The DocxReader class in LlamaIndex uses the docx2txt library to process .docx files. It's possible that there might be an issue with how DocxReader is handling the files. Here's the implementation of the DocxReader class for reference:
class DocxReader(BaseReader):
    """Docx parser."""

    def load_data(
        self, file: Path, extra_info: Optional[Dict] = None
    ) -> List[Document]:
        """Parse file."""
        try:
            import docx2txt
        except ImportError:
            raise ImportError(
                "docx2txt is required to read Microsoft Word files: "
                "`pip install docx2txt`"
            )
        text = docx2txt.process(file)
        metadata = {"file_name": file.name}
        if extra_info is not None:
            metadata.update(extra_info)
        return [Document(text=text, metadata=metadata or {})]
(Source: llama-index-legacy/llama_index/legacy/readers/file/docs_reader.py)
Check Dependencies: Ensure that the required libraries, including docx2txt, are correctly installed and compatible with each other and with your Python version.
If you've checked all of these and the issue persists, it would be helpful if you could provide more information about the specific .docx files you're trying to load and any additional details about your environment. This will help in further diagnosing the issue.
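To narrow down which files trigger the error, one option is to walk the data directory and list every .docx file that is not a valid ZIP archive, together with its first few bytes. This is only a sketch: the function name `find_non_zip_docx` is hypothetical, and the causes mentioned in the comments are possibilities to check, not something this thread confirms.

```python
import zipfile
from pathlib import Path
from typing import List, Tuple


def find_non_zip_docx(root: Path) -> List[Tuple[Path, bytes]]:
    """Return (path, first 8 bytes) for each .docx under `root` that is not a ZIP.

    Hypothetical troubleshooting helper; not part of LlamaIndex or docx2txt.
    """
    bad = []
    # rglob searches subdirectories too, like recursive=True in the reader call.
    for path in sorted(root.rglob("*.docx")):
        if not path.is_file():
            continue
        if not zipfile.is_zipfile(path):
            with path.open("rb") as handle:
                head = handle.read(8)
            bad.append((path, head))
    return bad


# Example usage (the directory name "data" is taken from the bug report):
# for path, head in find_non_zip_docx(Path("data")):
#     print(f"{path}: first bytes {head!r}")
```

Two things worth looking for in the output, both assumptions rather than confirmed causes: file names starting with `~$`, which are lock files Word creates next to open documents and which are not ZIP archives despite the .docx extension, and files whose first bytes are `D0 CF 11 E0`, which suggests a legacy .doc file saved with a .docx extension. A real ZIP archive normally starts with the bytes `PK`.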
So: 1. I can open them in Word. 2. I don't know how to check the DocxReader implementation, but I believe it should be okay, otherwise many other people would have the problem. 3. I don't use zip, so I don't understand why it mentions zip.
I think it's just failing to open the format. Something about the file format is not compatible with docx2txt.
OK, I will search. Let me know if you have any idea how I can troubleshoot this.
Bug Description
For all my .docx files I get this error:
Failed to load file [...].docx with error: File is not a zip file. Skipping...
To load them I use:
documents = SimpleDirectoryReader("data",recursive=True).load_data()
I have installed docx2txt.
Version
0.10.18
Steps to Reproduce
Just do that (at least, when I do only that, I get the error):
Relevant Logs/Tracebacks