python-openxml / python-docx

Create and modify Word documents with Python
MIT License

customXML Error #1394

Closed lnkirkham-datasparq closed 2 months ago

lnkirkham-datasparq commented 2 months ago

I'm creating a document from a local .docx file, which I've done with a number of similar files without any issue:

from docx import Document
local_path = "path/to/my/file.docx"
my_document = Document(local_path)

But I'm encountering the following error with one file in particular:


    my_document = Document(local_path)
  File ".../miniconda3/envs/euclid-env/lib/python3.10/site-packages/docx/api.py", line 27, in Document
    document_part = cast("DocumentPart", Package.open(docx).main_document_part)
  File ".../miniconda3/envs/my-env/lib/python3.10/site-packages/docx/opc/package.py", line 127, in open
    pkg_reader = PackageReader.from_file(pkg_file)
  File ".../miniconda3/envs/my-env/lib/python3.10/site-packages/docx/opc/pkgreader.py", line 25, in from_file
    sparts = PackageReader._load_serialized_parts(
  File ".../miniconda3/envs/my-env/lib/python3.10/site-packages/docx/opc/pkgreader.py", line 53, in _load_serialized_parts
    for partname, blob, reltype, srels in part_walker:
  File ".../miniconda3/envs/my-env/lib/python3.10/site-packages/docx/opc/pkgreader.py", line 86, in _walk_phys_parts
    for partname, blob, reltype, srels in next_walker:
  File ".../miniconda3/envs/my-env/lib/python3.10/site-packages/docx/opc/pkgreader.py", line 81, in _walk_phys_parts
    blob = phys_reader.blob_for(partname)
  File ".../miniconda3/envs/my-env/lib/python3.10/site-packages/docx/opc/phys_pkg.py", line 83, in blob_for
    return self._zipf.read(pack_uri.membername)
  File ".../miniconda3/envs/my-env/lib/python3.10/zipfile.py", line 1485, in read
    with self.open(name, "r", pwd) as fp:
  File ".../miniconda3/envs/my-env/lib/python3.10/zipfile.py", line 1524, in open
    zinfo = self.getinfo(name)
  File ".../miniconda3/envs/my-env/lib/python3.10/zipfile.py", line 1451, in getinfo
    raise KeyError(
KeyError: "There is no item named 'customXML/item3.xml' in the archive"

Currently using `python-docx==0.8.11`

Is there something in particular with this file that is causing the issue?
Any advice on how to overcome the error?

Thanks

scanny commented 2 months ago

Hi @lnkirkham-datasparq, this would indicate a partial corruption in the .docx file.

Basically one of the document "parts" (sub-files, often XML) is indicating a relationship to another part (customXML/item3.xml in this case) but that part is not present in the package (zip archive).
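To see which relationship is dangling, something along these lines may help. It is only a sketch that uses the standard-library `zipfile` and `xml.etree.ElementTree` modules rather than python-docx or python-opc, and `find_dangling_relationships`, `source_dir` and `resolve_target` are names made up for this example. It assumes relationship targets are plain part paths, relative to the part that owns the `.rels` file unless they start with `/`.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by OPC relationship (.rels) files.
RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def source_dir(rels_name):
    """Return the directory that targets in `rels_name` are relative to."""
    # "word/_rels/document.xml.rels" -> "word"; "_rels/.rels" -> "" (package root)
    return posixpath.dirname(posixpath.dirname(rels_name))


def resolve_target(rels_name, target):
    """Resolve a relationship Target to the zip member name it refers to."""
    if target.startswith("/"):
        return posixpath.normpath(target).lstrip("/")
    return posixpath.normpath(posixpath.join(source_dir(rels_name), target))


def find_dangling_relationships(docx_path):
    """Return (rels_name, rId, member_name) for each internal target missing from the zip."""
    dangling = []
    with zipfile.ZipFile(docx_path) as zf:
        members = set(zf.namelist())
        for rels_name in sorted(n for n in members if n.endswith(".rels")):
            root = ET.fromstring(zf.read(rels_name))
            for rel in root.findall(f"{{{RELS_NS}}}Relationship"):
                if rel.get("TargetMode") == "External":
                    continue  # hyperlinks etc. do not point into the archive
                member = resolve_target(rels_name, rel.get("Target"))
                if member not in members:
                    dangling.append((rels_name, rel.get("Id"), member))
    return dangling
```

Calling `find_dangling_relationships("path/to/my/file.docx")` on the problem file should list the `.rels` file and relationship id that point at the missing part. The comparison is case-sensitive, as member lookup in `zipfile` is.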

Quick fix, if it's a onesie-twosie, is probably to open it in Word, let Word repair the file if it complains, make a trivial change such as adding a space and then deleting it (but don't undo it), then save the file.

If for whatever reason that doesn't suit, then you'll need to remove the relationship in question. python-opc can be helpful for that, but you'll need to install it from the develop branch, as it hasn't been released lately and the PyPI version won't work with Python 3.
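If installing python-opc is a hassle, a rough alternative is to rewrite the zip with the standard library and drop any relationship whose target part is missing. This is a sketch, not the python-opc approach: `strip_dangling_relationships` and `_member_for` are hypothetical names, and it assumes nothing else in the package still refers to the removed relationship id. It writes to a new file, so the original is left untouched.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
# Keep the relationships namespace as the default namespace when serializing.
ET.register_namespace("", RELS_NS)


def _member_for(rels_name, target):
    """Resolve a relationship Target to the zip member name it refers to."""
    if target.startswith("/"):
        return posixpath.normpath(target).lstrip("/")
    # Targets are relative to the directory of the part that owns the .rels file.
    base = posixpath.dirname(posixpath.dirname(rels_name))
    return posixpath.normpath(posixpath.join(base, target))


def strip_dangling_relationships(src_path, dst_path):
    """Copy src_path to dst_path, dropping relationships whose target part is missing.

    Returns the list of (rels_name, rId, member_name) entries that were dropped.
    """
    removed = []
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        members = set(src.namelist())
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename.endswith(".rels"):
                root = ET.fromstring(data)
                for rel in root.findall(f"{{{RELS_NS}}}Relationship"):
                    if rel.get("TargetMode") == "External":
                        continue  # external targets are not parts in the archive
                    member = _member_for(info.filename, rel.get("Target"))
                    if member not in members:
                        root.remove(rel)
                        removed.append((info.filename, rel.get("Id"), member))
                data = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
            dst.writestr(info.filename, data)
    return removed
```

Usage would be something like `strip_dangling_relationships("file.docx", "file-fixed.docx")`, then `Document("file-fixed.docx")`. Check the returned list before trusting the output, since a part that still uses a removed relationship id would be left inconsistent.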

I realize this response may use terms you're not familiar with, so feel free to ask any questions and I can explain or point you to resources.

lnkirkham-datasparq commented 2 months ago

Thanks for the quick response @scanny, I'll give the quick fix a go, but may follow up with a question if I go down the python-opc route!

lnkirkham-datasparq commented 2 months ago

The quick fix worked and is suitable for the small batch of docs I'm working with. Thanks again!

scanny commented 2 months ago

Glad you got it working Louise :)