python-openxml / python-docx

Create and modify Word documents with Python
MIT License

Creating bit-identical docx files #1042

Open Matthias1975 opened 2 years ago

Matthias1975 commented 2 years ago

Hi,

When saving a .docx file with python-docx, the individual members of the zip archive get the current date and time as their modification date.

Re-creating a file from the same input data therefore does not produce bit-identical results.

If you keep .docx files in a version control system such as SVN or git, they are flagged as changed even though their content is identical.

The cause lies in the module "phys_pkg.py", class "_ZipPkgWriter", method "write". Is it possible to change the method as follows?

old:

def write(self, pack_uri, blob):
    self._zipf.writestr(pack_uri.membername, blob)

new:

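from zipfile import ZIP_DEFLATED, ZipInfo

def write(self, pack_uri, blob):
    # Pin the member timestamp to the earliest date the zip format allows
    zinfo = ZipInfo(filename=pack_uri.membername, date_time=(1980, 1, 1, 0, 0, 0))
    # A ZipInfo defaults to ZIP_STORED, so request compression explicitly
    zinfo.compress_type = ZIP_DEFLATED
    self._zipf.writestr(zinfo, blob)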

Btw, if a docx is saved with Word (Office365), the date is also always 1980-1-1.
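
To check the idea outside python-docx, here is a minimal sketch that uses only the standard-library `zipfile` module. The helper name `build_package` and the two member names and payloads are placeholders I made up for illustration; they are not python-docx code. It writes the same parts twice with a pinned timestamp and compares the resulting bytes.

```python
import io
from zipfile import ZIP_DEFLATED, ZipFile, ZipInfo

# Earliest timestamp the zip format can represent.
FIXED_DATE_TIME = (1980, 1, 1, 0, 0, 0)


def build_package(parts):
    """Write (member_name, blob) pairs to an in-memory zip with pinned timestamps.

    Hypothetical helper for illustration only; not part of python-docx.
    """
    buf = io.BytesIO()
    with ZipFile(buf, "w", compression=ZIP_DEFLATED) as zipf:
        for member_name, blob in parts:
            zinfo = ZipInfo(filename=member_name, date_time=FIXED_DATE_TIME)
            # A ZipInfo defaults to ZIP_STORED, so ask for compression explicitly.
            zinfo.compress_type = ZIP_DEFLATED
            zipf.writestr(zinfo, blob)
    return buf.getvalue()


# Placeholder parts standing in for the members of a real .docx package.
parts = [
    ("[Content_Types].xml", b"<Types/>"),
    ("word/document.xml", b"<w:document/>"),
]

first = build_package(parts)
second = build_package(parts)
print(first == second)
```

With the timestamp pinned, the two byte strings should compare equal when both are built on the same machine with the same Python version. Output from different platforms or zlib builds may still differ.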

Thanks a lot Matthias

scanny commented 2 years ago

There is some prior art on this if you search around, maybe in the PRs. We have no active developers or maintainers at the moment so you'll be on your own for this.

AltayAkkus commented 2 years ago

If I understand right: you have a file, test.docx, that has the MD5 of XYZ. You open it with python-docx and close it without changing anything, and then the hash is not XYZ anymore? And that's because python-docx alters the modified date by default?

Can you give a code example to reproduce?
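
A rough way to see the effect without python-docx, as a sketch using only the standard-library `zipfile` module: the helper name `member_timestamps` and the member name and payload are assumptions for illustration. Passing a plain member name to `writestr`, as the current `write` method does, lets `zipfile` choose the timestamp itself, which is normally the time of the write.

```python
import io
from zipfile import ZipFile


def member_timestamps(package_bytes):
    """Map each zip member name to the date_time tuple stored in its header.

    Hypothetical helper for illustration only; not part of python-docx.
    """
    with ZipFile(io.BytesIO(package_bytes)) as zipf:
        return {info.filename: info.date_time for info in zipf.infolist()}


# Stand-in for a saved .docx: one member written by name only, so zipfile
# picks the timestamp instead of the caller.
buf = io.BytesIO()
with ZipFile(buf, "w") as zipf:
    zipf.writestr("word/document.xml", b"<w:document/>")

stamps = member_timestamps(buf.getvalue())
print(stamps)
```

Running the same helper on the bytes of a real saved .docx would show the timestamp stored for each part. Two saves made at different times would then differ in those header fields, and therefore in their hash, even when the XML payloads are the same.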

CanIGetaPR commented 2 years ago

> If I understand right: you have a file, test.docx, that has the MD5 of XYZ. You open it with python-docx and close it without changing anything, and then the hash is not XYZ anymore? And that's because python-docx alters the modified date by default?

No, python-docx stores a creation date in the file, so an otherwise identical document has a different MD5. The creation date has no purpose inside the file. File systems already store file metadata such as the creation date.

@Matthias1975 there is an active fork of python-docx; maybe check it out and report this problem at https://github.com/HiTalentAlgorithms/python-docx