Open Matthias1975 opened 2 years ago
There is some prior art on this if you search around, maybe in the PRs. We have no active developers or maintainers at the moment so you'll be on your own for this.
If I understand correctly: you have a file, test.docx, whose MD5 is XYZ. You open it with python-docx and close it without changing anything, and then the hash is no longer XYZ? And that's because python-docx alters the modified date by default?
Can you give a code example to reproduce?
> If I understand correctly: you have a file, test.docx, whose MD5 is XYZ. You open it with python-docx and close it without changing anything, and then the hash is no longer XYZ? And that's because python-docx alters the modified date by default?
No, python-docx stores a creation date in the file, so an otherwise identical document gets a different MD5. A creation date serves no purpose inside the file; file systems already store file metadata such as the creation date.
@Matthias1975 there is an active fork of python-docx; maybe check it out and report this problem at https://github.com/HiTalentAlgorithms/python-docx
Hi,
When saving a docx with python-docx, the individual files in the zip archive get the current date and time as their modification date.
Re-creating the file from the same input data therefore produces results that are not bit-identical.
If one manages docx files in a version control system such as SVN or git, the files are marked as changed although their content is identical.
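The effect can be reproduced with the standard library alone, without python-docx. The following is a minimal sketch, not python-docx code: `build_zip` and `md5_hex` are hypothetical helper names, and the timestamps are passed in explicitly so the behaviour does not depend on the wall clock.

```python
import hashlib
import io
import zipfile


def build_zip(members, date_time):
    """Build a zip archive in memory, stamping every member with date_time.

    members   -- dict mapping member name to bytes content
    date_time -- (year, month, day, hour, minute, second) tuple
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, blob in members.items():
            info = zipfile.ZipInfo(name, date_time=date_time)
            # A ZipInfo defaults to ZIP_STORED, so request compression explicitly.
            info.compress_type = zipfile.ZIP_DEFLATED
            zf.writestr(info, blob)
    return buf.getvalue()


def md5_hex(data):
    """Return the hex MD5 digest of a bytes object."""
    return hashlib.md5(data).hexdigest()


if __name__ == "__main__":
    parts = {"word/document.xml": b"<w:document/>"}
    first = build_zip(parts, (2023, 5, 1, 10, 0, 0))
    second = build_zip(parts, (2023, 5, 1, 10, 5, 0))
    # Same content, different member timestamps: the digests differ.
    print(md5_hex(first) == md5_hex(second))
```

The member timestamp is written into the archive's headers, so two archives with the same payload but different timestamps differ at the byte level.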
The cause lies in the module "phys_pkg.py", class "_ZipPkgWriter", method "write". Is it possible to change the method as follows?
old:
new:
Btw, if a docx is saved with Word (Office365), the date is also always 1980-1-1.
Thanks a lot Matthias
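Until the writer itself is changed, one possible workaround is to post-process the saved file and rewrite every zip member with a fixed timestamp. This is only a sketch under stated assumptions: `normalize_zip_timestamps` and `FIXED_DATE` are hypothetical names, not part of python-docx, and the function does not reproduce the patch proposed above. It also assumes the archive fits in memory.

```python
import io
import zipfile

# 1980-01-01 00:00:00 is the earliest timestamp a zip entry can carry.
FIXED_DATE = (1980, 1, 1, 0, 0, 0)


def normalize_zip_timestamps(data):
    """Return a copy of the zip archive in `data` with all member dates fixed.

    Member order and content are preserved; every member is re-compressed
    with deflate and stamped with FIXED_DATE.
    """
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, zipfile.ZipFile(out, "w") as dst:
        for item in src.infolist():
            info = zipfile.ZipInfo(item.filename, date_time=FIXED_DATE)
            info.compress_type = zipfile.ZIP_DEFLATED
            info.external_attr = item.external_attr  # keep permission bits
            dst.writestr(info, src.read(item.filename))
    return out.getvalue()


if __name__ == "__main__":
    import sys

    # Usage: python normalize.py input.docx output.docx
    with open(sys.argv[1], "rb") as f:
        normalized = normalize_zip_timestamps(f.read())
    with open(sys.argv[2], "wb") as f:
        f.write(normalized)
```

Two files saved at different times from the same input should then compare equal after normalization, provided nothing else in the package differs (for example a timestamp stored inside one of the XML parts).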