Closed 1d061fed-2a71-4857-94c9-27ccf791e5bb closed 7 years ago
As promised in this thread http://mail.python.org/pipermail/python-dev/2009-August/091450.html (a year ago!), attached is a patch that replaces simple zip decryption logic written in pure python with that in C.
As reported in the link above, this can result in speedups up to couple of orders of magnitude.
There doesn't seem to be any need to add any new tests as this patch doesn't change any public API
It would be nice to retain the pure python version as a fallback for non CPython implementations, that will require tweaking the tests to make sure both are tested.
I have updated the patch with a check for the availability of C impl and to use pure-py impl as a fallback.
How do you suggest would the tests change? As I had mentioned before, in my understanding since there is no change in the API the already existing tests should work.
One can simulate the absence of C impl in a system where it is present but AFAIU this is not what it is usually done (e.g, in the case of optional zlib dependency in the same module)
It is what is normally done *now* when there is both a C and a python implementation (see, for example, test_datetime.py and test_io.py for two different approaches to that). Not all tests have been updated to this practice.
Thanks for working on this.
Attached is a patch with changes in Lib/test/test_zipfile.py to test both C and pure-py impls (on systems where the C impl is present).
Admittedly, this approach to emulating the absence of C impl is a bit hacky. This is primarily because the changed class is not a part of the public API and hence not being tested directly.
David, could you verify that the approach is ok?
Hello,
Some quick comments:
the C module should be private and therefore called _zipdecrypt done
if you want to avoid API mismatch, you could give a tp_call to your C >decrypter object, rather than a "decrypt" method done
- you can put all initialization code in zipdecrypt_new and avoid the >need for zipdecrypt_init keeping this similar to the existing _ZipDecrypter class in ZipFile (all initialization in init rather than new), which was probably to allow re-init and re-use of one instance
it's better to use the "y" code in PyArg_ParseTuple, rather than "s#" y does not seem to be available in 2.7, using s* instead
you should define your module as PY_SSIZE_T_CLEAN and use Py_ssize_t >as length variables (rather than int) done
you *mustn't change the contents of the buffer which is given you by >"s#" or "y", since that buffer is read-only (it can be a bytes >object); instead, create a new bytes object using >PyBytes_FromStringAndSize(NULL, length) and write into that; or, if >you want a read-write buffer, use the "w" code corrected, not altering the input buffer, reading input buffer as s\
I had uploaded an incorrect patch. New corrected patch against trunk (on Mac OS uploaded).
Attached you will find the updated patch for the python 3 tree as of now.
I have measured a speed-up of more than a factor 100.
Patch for python 2.7
Same patch as for python 3, backported to python 2.7
Tested on Linux only.
I quote from Gregory P. Smith (msg91897):
""" The decryption provided by the zipfile module is for the worthless 32-bit crc based "encryption" of zipfiles. I think promoting the use of that is a bad idea.
zipfile can be used by people to get their data out of such files. We should not encourage them to put it and/or their code into such a stupid format. """
I think that the effort required for speedup of this almost useless feature is excessive.
If someone want to implement the "strong encryption" for zip files - welcome.
See also criticism in the original discussion: http://mail.python.org/pipermail/python-dev/2009-August/091450.html .
My use case is decrypting files of 100's of megabytes. This is so slow that it is quite useless. About an hour or so.
I do agree that the encryption is worthless, but that is not important for my use case where I want to discourage people from reverse engineering the contents. If it is so dangerous as some people have pointed out, it should be removed at the cost of not supporting a standard feature of ZIP files. In my opinion you either support a feature and you support it good (efficient) or you don't. As it stands now, users will be disappointed in using a supported feature.
Some people argue that adding C code to Python is dangerous as it will lead to bugs, vulnerabilities etc. You could dismiss every addition with C code to Python with this argument, so there must be some positive aspects to outweigh the negative side. The negative side is fairly small (50 lines of very simple C code), plus some standard Python glue code. The benefit is a 100 fold increase in performance and the removal of 1 line of documentation telling that this feature is extremely slow. (patch attached)
We aren't particularly interested in helping people make their files slightly harder to reverse engineer, either, so I don't think that is a good enough reason for accepting this. There might be other reasons that are good enough, but I don't think that is one of them.
From the zlib FAQ:
How can I encrypt/decrypt zip files with zlib?
zlib doesn't support encryption. The original PKZIP encryption is very weak and can be broken with freely available programs. To get strong encryption, use GnuPG, http://www.gnupg.org/ , which already includes zlib compression. For PKZIP compatible "encryption", look at http://www.info-zip.org/
I don't see the point of a weak and easily breakable encryption.
If the encryption is so horrible why is there any support (with bad performance) at all in Python? It would be better to remove it altogether. This prevents users from building software using this feature only to find out later how bad the performance is. (This is the main reason why I have submitted this patch.) If the support had not been in Python I would not have used this feature to begin with.
To reiterate my previous point. Either support something and do it well, or don't. Don't do a half job.
Please note that there are more attempts to fix this, so I am not completely alone here. http://pypi.python.org/pypi/czipfile
If the encryption is so horrible why is there any support (with bad performance) at all in Python? It would be better to remove it altogether.
We don't remove it as it would break existing programs which rely on this feature. However adding a bunch of C code to improve performance of such a questionable feature is controversial.
I wouldn't be against the acceleration myself, however I am not interested in reviewing or accepting it, sorry.
Here is a patch which optimize (speed up 2x) Python implementation of ZIP decryptor. It is almost the maximum of what can be achieved without significant degradation of maintainability.
Of course, 2x is less then 100x, but it more portable and costs almost nothing. If that's not enough, I suggest to use an external unzip. You can even run multiple unzips at a time for the parallel extraction of multiple files.
If in the future someone will implement the strong encryption for ZIP files, it is possible it will required a C accelerator module and it is possible there will be a place for PKWARE's traditional encryption.
If the encryption is so horrible why is there any support (with bad performance) at all in Python?
I would say it there so that people can use python to "decrypt" an "encrypted" zip archive they have been sent that was generated by some other tool. I would say this is not a particularly strong use case since there are other tools that can be used for this, but as Antoine said removing the "feature" could break existing working code, so we are unlikely to do it.
The current situation is now that the decryption is part of Python. It is well known to be computationally intensive and should therefore be implemented in C.
This patch provides that support.
The discussion if Python should support the decryption is behind us, as the support is already in.
The only discussion should be about if there are enough users wanting this performance improvement to add it.
From the download statistics of czipfile I would say that there are roughly 2500 users interested.
Brian, you requested tests, but PR 550 doesn't add new API and doesn't change the behavior of public API. No new tests are needed.
New changeset 06e522521c06671b4559eecf9e2a185c2d62c141 by Serhiy Storchaka in branch 'master': bpo-10030: Sped up reading encrypted ZIP files by 2 times. (#550) https://github.com/python/cpython/commit/06e522521c06671b4559eecf9e2a185c2d62c141
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields: ```python assignee = None closed_at =
created_at =
labels = ['3.7', 'library', 'performance']
title = 'Patch for zip decryption speedup'
updated_at =
user = 'https://bugs.python.org/shashank'
```
bugs.python.org fields:
```python
activity =
actor = 'serhiy.storchaka'
assignee = 'none'
closed = True
closed_date =
closer = 'serhiy.storchaka'
components = ['Library (Lib)']
creation =
creator = 'shashank'
dependencies = []
files = ['19135', '19149', '19196', '19197', '19347', '19495', '27867', '27868', '27881', '27882']
hgrepos = []
issue_num = 10030
keywords = ['patch']
message_count = 22.0
messages = ['118030', '118039', '118110', '118112', '118405', '118420', '119476', '120417', '174699', '174700', '174789', '174791', '174793', '174794', '174795', '174803', '174806', '174816', '174823', '174830', '290497', '290857']
nosy_count = 7.0
nosy_names = ['pitrou', 'christian.heimes', 'r.david.murray', 'brian.curtin', 'shashank', 'serhiy.storchaka', 'rhdv']
pr_nums = ['550']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'performance'
url = 'https://bugs.python.org/issue10030'
versions = ['Python 3.7']
```