python / cpython

The Python programming language
https://www.python.org
Other
61.98k stars 29.8k forks source link

Create a pure Python zipfile/tarfile importer #61830

Closed brettcannon closed 9 years ago

brettcannon commented 11 years ago
BPO 17630
Nosy @warsaw, @brettcannon, @rhettinger, @gpshead, @jcea, @amauryfa, @ericvsmith, @phmc
Files
  • zip_importlib.diff
  • zip_importlib.diff: Added support for bytecode loading
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = created_at = labels = ['type-feature', 'library'] title = 'Create a pure Python zipfile/tarfile importer' updated_at = user = 'https://github.com/brettcannon' ``` bugs.python.org fields: ```python activity = actor = 'gregory.p.smith' assignee = 'none' closed = True closed_date = closer = 'gregory.p.smith' components = ['Library (Lib)'] creation = creator = 'brett.cannon' dependencies = [] files = ['30659', '30660'] hgrepos = [] issue_num = 17630 keywords = ['patch'] message_count = 12.0 messages = ['185967', '185968', '186002', '191546', '191554', '191558', '191580', '192140', '202446', '208829', '212329', '240730'] nosy_count = 9.0 nosy_names = ['barry', 'brett.cannon', 'rhettinger', 'gregory.p.smith', 'jcea', 'amaury.forgeotdarc', 'eric.smith', 'pconnell', 'dmi.baranov'] pr_nums = [] priority = 'low' resolution = 'rejected' stage = 'test needed' status = 'closed' superseder = None type = 'enhancement' url = 'https://bugs.python.org/issue17630' versions = ['Python 3.5'] ```

    brettcannon commented 11 years ago

    I'm going to write an importer using zipfile and importlib in pure Python. I'm doing this so that (a) there is an importer that relies on zipfile and thus what features it adds over time, (b) to have good example code to point people to when they need to implement their own importers, and (c) for me to see where there are API holes in importlib and the ABCs in order to continue to try and make things easier for people.

    Do realize that the last point means I will most likely be writing this myself no matter what people contribute (sorry). If people still want to tackle it to provide feedback on what importlib is lacking that's totally fine and welcome feedback, but since this is partially an exercise for myself I feel like I do need to type it all out from scratch to discover sticking points, even from an experienced perspective. But I'm happy to discuss design choices, etc. here with folks as this progresses.

    brettcannon commented 11 years ago

    I should mention I have an old implementation at https://code.google.com/p/importers/.

    rhettinger commented 11 years ago

    Thanks Brett. I look forward to this.

    brettcannon commented 11 years ago

    Here is an initial stab at a zip file importer using importlib. Probably the biggest shortcoming is that it doesn't support bytecode files, but that's because I just have not bothered to add support yet (it's just one method to implement). There is a note in zipimport that the resolution does not match up between zip file modification times and what bytecode files store and so there needs to be a one second fuzzing factor but I'm not seeing why based on the fact that os.stat().st_mtime is used by bytecode files which has a one second resolution already like zip files. The other shortcoming is bytecode-only files are not supported (on purpose as importlib.abc.SourceLoader doesn't support bytecode-only files).

    brettcannon commented 11 years ago

    I went ahead and added the needed method for bytecode loading; see the new patch.

    amauryfa commented 11 years ago

    Times in a ZIP files have a two-seconds resolution (in the old DOS FAT format: 5 bits for the hours, 6 bits for the minutes, and only 5 bits left for the seconds) Some fuziness logic is needed when comparing against a time_t.

    brettcannon commented 11 years ago

    A two second granularity? Ugh. That will require a new (possibly private) API to abstract that out in order to handle that case. What a pain. While I look into that if people can look at the code and let me know if they see anything wrong with it, hard to understand, or a way to improve I would appreciate it.

    brettcannon commented 11 years ago

    Paul's review did find one non-optimal thing the code is doing. I'll fix it the next time there's a need to upload a new patch.

    brettcannon commented 10 years ago

    This is going to need a touch-up for PEP-451 to make the whole thing work more smoothly, so repositioning for Python 3.5. When that happens I can look into the 2 second issue that Amaury pointed out and Paul's review.

    Although I do wonder about the utility of this work. Zipimport still gets around the bootstrapping problem that any pure Python zip implementation never will. Probably the only hope of being useful is to make this abstract enough to include a pluggable (de)compression setup so that lzma or bz2 can be used on top of a single file storage back-end like tar or zip.

    brettcannon commented 10 years ago

    Re-positioning to work with both tarfile and zipfile since tarfile's 'r' will transparently decompress as necessary. Might need to scale back some functionality to make it easily work with both formats. But since are both alternative storage solutions then some generic base classes should be possible (which might require some refactoring anyway to make it all abstract in comparison to the storage mechanism.

    brettcannon commented 10 years ago

    The more I think about this, the less useful it seems to be. We need zipimport written in C for bootstrapping issues. If that's the case then time should be put into making that work and not duplicating the functionality. I'm going to leave this open but unassign from myself as I would rather focus on my lazy loader issue than this. If someone can manage to create a zipfile importer whose dependencies are small and thus can be frozen along with importlib then this can be moved forward.

    gpshead commented 9 years ago

    Based on our hallway pow-wow at PyCon 2015 sprints day #1... I audited the zipfile module to confirm our suspicions about it being "large".

    In current Python 3.5 head's zipfile.py module here are the things it depends directly upon from other modules:

    import io  # [Py(C: _io)] io.BufferedIOBase{,readline}, io.BytesIO, io.open
    import os  # [Py(C: various)] os.path.*, os.getcwd, os.stat, os.listdir, curdir, pardir, 
    import re  # [Py(C: sre_*)] Only used for universal newlines in ZipExtFile.readline().  Shouldn't the io module do this part for us?!?
    import importlib.util  # [Py] importib.util.cache_from_source() from PyZipFile to create importable .zip files.
    import sys  # [C] sys.platform for creation metadata, sys.stderr for a strange print on a DeprecationWarning.
    import time  # [C] time.localtime, time.time
    import stat  # [Py] stat.filemode, stst.S_ISDIR
    import shutil  # [Py] shutil.copyfileobj() from _extract_member().
    import struct  # [Py(C: _struct)] struct.pack(), struct.unpack(), struct.calcsize()
    import binascii  # [C] binascii.crc32() only if zlib is unavailable. (store only zip support?)
    import threading  # [Py] threading.RLock() in ZipFile.
    
    import zlib, bz2, lzma are all conditional.  # [C]
    
    import warnings  # [Py] conditional import to highlight legacy uses.
    import py_compile  # [Py] conditional import in PyZipFile for importable .zip file creation.

    Some of these are obviously shims around C extensions which could conditionally be used directly if desired. But many others are largely implemented in Python. Freezing all of these just to use the bloated zipfile.py within a pure Python zipimport implementation seems like extreme overkill. Not worth the effort.

    Efforts for improved zipimport are now likely to focus on a simple C zip file reading-only library being used by a new clean implementation of a zip importer on top of that.