python / cpython

The Python programming language
https://www.python.org

[Feature Request]: Add zstd support in tarfile #81276

Open ad12ffa1-c51d-4b13-a891-84eac406ed74 opened 5 years ago

ad12ffa1-c51d-4b13-a891-84eac406ed74 commented 5 years ago
BPO 37095
Nosy @gustaebel, @lilydjwg, @serhiy-storchaka, @animalize, @websurfer5, @erlend-aasland

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['type-feature', 'library', '3.10']
title = '[Feature Request]: Add zstd support in tarfile'
updated_at =
user = 'https://bugs.python.org/evan0greenup'
```
bugs.python.org fields:
```python
activity =
actor = 'yan12125'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'evan0greenup'
dependencies = []
files = []
hgrepos = []
issue_num = 37095
keywords = []
message_count = 7.0
messages = ['343945', '356498', '373583', '373634', '374123', '375472', '376095']
nosy_count = 11.0
nosy_names = ['lars.gustaebel', 'daniel.ugra', 'lilydjwg', 'serhiy.storchaka', 'wicher', 'malin', 'Jeffrey.Kintscher', 'evan0greenup', 'erlendaasland', 'Jerrod Frost', 'Anatol Pomozov']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue37095'
versions = ['Python 3.10']
```

ad12ffa1-c51d-4b13-a891-84eac406ed74 commented 5 years ago

Zstandard is getting more and more popular. It would be awesome if tarfile supported this compression format for .tar.zst files.

81154dec-86fe-49e4-a2f5-f8bf9dea6508 commented 4 years ago

Curious about this as well.

94eeb5c7-c1c3-4027-a0c9-334277ba635a commented 4 years ago

Is there any progress with this feature development?

Arch Linux uses Python tar library for its toolset. Arch devs are looking to add ZSTD support to the toolset but it needs this feature to be implemented.

b8e5a65c-bed4-4c5d-ba32-799f883ba638 commented 4 years ago

Add zstd support in tarfile

This requires the stdlib to contain a Zstandard module.

You can ask in the Ideas forum: https://discuss.python.org/c/ideas

serhiy-storchaka commented 4 years ago

The tarfile module supports arbitrary compression formats by using the stream mode. You only need to use a third-party library which provides zstd support.

Recent versions of the tar utility have options that explicitly support newer compression formats: --lzip, --lzma, --lzop, --zstd, so corresponding modes can be added to the tarfile module. But that requires support for these formats in the stdlib. It should be discussed on the Python-ideas mailing list.

https://mail.python.org/mailman3/lists/python-ideas.python.org/
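
For readers who want to try the stream-mode route described above, here is a minimal sketch. The helper names (`write_tar_stream`, `read_tar_stream`, `write_tar_zst`, `read_tar_zst`) are made up for illustration, and the two zstd wrappers assume the third-party `zstandard` package is installed; they are one possible way to wire it up, not an official recipe.

```python
import io
import tarfile


def write_tar_stream(compressed_out, files):
    """Write {name: bytes} as a tar stream into any writable binary file object."""
    # "w|" is tarfile's uncompressed stream mode: it only calls write() on the
    # file object, so any compressing wrapper can sit underneath it.
    with tarfile.open(fileobj=compressed_out, mode="w|") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))


def read_tar_stream(decompressed_in):
    """Read a tar stream strictly front to back and return {name: bytes}."""
    result = {}
    with tarfile.open(fileobj=decompressed_in, mode="r|") as tar:
        for member in tar:
            if member.isfile():
                # Each member must be read before moving on to the next one.
                result[member.name] = tar.extractfile(member).read()
    return result


def write_tar_zst(path, files, level=3):
    """zstd variant (assumption: the third-party `zstandard` package)."""
    import zstandard

    cctx = zstandard.ZstdCompressor(level=level)
    # Leaving the stream_writer context finishes the zstd frame.
    with open(path, "wb") as raw, cctx.stream_writer(raw) as zout:
        write_tar_stream(zout, files)


def read_tar_zst(path):
    """zstd variant (assumption: the third-party `zstandard` package)."""
    import zstandard

    dctx = zstandard.ZstdDecompressor()
    with open(path, "rb") as raw, dctx.stream_reader(raw) as zin:
        return read_tar_stream(zin)
```

The two generic helpers work with any file-like compressor, so they can be tried with `lzma.open()` or `gzip.open()` from the stdlib before bringing in a zstd library.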

b8e5a65c-bed4-4c5d-ba32-799f883ba638 commented 4 years ago

There are two zstd modules on pypi:

https://pypi.org/project/zstd/
https://pypi.org/project/zstandard/

The first one is too simple.

The second one is powerful, but has too many APIs:

- ZstdCompressorIterator
- ZstdDecompressorIterator
- ZstdCompressionReader
- ZstdCompressionWriter
- ZstdCompressionChunkerIterator (multi-thread compression)

IMO these are not necessary for stdlib.

In addition, a few things would need to be added, such as the max_length parameter and a ZstdFile class that can be integrated with the tarfile module. This is not a lot of work.

I looked at the zstd API; it's a bit simpler than lzma/bz2/zlib. With about a month of work, it should be possible to make a zstd module for the stdlib. The detailed API can then be discussed on Python-Ideas.

I once wanted to do this job, but it seems my time does not allow it. If anyone wants to do this work, please reply here.

FYI, Python 3.10 schedule: 3.10.0 beta 1: 2021-05-03 (No new features beyond this point.)
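
To illustrate what the max_length parameter mentioned above is for, here is a small sketch using the stdlib's `lzma.LZMADecompressor` as a stand-in, since its `decompress(data, max_length=...)` already has these semantics. The function name `iter_decompressed` is invented for this example; how a zstd module would actually expose this is an open question in this thread.

```python
import lzma


def iter_decompressed(compressed, max_length=4096):
    """Yield decompressed chunks of at most max_length bytes each."""
    decomp = lzma.LZMADecompressor()
    pending = compressed
    while not decomp.eof:
        # max_length caps the size of each returned chunk; unconsumed input
        # stays buffered inside the decompressor object.
        chunk = decomp.decompress(pending, max_length=max_length)
        pending = b""  # all input was handed over on the first call
        if chunk:
            yield chunk
        elif decomp.needs_input:
            # No output and nothing left to feed: the stream is truncated.
            raise EOFError("compressed data ended before the end-of-stream marker")
```

A file-like wrapper such as the proposed ZstdFile relies on this kind of bounded output so that `read(n)` on a highly compressible archive does not have to inflate everything into memory at once.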

b8e5a65c-bed4-4c5d-ba32-799f883ba638 commented 4 years ago

I have spent two weeks on this and the code is almost complete. A preview: https://github.com/animalize/cpython/pull/8/files

It is written directly for the stdlib, since there are already zstd modules on PyPI. In addition, the zstd API is simple, not as complicated as lzma's.

Can also use these:

1. argument clinic
2. multi-phase init
3. internal function _PyLong_AsInt

Techcable commented 2 years ago

@animalize wrote a pyzstd module that closely matches the gzip/lzma API

The other main contender zstandard is very advanced, but doesn't try to adapt to the stdlib tarfile API....

dralley commented 1 year ago

@animalize The PR you created is between branches on your own fork, is there any chance you could submit that PR against CPython upstream?

lgommans commented 9 months ago

Was looking whether zstd support was being worked on or if I could help, similar to the existing bzip and related modules that are super convenient to have in stdlib (thanks to whoever made those, in case they're around!). Happy to see u/animalize worked on it but... their user is deleted now? :(

Does anyone have a copy of the code and know what license it was under?

Edit: I also signed up for and replied on the related discourse forum thread: https://discuss.python.org/t/integrate-zstd-compression-in-tarfile-module/7013

dralley commented 9 months ago

So, uh, by proxy does that mean that pyzstd is now unmaintained? Seems like it would, he's the only maintainer.

I dunno if perhaps someone at Github could return an archive of that repo / PR?

Worst case, the source code tarball can be downloaded from PyPI and then the PR turning it into a module can be rewritten. The license is declared as 3-clause BSD.

hugovk commented 9 months ago

@lgommans You can see animalize's changes on the Wayback Machine (be patient, it takes a while to load):

https://web.archive.org/web/20231214201705/https://github.com/animalize/cpython/pull/8/files

hugovk commented 9 months ago

@dralley https://web.archive.org/web/20231126145554/https://github.com/animalize/pyzstd shows the repo was still active at least as late as November 2023, and had two other contributors. Checking their forks, and poking around some other links:

hauntsaninja commented 9 months ago

animalize was definitely gone by mid-December (I tried to look it up). I use indygreg's zstandard. The documentation buries the one-shot APIs a little, but they work great.

dralley commented 9 months ago

@lgommans You can potentially download the latest release from PyPI (tarball) and work from that.

Unfortunately there's a fair number of changes in 2023 that aren't captured by any of the forks.

zstandard is a great library but it doesn't mesh quite as well with the stdlib style.

hauntsaninja commented 9 months ago

zstandard does have simple one-shot APIs: zstandard.compress / zstandard.decompress. Its documentation just buries them a little. Unless you meant something else?

helmutg commented 8 months ago

For those still searching for a quick solution (based on zstandard):

```python
import tarfile
import typing


class TarFile(tarfile.TarFile):
    """Subclass of tarfile.TarFile that can read and write zstd compressed archives."""

    # Register "zst" so that TarFile.open(name, "r:zst") dispatches to zstopen.
    OPEN_METH = {"zst": "zstopen"} | tarfile.TarFile.OPEN_METH

    @classmethod
    def zstopen(
        cls,
        name: str,
        mode: typing.Literal["r", "w", "x"] = "r",
        fileobj: None = None,
    ) -> tarfile.TarFile:
        if mode not in ("r", "w", "x"):
            raise NotImplementedError(f"mode `{mode}' not implemented for zst")
        if fileobj is not None:
            raise NotImplementedError("zst does not support a fileobj yet")
        try:
            import zstandard
        except ImportError:
            raise tarfile.CompressionError("zstandard module not available")
        if mode == "r":
            zfobj = zstandard.open(name, "rb")
        else:
            zfobj = zstandard.open(
                name,
                mode + "b",
                cctx=zstandard.ZstdCompressor(write_checksum=True, threads=-1),
            )
        try:
            tarobj = cls.taropen(name, mode, zfobj)
        except (OSError, EOFError, zstandard.ZstdError) as exc:
            zfobj.close()
            if mode == "r":
                raise tarfile.ReadError("not a zst file") from exc
            raise
        except:
            zfobj.close()
            raise
        # Setting the _extfileobj attribute is important to signal a need to
        # close this object and thus flush the compressed stream.
        # Unfortunately, tarfile.pyi doesn't know about it.
        tarobj._extfileobj = False  # type: ignore
        return tarobj
```

This is not perfect and does not handle file objects, but it may be good enough for some use cases. I am the author of this code and explicitly grant a MIT license on it as the original tarfile.py also is MIT licensed.

nanonyme commented 3 months ago

> The tarfile module supports arbitrary compression formats by using the stream mode. You only need to use a third-party library which provides zstd support.
>
> Recent versions of the tar utility have options that explicitly support newer compression formats: --lzip, --lzma, --lzop, --zstd, so corresponding modes can be added to the tarfile module. But that requires support for these formats in the stdlib. It should be discussed on the Python-ideas mailing list.
>
> https://mail.python.org/mailman3/lists/python-ideas.python.org/

Doesn't tarfile say "However, such a TarFile object is limited in that it does not allow random access" for this stream mode? So while it may be sufficient, there are significant limitations compared to real zstd support.
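
The limitation can be seen with a small stdlib-only sketch. The helper names `build_tar` and `read_first_after_scan` are made up for this example, and an uncompressed archive is used so nothing outside the stdlib is needed; the behaviour comes from the stream mode itself, not from the compression format.

```python
import io
import tarfile


def build_tar(files):
    """Build an in-memory uncompressed tar archive from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_first_after_scan(archive, mode):
    """List every member first, then go back and read the first one."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode=mode) as tar:
        members = list(tar)  # walks the archive to the end
        # Going back to the first member needs a backward seek.
        return tar.extractfile(members[0]).read()


archive = build_tar({"a.txt": b"first", "b.txt": b"second"})

# Seekable mode: random access works.
print(read_first_after_scan(archive, "r:"))

# Stream mode: the same access pattern is rejected.
try:
    read_first_after_scan(archive, "r|")
except tarfile.StreamError as exc:
    print("stream mode:", exc)
```

So anything that lists the archive and then extracts selected members has to either read in a single forward pass or reopen the stream, which is the gap compared to a seekable mode such as the existing `r:gz` or `r:xz`.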