Open ad12ffa1-c51d-4b13-a891-84eac406ed74 opened 5 years ago
Zstandard is getting more and more popular. It would be awesome if tarfile supported this compression format for .tar.zst files.
Curious about this as well.
Is there any progress with this feature development?
Arch Linux uses Python's tarfile library for its toolset. Arch devs are looking to add ZSTD support to the toolset, but that requires this feature to be implemented first.
Add zstd support in tarfile
This requires the stdlib to contain a Zstandard module.
You can ask in the Idea forum: https://discuss.python.org/c/ideas
The tarfile module supports arbitrary compressions by using the stream mode. You only need to use a third-party library which provides zstd support.
Recent versions of the tar utility have options that explicitly support newer compression formats: --lzip, --lzma, --lzop, --zstd, so corresponding modes could be added to the tarfile module. But that requires support for these compression formats in the stdlib. It should be discussed on the Python-ideas mailing list.
https://mail.python.org/mailman3/lists/python-ideas.python.org/
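For anyone wondering what the stream-mode approach looks like in practice, here is a minimal sketch. It assumes the third-party `zstandard` package and its `stream_reader` / `stream_writer` wrappers; if that package is not installed, the sketch falls back to gzip purely so that it still runs. The helper names `open_compressed`, `pack` and `unpack` are made up for illustration.

```python
import gzip
import io
import tarfile

try:
    import zstandard  # third-party; assumed installed for real zstd output
except ImportError:
    zstandard = None  # stand-in: fall back to gzip so the sketch still runs


def open_compressed(raw, mode):
    """Wrap a binary file object in a compressing or decompressing stream."""
    if zstandard is None:
        return gzip.GzipFile(fileobj=raw, mode=mode)
    if mode == "rb":
        return zstandard.ZstdDecompressor().stream_reader(raw, closefd=False)
    return zstandard.ZstdCompressor().stream_writer(raw, closefd=False)


def pack(files):
    """Return compressed tar bytes for a {name: bytes} mapping."""
    raw = io.BytesIO()
    with open_compressed(raw, "wb") as comp:
        # "w|" is tarfile's uncompressed stream mode: it only calls write().
        with tarfile.open(fileobj=comp, mode="w|") as tar:
            for name, data in files.items():
                info = tarfile.TarInfo(name)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
    return raw.getvalue()


def unpack(blob):
    """Read every regular file from compressed tar bytes, front to back."""
    out = {}
    with open_compressed(io.BytesIO(blob), "rb") as decomp:
        # "r|" only calls read(), so no seek support is needed.
        with tarfile.open(fileobj=decomp, mode="r|") as tar:
            for member in tar:
                if member.isfile():
                    out[member.name] = tar.extractfile(member).read()
    return out
```

The point of the `|` modes is that tarfile never seeks on the wrapped object, so any object with `read()` or `write()` works, whatever compression sits underneath.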
There are two zstd modules on pypi:
https://pypi.org/project/zstd/
https://pypi.org/project/zstandard/
The first one is too simple.
The second one is powerful, but has too many APIs: ZstdCompressorIterator, ZstdDecompressorIterator, ZstdCompressionReader, ZstdCompressionWriter, ZstdCompressionChunkerIterator (multi-thread compression)
IMO these are not necessary for stdlib.
In addition, a few things would need to be added, such as a `max_length` parameter and a `ZstdFile` class that can be integrated with the tarfile module. That is not a lot of work.
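To illustrate what a `max_length` parameter buys you, here is a small sketch using the stdlib `lzma` decompressor as a stand-in, since it already follows that convention. It is an assumption that a zstd decompressor for the stdlib would follow the same pattern; the helper name `decompress_in_chunks` is made up.

```python
import lzma


def decompress_in_chunks(blob, max_length=64):
    """Yield decompressed pieces of at most max_length bytes each."""
    decomp = lzma.LZMADecompressor()
    data = blob
    while not decomp.eof:
        # At most max_length bytes come back; the rest stays buffered
        # inside the decompressor until the next call.
        chunk = decomp.decompress(data, max_length=max_length)
        data = b""  # later calls only drain the internal buffer
        if not chunk and decomp.needs_input:
            break  # input ended before the end of the stream
        yield chunk
```

Bounding the output per call is what allows a file-like wrapper to answer `read(n)` without inflating the whole stream into memory first.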
I looked at the zstd API; it's a bit simpler than lzma/bz2/zlib. With about a month of work, it should be possible to write a zstd module for the stdlib. The detailed API could then be discussed on Python-Ideas.
I once wanted to do this job, but it seems my time does not allow it. If anyone wants to do this work, please reply here.
FYI, Python 3.10 schedule: 3.10.0 beta 1: 2021-05-03 (No new features beyond this point.)
I have spent two weeks on this and the code is almost complete. A preview: https://github.com/animalize/cpython/pull/8/files
It is written directly for the stdlib, since there are already zstd modules on PyPI. In addition, the zstd API is simple, not as complicated as lzma.
It can also use these: 1. Argument Clinic 2. multi-phase init
@animalize The PR you created is between branches on your own fork, is there any chance you could submit that PR against CPython upstream?
Was looking whether zstd support was being worked on or if I could help, similar to the existing bzip and related modules that are super convenient to have in stdlib (thanks to whoever made those, in case they're around!). Happy to see u/animalize worked on it but... their user is deleted now? :(
Does anyone have a copy of the code and know what license it was under?
Edit: I also signed up for and replied on the related discourse forum thread: https://discuss.python.org/t/integrate-zstd-compression-in-tarfile-module/7013
So, uh, by proxy does that mean that `pyzstd` is now unmaintained? Seems like it would, he's the only maintainer.
I dunno if perhaps someone at Github could return an archive of that repo / PR?
Worst case, the source code tarball can be downloaded from PyPI and then the PR turning it into a module can be rewritten. The license is declared as 3-clause BSD.
@lgommans You can see animalize's changes on the Wayback Machine (be patient, it takes a while to load):
https://web.archive.org/web/20231214201705/https://github.com/animalize/cpython/pull/8/files
@dralley https://web.archive.org/web/20231126145554/https://github.com/animalize/pyzstd shows the repo was still active at least as late as November 2023, and had two other contributors. Checking their forks, and poking around some other links:
animalize was definitely gone by mid-December (I tried to look it up). I use indygreg's zstandard. The documentation buries the one-shot APIs a little, but they work great.
@lgommans You can potentially download the latest release from PyPI (tarball) and work from that.
Unfortunately there's a fair number of changes in 2023 that aren't captured by any of the forks.
`zstandard` is a great library but it doesn't mesh quite as well with the stdlib style.
`zstandard` does have simple one-shot APIs: `zstandard.compress` / `zstandard.decompress`. Its documentation just buries them a little. Unless you meant something else?
For those still searching for a quick solution (based on zstandard):
```python
import tarfile
import typing


class TarFile(tarfile.TarFile):
    """Subclass of tarfile.TarFile that can read and write zstd compressed archives."""

    # Register "zst" so that TarFile.open(name, "r:zst") dispatches to zstopen().
    # The dict union operator requires Python 3.9+.
    OPEN_METH = {"zst": "zstopen"} | tarfile.TarFile.OPEN_METH

    @classmethod
    def zstopen(
        cls,
        name: str,
        mode: typing.Literal["r", "w", "x"] = "r",
        fileobj: None = None,
    ) -> tarfile.TarFile:
        if mode not in ("r", "w", "x"):
            raise NotImplementedError(f"mode `{mode}' not implemented for zst")
        if fileobj is not None:
            raise NotImplementedError("zst does not support a fileobj yet")

        try:
            import zstandard
        except ImportError:
            raise tarfile.CompressionError("zstandard module not available") from None

        if mode == "r":
            zfobj = zstandard.open(name, "rb")
        else:
            zfobj = zstandard.open(
                name,
                mode + "b",
                cctx=zstandard.ZstdCompressor(write_checksum=True, threads=-1),
            )

        try:
            tarobj = cls.taropen(name, mode, zfobj)
        except (OSError, EOFError, zstandard.ZstdError) as exc:
            zfobj.close()
            if mode == "r":
                raise tarfile.ReadError("not a zst file") from exc
            raise
        except:
            zfobj.close()
            raise

        # Setting the _extfileobj attribute is important to signal a need to
        # close this object and thus flush the compressed stream.
        # Unfortunately, tarfile.pyi doesn't know about it.
        tarobj._extfileobj = False  # type: ignore
        return tarobj
```
This is not perfect and does not handle file objects, but it may be good enough for some use cases. I am the author of this code and explicitly grant an MIT license on it, as the original tarfile.py is also MIT licensed.
> The tarfile module supports arbitrary compressions by using the stream mode. You only need to use a third-party library which provides zstd support.

Doesn't tarfile say "However, such a TarFile object is limited in that it does not allow random access" for this stream mode? So while it may be sufficient, there are significant limitations compared to real zstd support.
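The limitation is easy to reproduce with the stdlib alone. The sketch below uses an uncompressed in-memory archive (so no zstd library is involved) and tries to read the first member after the whole archive has already been iterated; the helper names are made up for illustration.

```python
import io
import tarfile


def build_archive():
    """Return the bytes of an uncompressed tar with two small members."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w") as tar:
        for name in ("first.txt", "second.txt"):
            data = name.encode()
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return raw.getvalue()


def can_reread_first_member(mode):
    """Iterate the whole archive, then try to go back to the first member."""
    with tarfile.open(fileobj=io.BytesIO(build_archive()), mode=mode) as tar:
        members = list(tar)  # reads through to the end of the archive
        try:
            tar.extractfile(members[0]).read()
        except tarfile.StreamError:
            return False  # stream mode refuses to seek backwards
        return True
```

With `"r:"` (seekable, uncompressed) the call returns `True`; with `"r|"` it returns `False`, because going back to an earlier member means seeking backwards in the stream.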
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['type-feature', 'library', '3.10']
title = '[Feature Request]: Add zstd support in tarfile'
updated_at =
user = 'https://bugs.python.org/evan0greenup'
```
bugs.python.org fields:
```python
activity =
actor = 'yan12125'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'evan0greenup'
dependencies = []
files = []
hgrepos = []
issue_num = 37095
keywords = []
message_count = 7.0
messages = ['343945', '356498', '373583', '373634', '374123', '375472', '376095']
nosy_count = 11.0
nosy_names = ['lars.gustaebel', 'daniel.ugra', 'lilydjwg', 'serhiy.storchaka', 'wicher', 'malin', 'Jeffrey.Kintscher', 'evan0greenup', 'erlendaasland', 'Jerrod Frost', 'Anatol Pomozov']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue37095'
versions = ['Python 3.10']
```