indygreg / python-zstandard

Python bindings to the Zstandard (zstd) compression library
BSD 3-Clause "New" or "Revised" License

Migration from pyzstd library #216

Closed Rogdham closed 3 months ago

Rogdham commented 3 months ago

Hello, as you may know, the author of the pyzstd library has deleted their GitHub profile (btw the doc may need updating as a result).

Users of that library will probably fall back to python-zstandard as a result. It may be worth helping them with the migration, for example by listing the main usages of pyzstd and how to migrate each of them to python-zstandard.


The main pain point I have identified is that pyzstd provides a ZstdFile class, for which migration is not straightforward.

# pyzstd
>>> import pyzstd
>>> f = pyzstd.ZstdFile("file.zst")
>>> f.read(5)
b'Hello'
>>> f.seek(2)
2
>>> f.read(4)
b'llo,'
>>> f.peek(5)
b' world!\n'
>>> f.fileno()
3

# zstandard
>>> import zstandard
>>> f = zstandard.open("file.zst")
>>> f.read(5)
b'Hello'
>>> f.seek(2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: cannot seek zstd decompression stream backwards
>>> f.peek(5)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'zstd.ZstdDecompressionReader' object has no attribute 'peek'. Did you mean: 'seek'?
>>> f.fileno()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'zstd.ZstdDecompressionReader' object has no attribute 'fileno'

Maybe this could be ported to python-zstandard though. What do you think?

indygreg commented 3 months ago

.peek() might be doable. But reverse seeking scares me because of the hidden performance implications.
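As a rough illustration of why `.peek()` seems doable, here is a minimal sketch of a wrapper that adds it on top of any forward-only reader by keeping a small look-ahead buffer. `PeekableReader` is a hypothetical name, not part of python-zstandard or pyzstd; the sketch only assumes the wrapped object has a `read(size)` method.

```python
import io


class PeekableReader:
    """Hypothetical wrapper adding peek() to a forward-only binary reader."""

    def __init__(self, raw):
        self._raw = raw      # any object with a read(size) method
        self._buf = b""      # bytes already read from raw but not yet consumed

    def _fill(self, size):
        # Pull more data only if the look-ahead buffer is too short.
        if len(self._buf) < size:
            self._buf += self._raw.read(size - len(self._buf))

    def peek(self, size=1):
        # Return up to `size` bytes without advancing the logical position.
        self._fill(size)
        return self._buf[:size]

    def read(self, size=-1):
        if size is None or size < 0:
            data = self._buf + self._raw.read()
            self._buf = b""
            return data
        self._fill(size)
        data, self._buf = self._buf[:size], self._buf[size:]
        return data


# An in-memory stream stands in for a decompression reader here.
reader = PeekableReader(io.BytesIO(b"Hello, world!\n"))
```

Unlike `io.BufferedReader.peek()`, this sketch never returns more than `size` bytes, and it may return fewer if the underlying `read()` comes up short.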

In order to implement reverse seeking, you effectively need to seek to the start of the file and then decompress until you reach the desired offset. Seek is intended to be a constant-time operation. Seek in the presence of decompression is definitely not constant time. That's one of the reasons I didn't implement it.

IMO if someone wants to seek backwards, they can obtain a new file handle and seek forwards. This reinforces that backwards seeks are a performance footgun.
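The "new handle, then seek forwards" approach can be sketched as a small wrapper. This is only an illustration under stated assumptions: `ReopeningReader` and its `opener` argument are hypothetical names, and the wrapper supports absolute offsets only. The explicit reopen makes the cost visible: every backward seek re-reads (and, for a compressed stream, re-decompresses) everything before the target offset.

```python
import io


class ReopeningReader:
    """Hypothetical wrapper: emulate backward seeks by reopening the stream."""

    def __init__(self, opener):
        self._opener = opener        # callable returning a fresh stream at offset 0
        self._stream = opener()
        self._pos = 0
        self.reopen_count = 0        # exposes how often the expensive path was taken

    def read(self, size=-1):
        data = self._stream.read(size)
        self._pos += len(data)
        return data

    def tell(self):
        return self._pos

    def seek(self, offset):
        """Seek to an absolute offset; backward seeks restart from offset 0."""
        if offset < 0:
            raise ValueError("negative seek offset")
        if offset < self._pos:
            # Backward seek: throw the handle away and start over.
            self._stream.close()
            self._stream = self._opener()
            self._pos = 0
            self.reopen_count += 1
        # Forward seek: read and discard until the target offset (or EOF).
        remaining = offset - self._pos
        while remaining > 0:
            chunk = self._stream.read(min(remaining, 1 << 16))
            if not chunk:
                break
            remaining -= len(chunk)
            self._pos += len(chunk)
        return self._pos

    def close(self):
        self._stream.close()


# An in-memory stream stands in for a decompression reader here.
reader = ReopeningReader(lambda: io.BytesIO(b"Hello, world!\n"))
```

With python-zstandard, the opener would presumably be something like `lambda: zstandard.open("file.zst", "rb")`.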

Or am I missing a use case necessitating backwards seeks?

gcflymoto commented 3 months ago

One use case would be a compressed ztail, i.e., grabbing just the last N lines of a file.
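For reference, a tail can also be computed in a single forward pass by keeping only the last N lines in a bounded buffer. The sketch below does this. `tail_lines` is a hypothetical helper, and it assumes nothing about the stream beyond a `read(size)` method. It still has to decompress the whole file, so it avoids backward seeks but not the cost of reading everything.

```python
import io
from collections import deque


def tail_lines(stream, n, chunk_size=1 << 16):
    """Return the last n lines of a forward-only binary stream."""
    last = deque(maxlen=n)   # bounded: older lines fall off automatically
    pending = b""            # trailing partial line carried between chunks
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        pending += chunk
        # Everything before the final b"\n" is complete; the rest stays pending.
        *complete, pending = pending.split(b"\n")
        last.extend(line + b"\n" for line in complete)
    if pending:
        # Final line without a trailing newline.
        last.append(pending)
    return list(last)


# An in-memory stream stands in for a decompression reader here.
sample = io.BytesIO(b"one\ntwo\nthree\nfour\n")
```

With python-zstandard, the stream would presumably come from `zstandard.open("file.zst", "rb")`.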

PlatonB commented 3 months ago

My simplest indexer for huge bioinformatic tables is based on pyzstd's SeekableZstdFile. Will similar functionality be available in python-zstandard?

Rogdham commented 3 months ago

> IMO if someone wants to seek backwards, they can obtain a new file handle and seek forwards. This reinforces that backwards seeks are a performance footgun.

No, you are right: for backwards seeks we have no choice but to decompress the previous data again (at least from the beginning of the closest frame). This is what pyzstd was doing (and what I did in python-xz as well).

I agree that performance-wise it is far from ideal, but from a usability perspective it's really useful. For example, sometimes you just want to open some files from a .tar.zst archive, and not being able to seek prevents you from doing some operations easily (e.g. getting the list of files in the archive before decompressing them in a second pass, reading files out of order, etc.).
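To make the tar case concrete: the standard library's `tarfile` has a stream mode (`"r|"`) that works on non-seekable file objects, so listing members needs no seeking. The sketch below uses a hypothetical helper, `list_members`. What stream mode does not allow is reading members out of order, which is exactly where a seekable file object helps.

```python
import io
import tarfile


def list_members(stream):
    """List member names from a tar stream using forward-only reads."""
    # mode="r|" tells tarfile to treat fileobj as a non-seekable stream.
    with tarfile.open(fileobj=stream, mode="r|") as tar:
        return [member.name for member in tar]


class ForwardOnly:
    """Stand-in for a decompression reader: exposes read() and nothing else."""

    def __init__(self, data):
        self._inner = io.BytesIO(data)

    def read(self, size=-1):
        return self._inner.read(size)


def build_sample_tar():
    """Build a small uncompressed tar archive in memory for the demonstration."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in [("a.txt", b"A"), ("b.txt", b"BB")]:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()
```

With python-zstandard, the stream passed to `list_members` would presumably be `zstandard.open("archive.tar.zst", "rb")`. A second pass over the archive then means opening a new handle.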

My take on the matter is that a disclaimer about performance in the documentation is the way to go about it.

Rogdham commented 3 months ago

Hello all, I have been in contact with Ma Lin, the author of the pyzstd library.

The project has been fully transferred to me, and its new home is at https://github.com/Rogdham/pyzstd.

I have just released a new version shipping some (previously unreleased) changes from Ma Lin and updating the URLs.

As a result, this issue can be closed because the pyzstd library is not dead anymore :tada: