svisser opened 1 year ago
What we do in xopen is simply put all the GzipFile stuff in an io.BufferedWriter and return that to the user. It should be a one- or two-line fix for the smart_open people.
I'd like to plug xopen here: it has a much narrower focus than smart_open, but as a result it is very fast, especially for any gzip-related work. I notice that the smart_open library does not use any accelerated libraries such as python-isal; those can have a great impact on performance.
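For context, a minimal sketch of that BufferedWriter wrapping (the helper name and buffer size here are illustrative, not xopen's actual code):

```python
# Coalesce many small writes in an io.BufferedWriter before they reach
# GzipFile.write(). The buffer size is an arbitrary example value.
import gzip
import io

def buffered_gzip_writer(fileobj, buffer_size=128 * 1024):
    """Wrap an already-open binary file object in a buffered gzip writer."""
    gz = gzip.GzipFile(fileobj=fileobj, mode="wb")
    # With python-isal installed, isal.igzip.IGzipFile is an accelerated,
    # gzip-compatible substitute for gzip.GzipFile here.
    return io.BufferedWriter(gz, buffer_size=buffer_size)
```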
@svisser Are you interested in making a PR to implement @rhpvorderman's suggestion?
Is smart_open core the right place for such buffering? IMO that should either go directly into Python (as your CPython ticket suggests) or live in user-land as a custom registered extension handler (as you say).
But I can also see how that's a pretty common use case, and smart_open does focus on performance, so it's not a bad fit. I'm on the fence here. How far do we want to take this? Implement buffered / parallelized .bz2 and .xz too? Support custom random-access seeking via igzip? Then we may end up maintaining a lot of compatibility / third-party API stuff. That's not an issue with just buffered writes to .gz; I'm just thinking ahead about the core mission and scope.
I think just do the bare minimum to address the issue: add buffering to gzip as an edge case. Once the CPython guys fix this on their side we can remove the buffering on our side.
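In the meantime, a user-land workaround along these lines (a sketch built on smart_open's register_compressor hook; the handler name and buffer size are illustrative) gives the buffered behaviour without touching smart_open core:

```python
# Sketch: override the built-in '.gz' handler with one that buffers writes.
import gzip
import io

from smart_open import register_compressor

def _buffered_gz_handler(file_obj, mode):
    gz = gzip.GzipFile(fileobj=file_obj, mode=mode)
    if "w" in mode or "a" in mode:
        # Coalesce small writes before they reach GzipFile.write().
        return io.BufferedWriter(gz, buffer_size=128 * 1024)
    return io.BufferedReader(gz)

register_compressor(".gz", _buffered_gz_handler)
```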
@mpenkov The lines in xopen are here https://github.com/pycompression/xopen/blob/cc75cbcd17130a4c004cf44c46121da51be582d8/src/xopen/__init__.py#LL1297C1-L1309C23
This fix could be really important for my similar use case, streaming large gzipped tar files from an HTTPS endpoint to an S3 bucket. The throughput I've been seeing with smart_open (6.4.0 with Python 3.10) is inadequate, taking around 130 seconds per ~50MB part, meaning over 8 hours to transfer an 11GB file!
Previously we'd streamed files to temporary storage before performing a multipart upload to S3. This was very quick, given sufficient disk space, as there is a lot less overhead than with the current smart_open implementation. Tackling a new larger dataset prompted the move to chaining the streams as the cloud service has disk space constraints that are not sufficient for the largest files.
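For reference, the chained-stream setup looks roughly like this (URLs, bucket, and chunk size are placeholders):

```python
import shutil

import smart_open

SRC = "https://example.com/data/archive.tar.gz"  # placeholder endpoint
DST = "s3://my-bucket/archive.tar.gz"            # placeholder bucket/key

# Stream straight from the HTTPS endpoint into a multipart S3 upload,
# without staging the file on local disk. By default smart_open decompresses
# on read and recompresses on write based on the .gz extension; recent
# versions accept compression="disable" to copy the bytes verbatim instead.
with smart_open.open(SRC, "rb") as fin, smart_open.open(DST, "wb") as fout:
    shutil.copyfileobj(fin, fout, length=16 * 1024 * 1024)
```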
Problem description
When writing a .csv.gz file to S3, performance was slow as it's calling GzipFile.write() for every single line in the CSV file. This is not the issue itself, but there is also a lack of buffering in GzipFile.write() (see: https://github.com/python/cpython/issues/89550), and I was able to improve performance by implementing a solution similar to https://github.com/python/cpython/pull/101251 (i.e., register a custom compression handler for the '.gz' file extension in which GzipFile.write does have buffering).

I'm opening this issue for the smart_open library to discuss:
whether it's worthwhile to add buffering around GzipFile.write() for Python versions that don't yet have the above fix?

Steps/code to reproduce the problem
Use smart_open.open in write mode for a .csv.gz file to S3:
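A minimal reproduction sketch (bucket name and data are placeholders):

```python
import csv

import smart_open

# Placeholder bucket/key and synthetic rows; any .csv.gz destination on S3
# reproduces the behaviour.
rows = [[i, f"value-{i}"] for i in range(1_000_000)]

with smart_open.open("s3://my-bucket/data.csv.gz", "w") as fout:
    writer = csv.writer(fout)
    writer.writerows(rows)
```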
(writerows, in C, calls writerow for each line, which in turn calls GzipFile.write() for each line.)

Versions