piskvorky / smart_open

Utils for streaming large files (S3, HDFS, gzip, bz2...)
MIT License

http module - incorrect reading of gzip-compressed stream #713

Open grubberr opened 2 years ago

grubberr commented 2 years ago

Hello,


import smart_open

url = "https://fonts.googleapis.com/css?family=Montserrat"
headers = {"Accept-encoding": "deflate, gzip"}

# Case 1: a single unbounded read()
result = smart_open.open(url, transport_params={"headers": headers}, mode="rb")
buff = result.read()
print(len(buff))

# Case 2: a bounded read(2) followed by an unbounded read()
result = smart_open.open(url, transport_params={"headers": headers}, mode="rb")
buff = result.read(2)
buff += result.read()
print(len(buff))

196
209

196 bytes - the gzip-compressed result
209 bytes - the uncompressed result

This happens because in the first case the library calls self.response.raw.read(), which returns the body byte-for-byte as the server sent it, i.e. still gzip-compressed. In the second case the library uses self.response.iter_content, whose output has already been decompressed by the requests library.
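
The same difference can be reproduced with requests alone; here is a minimal sketch, independent of smart_open, assuming the server honors the Accept-Encoding header for this URL:

import requests

url = "https://fonts.googleapis.com/css?family=Montserrat"
headers = {"Accept-encoding": "deflate, gzip"}

# raw.read() returns the body byte-for-byte as the server sent it,
# i.e. still gzip-compressed
resp = requests.get(url, headers=headers, stream=True)
compressed = resp.raw.read()

# iter_content() transparently decodes the Content-Encoding
resp = requests.get(url, headers=headers, stream=True)
decompressed = b"".join(resp.iter_content(chunk_size=1024))

print(len(compressed), len(decompressed))  # e.g. 196 209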

Versions

print(platform.platform())
Linux-5.14.0-1047-oem-x86_64-with-glibc2.31
print("Python", sys.version)
Python 3.9.11 (main, Aug  9 2022, 09:22:28) 
[GCC 9.4.0]
print("smart_open", smart_open.__version__)
smart_open 6.0.0


mpenkov commented 2 years ago

What is the desired behavior here?

grubberr commented 2 years ago

It's a good question, really. I was just pointing out the inconsistency.

theogaraj commented 10 months ago

Came across this while trying to solve a problem using smart_open to read from a range of different URLs.
My code:

import smart_open as so

# source, destination, and HEADERS are defined elsewhere in the script
with (
    so.open(source, 'rb', transport_params={'headers': HEADERS}) as fin,
    so.open(destination, 'wb') as fout
):
    fout.write(fin.read())

I observed that for some URLs I got a meaningful output file, while in other cases it was just gibberish. Comparing the successes and failures, I determined that the failing URLs were the ones with Content-Encoding: gzip in the response headers.
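
One quick way to check whether a given URL is affected is to look at its response headers directly; a sketch, with HEADERS standing in for whatever request headers you already send:

import requests

HEADERS = {"Accept-Encoding": "gzip, deflate"}  # placeholder for your own headers

resp = requests.get('https://www.BCBSIL.com/aca-json/il/index_il.json',
                    headers=HEADERS, stream=True)
# the URLs that produced gibberish were the ones reporting gzip here
print(resp.headers.get('Content-Encoding'))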

@grubberr your issue helped pinpoint what was going on; changing my code to the following now works for all URLs:

with (
    so.open(source, 'rb', transport_params={'headers': HEADERS}) as fin,
    so.open(destination, 'wb') as fout
):
    # bounded reads go through requests' iter_content, which decompresses
    while True:
        chunk = fin.read(1024)
        if not chunk:
            break

        fout.write(chunk)
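
If you would rather not hand-roll the loop, shutil.copyfileobj performs the same kind of bounded reads, so it should take the same code path inside smart_open; an untested sketch with the same placeholder names as above:

import shutil
import smart_open as so

with (
    so.open(source, 'rb', transport_params={'headers': HEADERS}) as fin,
    so.open(destination, 'wb') as fout
):
    # copyfileobj calls fin.read(length) in a loop, like the explicit loop above
    shutil.copyfileobj(fin, fout, length=1024)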

I understand smart_open uses the file extension to determine compression. My failing URL is 'https://www.BCBSIL.com/aca-json/il/index_il.json', so I guess smart_open can't tell that it needs gzip to decompress. I tried passing compression='.gz' when opening the file, but it gave me the following error:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Users\theog\AppData\Local\Programs\Python\Python39\lib\gzip.py", line 300, in read
    return self._buffer.read(size)
  File "C:\Users\theog\AppData\Local\Programs\Python\Python39\lib\gzip.py", line 487, in read
    if not self._read_gzip_header():
  File "C:\Users\theog\AppData\Local\Programs\Python\Python39\lib\gzip.py", line 435, in _read_gzip_header
    raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b'{\n')

This really puzzled me for a while, but @grubberr's explanation of result.read() vs result.read(2) explains it. The gzip module reads in chunks (see the 4th line of the stack trace), so even though the original content is compressed, gzip receives content that requests has already decompressed, which causes it to raise the error.
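
A quick way to confirm this (a sketch reusing the HEADERS placeholder from above): every gzip stream starts with the magic bytes 0x1f 0x8b, but a bounded read through smart_open returns bytes that requests has already decompressed:

import smart_open as so

with so.open('https://www.BCBSIL.com/aca-json/il/index_il.json', 'rb',
             transport_params={'headers': HEADERS}) as fin:
    magic = fin.read(2)

print(magic)  # b'{\n' here - the start of the JSON, not the gzip magic b'\x1f\x8b'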

What is the desired behavior here?

Now that I know what the issue is and how to work around it, this is by no means a showstopper. I do want to say that smart_open has really made my life much simpler; I appreciate all the work that has gone into this library!