Open grubberr opened 2 years ago
What is the desired behavior here?
Honestly, it's a good question; I was just pointing out the inconsistency.
Came across this while trying to solve a problem using smart_open to read from a range of different URLs.
My code:
import smart_open as so  # assumed: `so` is an alias for the smart_open module

with (
    so.open(source, 'rb', transport_params={'headers': HEADERS}) as fin,
    so.open(destination, 'wb') as fout
):
    fout.write(fin.read())
I observed that for some URLs I got a meaningful output file, while for others the output was just gibberish. Comparing the successes and failures, I determined that the failing URLs were those with Content-Encoding: gzip in the response headers.
@grubberr your issue helped pinpoint what was going on; changing my code to the following now works for all URLs:
with (
    so.open(source, 'rb', transport_params={'headers': HEADERS}) as fin,
    so.open(destination, 'wb') as fout
):
    while True:
        chunk = fin.read(1024)
        if not chunk:
            break
        fout.write(chunk)
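The chunked loop above is the same pattern that the standard library's shutil.copyfileobj implements: it calls read(length) on the source until it gets an empty result. A minimal sketch, using in-memory io.BytesIO objects as stand-ins for the smart_open file objects (copy_in_chunks is a hypothetical helper name, not part of smart_open):

```python
import io
import shutil


def copy_in_chunks(fin, fout, chunk_size=1024):
    """Copy fin to fout, always reading with an explicit size argument."""
    # shutil.copyfileobj calls fin.read(chunk_size) repeatedly until it
    # returns an empty bytes object, exactly like the manual while-loop.
    shutil.copyfileobj(fin, fout, chunk_size)


# Stand-ins for the objects returned by so.open(...); assumed for illustration.
src = io.BytesIO(b"x" * 5000)
dst = io.BytesIO()
copy_in_chunks(src, dst)
```

Whether this avoids the problem with smart_open depends on the sized read(n) path behaving as described in this thread; the sketch only shows the copy pattern itself.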
I understand smart_open uses the extension to determine compression. My failing URL is 'https://www.BCBSIL.com/aca-json/il/index_il.json', so I guess smart_open can't tell that it should use gzip to decompress. I tried using compression='.gz' when opening the file, but it gave me the following error.
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Users\theog\AppData\Local\Programs\Python\Python39\lib\gzip.py", line 300, in read
    return self._buffer.read(size)
  File "C:\Users\theog\AppData\Local\Programs\Python\Python39\lib\gzip.py", line 487, in read
    if not self._read_gzip_header():
  File "C:\Users\theog\AppData\Local\Programs\Python\Python39\lib\gzip.py", line 435, in _read_gzip_header
    raise BadGzipFile('Not a gzipped file (%r)' % magic)
gzip.BadGzipFile: Not a gzipped file (b'{\n')
This really puzzled me for a while, but @grubberr's explanation of result.read() vs result.read(2) helps explain it. It looks like gzip reads in chunks (4th line of the stack trace), so even though the original content is compressed, gzip receives content that requests has already uncompressed, which causes it to raise an error.
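The error itself is easy to reproduce with the standard library alone: hand gzip some bytes that are already plain text and it rejects them when it fails to find the gzip header. A minimal sketch, where the JSON payload and the try_gunzip helper are hypothetical and only there for illustration:

```python
import gzip
import io

# Hypothetical payload standing in for an already-decompressed JSON body.
plain = b'{\n  "example": true\n}'


def try_gunzip(data):
    """Return the decompressed bytes, or the BadGzipFile exception raised."""
    try:
        with gzip.GzipFile(fileobj=io.BytesIO(data)) as gz:
            return gz.read()
    except gzip.BadGzipFile as exc:
        return exc


result = try_gunzip(plain)                    # plain text: header check fails
round_trip = try_gunzip(gzip.compress(plain)) # real gzip data: works
```

Here result is a gzip.BadGzipFile instance, while round_trip equals the original payload.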
What is the desired behavior here?
The same for f.read() as it is for f.read(n).
Now that I know what the issue is and how to work around it, this is by no means a showstopper. I do want to say that smart_open has really made my life much simpler; I appreciate all the work that has gone into this library!
Hello,
196 bytes - gzip compressed result
209 bytes - uncompressed result
This happens because:
in the 1st case the library uses self.response.raw.read() - it returns the result as is from the server, so it is gzip compressed
in the 2nd case the library uses self.response.iter_content - the result is uncompressed by the requests library
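One way for calling code to get the same bytes regardless of which of the two paths was taken is to check for the gzip magic number (the bytes 0x1f 0x8b) and decompress only when it is present. This is a caller-side sketch using only the standard library, not a description of how smart_open handles it; ensure_decompressed and the sample body are hypothetical names and data:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def ensure_decompressed(data: bytes) -> bytes:
    """Decompress data if it looks like gzip; otherwise return it unchanged."""
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)
    return data


body = b'{"hello": "world"}'          # hypothetical response body
raw_path = gzip.compress(body)        # what an undecoded raw read would return
decoded_path = body                   # what an already-decoded read would return

same = ensure_decompressed(raw_path) == ensure_decompressed(decoded_path)
```

The obvious caveat is that this also decompresses a resource that is itself a .gz file, so it only fits cases where the decoded content is always what you want.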