theelous3 / asks

Async requests-like httplib for python.
MIT License

Streaming / CallBack does not uncompress first?? #95

Closed bradwood closed 5 years ago

bradwood commented 5 years ago

Hi @theelous3

I suspect I'm getting output from this that is still compressed... Is this a bug or a feature?

My code (copied from your docs, pretty much verbatim):

        async def chunk_processor(bytechunk):
            async with await trio.open_file(newfile, 'ab') as output_file:
                await output_file.write(bytechunk)
                LOGGER.debug(f'Wrote file chunk size = {len(bytechunk)}')

        resp = await asks.get(str(self._url), callback=chunk_processor)

bradwood commented 5 years ago

Update: I've validated that this is indeed the behaviour. If I run cat file | gunzip, I get the data back with no problem. Same with stream=True.

theelous3 commented 5 years ago

Hi. The callback is just to allow you to do whatever you like with the raw bytes coming in. There is no default callback function, so really the "default behavior" is to do literally nothing at all!

On your question of whether on-the-fly decompression is even possible, the answer is "sort of". It's possible in the way the stream argument does it. You can reimplement that in a callback if you want, or just use the stream arg.

Just to be clear, a callback is a function to be run on a certain event. You supply the function. The one given in the example does nothing more than write whatever comes in to file.

bradwood commented 5 years ago

OK, I think I follow you. How can I tell whether the payload is gzipped before attempting to unzip it? And, as a side question, do you not think it would be cleaner if the method returned the unzipped content by default, as the other invocation does?

theelous3 commented 5 years ago

Ok, lemme give you a silly example.

async def totally_useless_callback(bytechunk):
    # The chunk is ignored on purpose; this only prints on every read.
    print('NOM NOM NOM')

If you pass this as a callback, it will totally work, but just print NOM NOM NOM every time we read in bytes and do nothing useful. You can pass whatever function you want as long as it takes at least one argument.

The callback argument is to allow people to do whatever crazy shit they'd like, without restriction. It can be as useless or as useful as you make it.

Even if I did want to enforce something like decompression, I couldn't, as I don't control what functions people pass as a callback. You are probably best served by the stream param, as it's what you want in like 99.99999999% of use cases :)

You can tell whether the payload is compressed by checking the response headers for content-encoding:

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Encoding
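To make the header check concrete, here is a minimal sketch. It assumes the response headers are available as a plain dict with lowercase keys; is_gzip_encoded is a made-up helper name, not part of asks.

```python
def is_gzip_encoded(headers):
    """Return True if a headers mapping declares a gzip content encoding.

    Assumption: `headers` is a dict-like object with lowercase keys.
    """
    encoding = headers.get('content-encoding', '')
    # Content-Encoding may list several codings, e.g. "deflate, gzip".
    codings = [part.strip().lower() for part in encoding.split(',')]
    return 'gzip' in codings or 'x-gzip' in codings
```

Something like is_gzip_encoded(resp.headers) could then decide whether the bytes need gunzipping before use.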

bradwood commented 5 years ago

Hrm... So I tried the stream option like so:

        resp = await asks.get(str(self._url), stream=True)
        async with await trio.open_file(newfile, 'ab') as output_file:
            async with resp.body:
                async for bytechunk in resp.body:
                    await output_file.write(bytechunk)

Still seems to be barfing on the gzip?

Traceback (most recent call last):
  File "src/pyskyq/examples/cli_epg.py", line 70, in <module>
    trio.run(main, sys.argv[1:])
  File "/Users/brad/.virtualenvs/pyskyq-4vSEKDfZ/lib/python3.7/site-packages/trio/_core/_run.py", line 1337, in run
    raise runner.main_task_outcome.error
  File "src/pyskyq/examples/cli_epg.py", line 49, in main
    nursery.start_soon(all_72_hour.fetch)
  File "/Users/brad/.virtualenvs/pyskyq-4vSEKDfZ/lib/python3.7/site-packages/trio/_core/_run.py", line 397, in __aexit__
    raise combined_error_from_nursery
  File "/Users/brad/Code/pyskyq/src/pyskyq/xmltvlisting.py", line 189, in fetch
    async for bytechunk in resp.body:
  File "/Users/brad/.virtualenvs/pyskyq-4vSEKDfZ/lib/python3.7/site-packages/async_generator/_impl.py", line 366, in step
    return await ANextIter(self._it, start_fn, *args)
  File "/Users/brad/.virtualenvs/pyskyq-4vSEKDfZ/lib/python3.7/site-packages/async_generator/_impl.py", line 197, in __next__
    return self._invoke(first_fn, *first_args)
  File "/Users/brad/.virtualenvs/pyskyq-4vSEKDfZ/lib/python3.7/site-packages/async_generator/_impl.py", line 209, in _invoke
    result = fn(*args)
  File "/Users/brad/.virtualenvs/pyskyq-4vSEKDfZ/lib/python3.7/site-packages/asks/response_objects.py", line 130, in __aiter__
    event.data = decompressor.send(event.data)
  File "/Users/brad/.virtualenvs/pyskyq-4vSEKDfZ/lib/python3.7/site-packages/asks/http_utils.py", line 36, in decompress
    data = _compression_mapping[compression](data)
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/gzip.py", line 532, in decompress
    return f.read()
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/gzip.py", line 276, in read
    return self._buffer.read(size)
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/gzip.py", line 482, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached

Any further thoughts?

I could certainly write some code to uncompress this myself via the callback mechanism if needed, but if stream=True is meant to decode on the fly, then this is a bug, no?
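For reference, a rough sketch of what decompressing in a callback could look like. It uses the standard library's zlib.decompressobj, which keeps state between chunks, and assumes the callback receives the raw gzip bytes as described earlier in the thread. make_gunzip_callback and sink are hypothetical names, and this is not how asks itself does it.

```python
import zlib


def make_gunzip_callback(sink):
    """Build an async callback that gunzips raw chunks as they arrive.

    `sink` is a hypothetical async callable that receives decompressed bytes.
    """
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)

    async def callback(bytechunk):
        # The decompressor remembers earlier chunks, so partial input is fine.
        data = decompressor.decompress(bytechunk)
        if data:
            await sink(data)

    return callback
```

With a file already opened via trio.open_file, passing callback=make_gunzip_callback(output_file.write) to asks.get should write decompressed data, assuming the server really sent gzip.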

Cheers

Brad

theelous3 commented 5 years ago

Hm. Might be a bug. Can you show the output of print(resp.headers)?

bradwood commented 5 years ago

The headers are on line 2 of the dump below.

[2018-11-01 21:05:27,016] DEBUG:pyskyq.xmltvlisting:Fetch(<XMLTVListing: url='http://www.xmltv.co.uk/feed/6715', path='.epg_data', filename='42a4b30993795c4efc92cdc93d5c10d5e5968baa255a8d85d8cee691b7319cbf.xml'>) call started.
{'server': 'nginx/1.11.10', 'date': 'Thu, 01 Nov 2018 21:05:27 GMT', 'content-type': 'application/xml', 'transfer-encoding': 'chunked', 'connection': 'keep-alive', 'last-modified': 'Thu, 01 Nov 2018 02:35:57 GMT', 'etag': '"112fe98-57991474d8431-gzip"', 'accept-ranges': 'bytes', 'vary': 'Accept-Encoding', 'content-encoding': 'gzip'}
Traceback (most recent call last):
  File "src/pyskyq/examples/cli_epg.py", line 70, in <module>
    trio.run(main, sys.argv[1:])
  File "/Users/brad/.virtualenvs/pyskyq-4vSEKDfZ/lib/python3.7/site-packages/trio/_core/_run.py", line 1337, in run
    raise runner.main_task_outcome.error
  File "src/pyskyq/examples/cli_epg.py", line 49, in main
    nursery.start_soon(all_72_hour.fetch)
  File "/Users/brad/.virtualenvs/pyskyq-4vSEKDfZ/lib/python3.7/site-packages/trio/_core/_run.py", line 397, in __aexit__
    raise combined_error_from_nursery
  File "/Users/brad/Code/pyskyq/src/pyskyq/xmltvlisting.py", line 219, in fetch
    async for bytechunk in resp.body:
  File "/Users/brad/.virtualenvs/pyskyq-4vSEKDfZ/lib/python3.7/site-packages/async_generator/_impl.py", line 366, in step
    return await ANextIter(self._it, start_fn, *args)
  File "/Users/brad/.virtualenvs/pyskyq-4vSEKDfZ/lib/python3.7/site-packages/async_generator/_impl.py", line 197, in __next__
    return self._invoke(first_fn, *first_args)
  File "/Users/brad/.virtualenvs/pyskyq-4vSEKDfZ/lib/python3.7/site-packages/async_generator/_impl.py", line 209, in _invoke
    result = fn(*args)
  File "/Users/brad/.virtualenvs/pyskyq-4vSEKDfZ/lib/python3.7/site-packages/asks/response_objects.py", line 130, in __aiter__
    event.data = decompressor.send(event.data)
  File "/Users/brad/.virtualenvs/pyskyq-4vSEKDfZ/lib/python3.7/site-packages/asks/http_utils.py", line 36, in decompress
    data = _compression_mapping[compression](data)
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/gzip.py", line 532, in decompress
    return f.read()
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/gzip.py", line 276, in read
    return self._buffer.read(size)
  File "/usr/local/Cellar/python/3.7.0/Frameworks/Python.framework/Versions/3.7/lib/python3.7/gzip.py", line 482, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached

bradwood commented 5 years ago

I did this, it's ugly but it works: https://gitlab.com/bradwood/pyskyq/blob/master/src/pyskyq/xmltvlisting.py#L189

theelous3 commented 5 years ago

Nicely caught. I've opened a new issue for this.
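The traceback above shows gzip.decompress being called on a single chunk of the body. A small standard-library-only reproduction of that failure mode, independent of asks and with made-up sample data, might look like this:

```python
import gzip

original = b'<tv>' + b'x' * 10000 + b'</tv>'
payload = gzip.compress(original)

# Simulate receiving only the first part of a chunked response body.
first_chunk = payload[:len(payload) // 2]

# One-shot decompression of a partial gzip stream raises EOFError.
try:
    gzip.decompress(first_chunk)
    one_shot_error = None
except EOFError as exc:
    one_shot_error = exc

# The complete payload decompresses without trouble.
whole = gzip.decompress(payload)
```

This suggests the data itself is fine and the problem is decompressing each chunk in isolation, though the actual fix belongs in the new issue.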