I know of a feed hosted on a server that always returns gzipped data even when given no `Accept-Encoding` header, and also when given a header such as `Accept-Encoding: identity`, which should disable gzip compression. This can be simulated using:
```python
#!/usr/bin/env python3
import gzip
from http.server import BaseHTTPRequestHandler, HTTPServer
from io import BytesIO
import requests
from socketserver import ThreadingMixIn

response = requests.get("https://news.ycombinator.com/rss")
rss_content = response.content
rss_content_type = response.headers.get('Content-Type', 'text/html; charset=utf-8')

# Define request handler class
class SimpleHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/normal':
            self.handle_normal()
        elif self.path == '/gzip':
            self.handle_gzip()
        else:
            self.send_response(404)
            self.end_headers()
            self.wfile.write(b'Not Found')

    def handle_normal(self):
        self.send_response(200)
        self.send_header('Content-Type', rss_content_type)
        self.send_header('Content-Length', str(len(rss_content)))
        self.end_headers()
        self.wfile.write(rss_content)

    def handle_gzip(self):
        # Gzip the content
        buffer = BytesIO()
        with gzip.GzipFile(fileobj=buffer, mode='wb') as gz_file:
            gz_file.write(rss_content)
        gzipped_content = buffer.getvalue()

        # Send gzipped response
        self.send_response(200)
        self.send_header('Content-Type', rss_content_type)
        self.send_header('Content-Encoding', 'gzip')
        self.send_header('Content-Length', str(len(gzipped_content)))
        self.end_headers()
        self.wfile.write(gzipped_content)

# Define a threading mixin to handle multiple requests simultaneously
class ThreadedHTTPServer(ThreadingMixIn, HTTPServer):
    pass

# Define the server entry point
def run(server_class=ThreadedHTTPServer, handler_class=SimpleHandler, port=4444):
    server_address = ('', port)
    httpd = server_class(server_address, handler_class)
    print(f'Starting server on port {port}...')
    httpd.serve_forever()

if __name__ == '__main__':
    run()
```
(Hacker News isn't the guilty server; I'm just using its RSS data as a proof of concept.)
As a demo, `curl` will emit the gzipped data unless given `--compressed` (which appears to cause an `Accept-Encoding` header to be sent, and auto-decompresses the response):
```
% curl http://127.0.0.1:4444/gzip
Warning: Binary output can mess up your terminal. Use "--output -" to tell
Warning: curl to output it to your terminal anyway, or consider "--output
Warning: <FILE>" to save to a file.
% curl -s http://127.0.0.1:4444/gzip | file -
/dev/stdin: gzip compressed data, last modified: Thu Oct 10 21:05:21 2024, max compression
% curl http://127.0.0.1:4444/gzip --compressed
<rss version="2.0"> [... SNIP ...]
```
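But `fetch()` will auto-decompress the data. As a minimal sketch, assuming Node 18+ (which ships a global `fetch()`) and the demo server above:

```js
// Node 18+'s global fetch() transparently decodes the body when the response
// carries Content-Encoding: gzip (top-level await works in the node REPL).
const res = await fetch("http://127.0.0.1:4444/gzip");
const body = await res.text(); // already decompressed
console.log(body.slice(0, 19)); // '<rss version="2.0">'
```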
Browsing to the URL in a browser will also auto-decompress the data.
And so it seems as though user agents should automatically decompress responses if the `Content-Encoding` response header is sent. `curl` is a bit of an outlier though, in that it doesn't do it by default.
As a server, it appears to be legal to return content using gzip (or any `Content-Encoding`) if the client does not give an `Accept-Encoding` header. RFC 7231 (HTTP/1.1) §5.3.4 says:
> A request without an Accept-Encoding header field implies that the user agent has no preferences regarding content-codings. Although this allows the server to use any content-coding in a response, it does not imply that the user agent will be able to correctly process all encodings.
It may not be legal to return data that contradicts an `Accept-Encoding` header if one is given. For example, if the client sends `Accept-Encoding: identity` or `Accept-Encoding: identity, *;q=0`, then it may not be legal to use `gzip`. And so the server that's giving me grief might not be compliant. Regardless, `rss-parser` doesn't send `Accept-Encoding` in my testing.
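For comparison, here's a sketch of what a client that neither sends `Accept-Encoding` nor decodes responses sees against the demo server. Node's plain `http` module behaves this way (and is presumably close to what `rss-parser` does); the `0x1f 0x8b` pair is the gzip magic number:

```js
const http = require("http");

// Raw request: Node's http module sends no Accept-Encoding header and does
// not decode the body, so the server's bytes arrive verbatim.
http.get("http://127.0.0.1:4444/gzip", (res) => {
  const chunks = [];
  res.on("data", (chunk) => chunks.push(chunk));
  res.on("end", () => {
    const body = Buffer.concat(chunks);
    console.log(res.headers["content-encoding"]);      // 'gzip'
    console.log(body[0] === 0x1f && body[1] === 0x8b); // true: gzip magic bytes
  });
});
```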
`rss-parser`'s `parseURL()` breaks for me when a server sends gzip-compressed data:
```
> (await parser.parseURL("http://172.18.0.1:4444/normal")).title;
'Hacker News'
> (await parser.parseURL("http://172.18.0.1:4444/gzip")).title;
Uncaught Error: Non-whitespace before first tag.
Line: 0
Column: 1
Char:
    at error (/tmp/tmp.H1iAsG3355/node_modules/sax/lib/sax.js:658:10)
    at strictFail (/tmp/tmp.H1iAsG3355/node_modules/sax/lib/sax.js:684:7)
    at beginWhiteSpace (/tmp/tmp.H1iAsG3355/node_modules/sax/lib/sax.js:958:7)
    at SAXParser.write (/tmp/tmp.H1iAsG3355/node_modules/sax/lib/sax.js:1013:11)
    at exports.Parser.Parser.parseString (/tmp/tmp.H1iAsG3355/node_modules/xml2js/lib/parser.js:327:31)
    at Parser.parseString (/tmp/tmp.H1iAsG3355/node_modules/xml2js/lib/parser.js:5:59)
    at /tmp/tmp.H1iAsG3355/node_modules/rss-parser/lib/parser.js:33:22
    at new Promise (<anonymous>)
    at Parser.parseString (/tmp/tmp.H1iAsG3355/node_modules/rss-parser/lib/parser.js:32:16)
```
I think it's trying to parse the gzip-compressed data as rss+xml and is failing.
I think `rss-parser` should either:

1. Send `Accept-Encoding: identity` or `Accept-Encoding: identity, *;q=0` to discourage the server from sending compressed responses. This won't help me, given that I'm dealing with a server that always sends gzip-compressed data regardless of the header, but it'd be the easiest way to be more compliant with what is legal: servers may send responses using any encoding if none was requested.
2. Handle responses using `Content-Encoding` similarly to how `fetch()` does; that is, decompress compressed responses before parsing them as rss+xml (see the sketch after this list). This would cover the case in which servers do not honour an `Accept-Encoding` that discourages anything but an uncompressed response.

The latter change would work better for me, and may work better for others, depending on the prevalence of servers that always compress responses regardless of a client's `Accept-Encoding`.
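As a rough sketch of option 2 (not `rss-parser`'s actual code; the helper name is hypothetical), Node's built-in `zlib` covers the common codings:

```js
const zlib = require("zlib");

// Hypothetical helper: pick a decoder based on the response's
// Content-Encoding header and return the decompressed body.
function decodeBody(res, rawBody) {
  const encoding = (res.headers["content-encoding"] || "identity").toLowerCase();
  switch (encoding) {
    case "gzip":
    case "x-gzip":
      return zlib.gunzipSync(rawBody);
    case "deflate":
      return zlib.inflateSync(rawBody);
    case "br":
      return zlib.brotliDecompressSync(rawBody);
    default:
      return rawBody; // 'identity' or unrecognised: pass through unchanged
  }
}
```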