rbren / rss-parser

A lightweight RSS parser, for Node and the browser
MIT License

parseURL doesn't send Accept-Encoding and fails to handle alternate Content-Encoding (e.g. gzip) #280

Open justinsteven opened 1 month ago

justinsteven commented 1 month ago

I know of a feed hosted on a server that always returns gzipped data, even when given no Accept-Encoding header, and even when given a header such as Accept-Encoding: identity, which should disable gzip compression:

% curl -s https://server/ | file -
/dev/stdin: gzip compressed data

% curl -s https://server/ -H 'Accept-Encoding: identity' | file -
/dev/stdin: gzip compressed data

% curl -s https://server/ -H 'Accept-Encoding: identity, *;q=0' | file -
/dev/stdin: gzip compressed data

This can be simulated using:

#!/usr/bin/env python3
import gzip
from http.server import BaseHTTPRequestHandler, HTTPServer
from io import BytesIO
import requests
from socketserver import ThreadingMixIn

response = requests.get("https://news.ycombinator.com/rss")
rss_content = response.content
rss_content_type = response.headers.get('Content-Type', 'text/html; charset=utf-8')

# Define request handler class
class SimpleHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/normal':
            self.handle_normal()
        elif self.path == '/gzip':
            self.handle_gzip()
        else:
            self.send_response(404)
            self.end_headers()
            self.wfile.write(b'Not Found')

    def handle_normal(self):
        self.send_response(200)
        self.send_header('Content-Type', rss_content_type)
        self.send_header('Content-Length', str(len(rss_content)))
        self.end_headers()
        self.wfile.write(rss_content)

    def handle_gzip(self):
        # Gzip the content
        buffer = BytesIO()
        with gzip.GzipFile(fileobj=buffer, mode='wb') as gz_file:
            gz_file.write(rss_content)
        gzipped_content = buffer.getvalue()

        # Send gzipped response
        self.send_response(200)
        self.send_header('Content-Type', rss_content_type)
        self.send_header('Content-Encoding', 'gzip')
        self.send_header('Content-Length', str(len(gzipped_content)))
        self.end_headers()
        self.wfile.write(gzipped_content)

# Define a threading mixin to handle multiple requests simultaneously
class ThreadedHTTPServer(ThreadingMixIn, HTTPServer):
    pass

# Define the server entry point
def run(server_class=ThreadedHTTPServer, handler_class=SimpleHandler, port=4444):
    server_address = ('', port)
    httpd = server_class(server_address, handler_class)
    print(f'Starting server on port {port}...')
    httpd.serve_forever()

if __name__ == '__main__':
    run()

(Hacker News isn't the guilty server; I'm just using its RSS data as a proof of concept.)

As a demo, curl emits the raw gzipped data unless given --compressed (which causes curl to send an Accept-Encoding header and to auto-decompress the response):

% curl http://127.0.0.1:4444/gzip
Warning: Binary output can mess up your terminal. Use "--output -" to tell
Warning: curl to output it to your terminal anyway, or consider "--output
Warning: <FILE>" to save to a file.

% curl -s http://127.0.0.1:4444/gzip | file -
/dev/stdin: gzip compressed data, last modified: Thu Oct 10 21:05:21 2024, max compression

% curl http://127.0.0.1:4444/gzip --compressed
<rss version="2.0"> [... SNIP ...]

But fetch() will auto-decompress the data:

> await (await fetch("http://172.18.0.1:4444/gzip")).text()
'<rss version="2.0"> [... SNIP ...]

Browsing to the URL in a browser will also auto-decompress the data.

So it seems user agents are expected to automatically decompress responses when a Content-Encoding response header is present. curl is a bit of an outlier in that it doesn't do so by default.

On the server side, it appears to be legal to return content using gzip (or any Content-Encoding) if the client does not send an Accept-Encoding header. RFC 7231 (HTTP/1.1) §5.3.4 says:

A request without an Accept-Encoding header field implies that the user agent has no preferences regarding content-codings. Although this allows the server to use any content-coding in a response, it does not imply that the user agent will be able to correctly process all encodings.

It may not be legal, however, to return data that contradicts an Accept-Encoding header when one is given. For example, if the client sends Accept-Encoding: identity or Accept-Encoding: identity, *;q=0, the server arguably must not use gzip, so the server that's giving me grief might not be compliant. Regardless, rss-parser doesn't send Accept-Encoding at all in my testing.
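To make the RFC's rules concrete, here is a rough, hypothetical helper (my own reading of §5.3.4, not anything rss-parser ships) that decides whether gzip is an acceptable coding for a given Accept-Encoding request header:

```javascript
// Illustrative sketch: decide whether a server may gzip a response,
// given the client's Accept-Encoding header (undefined = header absent).
function gzipAcceptable(acceptEncoding) {
  if (acceptEncoding === undefined) return true; // no preference stated
  let wildcardQ = null;
  for (const part of acceptEncoding.split(',')) {
    const [coding, ...params] = part.split(';').map((s) => s.trim());
    let q = 1.0; // qvalue defaults to 1 when omitted
    for (const p of params) {
      const m = /^q=([0-9.]+)$/.exec(p);
      if (m) q = parseFloat(m[1]);
    }
    if (coding.toLowerCase() === 'gzip') return q > 0;
    if (coding === '*') wildcardQ = q;
  }
  // gzip not listed explicitly: fall back to "*" if present, otherwise
  // treat it as unacceptable under a strict reading of the header.
  return wildcardQ !== null ? wildcardQ > 0 : false;
}
```

Under this reading, both Accept-Encoding: identity and Accept-Encoding: identity, *;q=0 rule gzip out, which is why the server above looks non-compliant.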

rss-parser's parseURL() breaks for me when a server sends gzip-compressed data:

> (await parser.parseURL("http://172.18.0.1:4444/normal")).title;
'Hacker News'

> (await parser.parseURL("http://172.18.0.1:4444/gzip")).title;
Uncaught Error: Non-whitespace before first tag.
Line: 0
Column: 1
Char:
    at error (/tmp/tmp.H1iAsG3355/node_modules/sax/lib/sax.js:658:10)
    at strictFail (/tmp/tmp.H1iAsG3355/node_modules/sax/lib/sax.js:684:7)
    at beginWhiteSpace (/tmp/tmp.H1iAsG3355/node_modules/sax/lib/sax.js:958:7)
    at SAXParser.write (/tmp/tmp.H1iAsG3355/node_modules/sax/lib/sax.js:1013:11)
    at exports.Parser.Parser.parseString (/tmp/tmp.H1iAsG3355/node_modules/xml2js/lib/parser.js:327:31)
    at Parser.parseString (/tmp/tmp.H1iAsG3355/node_modules/xml2js/lib/parser.js:5:59)
    at /tmp/tmp.H1iAsG3355/node_modules/rss-parser/lib/parser.js:33:22
    at new Promise (<anonymous>)
    at Parser.parseString (/tmp/tmp.H1iAsG3355/node_modules/rss-parser/lib/parser.js:32:16)

I think it's trying to parse the gzip-compressed data as rss+xml and is failing.

I think rss-parser should either:

- send an Accept-Encoding header (e.g. Accept-Encoding: identity) asking servers not to compress the response, or
- decompress response bodies according to the Content-Encoding response header.

The latter change would work better for me, and may work better for others, depending on the prevalence of servers that always compress responses regardless of the client's Accept-Encoding.