GZipped HTML in warcs is not handled as web pages

While working with WARCs from Webrecorder, it became apparent that many web pages were not indexed properly. @thomasegense discovered that HTML pages delivered over HTTP with gzip compression and stored in that compressed form are identified as application/gzip by warc_indexer, so no HTML-specific analysis (extraction of title, links, etc.) is performed.

A sample response header:

HTTP/1.1 200 OK
Date: Wed, 13 Mar 2019 06:38:58 GMT
Server: Apache
Last-Modified: Tue, 20 Feb 2018 17:09:53 GMT
ETag: "2552-565a7e115e570-gzip"
Accept-Ranges: bytes
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 2233
Content-Type: text/html

This problem probably extends to other file types delivered with gzip compression over HTTP (PDFs? BMPs?). A general solution would be to check the Content-Encoding HTTP header and, if it is present, decompress the content before analyzing it.
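
To illustrate the proposed general solution, here is a minimal Python sketch of the idea. It is not warc_indexer code. The function name `decode_http_payload` and the plain `dict` of headers are assumptions made for the example, and it only handles the gzip and deflate encodings.

```python
import gzip
import zlib


def decode_http_payload(headers, body):
    """Return body with a gzip/deflate Content-Encoding undone.

    headers: mapping of HTTP header names to values (hypothetical shape).
    body:    raw payload bytes as stored in the WARC record.
    """
    # HTTP header names are case-insensitive, so compare in lower case.
    encoding = ""
    for name, value in headers.items():
        if name.lower() == "content-encoding":
            encoding = value.strip().lower()

    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            # "deflate" is normally a zlib-wrapped stream ...
            return zlib.decompress(body)
        except zlib.error:
            # ... but some servers send a raw deflate stream instead.
            return zlib.decompress(body, -zlib.MAX_WBITS)

    # No (or unknown) encoding: hand the bytes on unchanged.
    return body


if __name__ == "__main__":
    sample_headers = {"Content-Encoding": "gzip", "Content-Type": "text/html"}
    stored = gzip.compress(b"<html><title>Example</title></html>")
    print(decode_http_payload(sample_headers, stored))
```

With the payload decoded this way, type detection would see the HTML bytes rather than a gzip stream, and the Content-Type header (text/html in the sample above) would again match the content.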

ukwa/webarchive-discovery, issue #204