ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

GZipped HTML in warcs is not handled as web pages #204

Closed: tokee closed this issue 5 years ago

tokee commented 5 years ago

While working with WARCs from webrecorder, it became apparent that many web pages were not being indexed properly. @thomasegense discovered that HTML pages delivered over HTTP with GZip compression and stored as-is in the WARC are identified as application/gzip by warc_indexer, so no HTML-specific analysis (extraction of title, links, etc.) is performed.

A sample response header:

HTTP/1.1 200 OK
Date: Wed, 13 Mar 2019 06:38:58 GMT
Server: Apache
Last-Modified: Tue, 20 Feb 2018 17:09:53 GMT
ETag: "2552-565a7e115e570-gzip"
Accept-Ranges: bytes
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 2233
Content-Type: text/html

This problem might extend to other file types that are delivered with GZip compression over HTTP (PDFs? BMP?). A general solution would be to look for the Content-Encoding HTTP header and, if it is present, decompress the content before analyzing it.
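A minimal sketch of that header-based approach, in plain Java. The class and method names (`ContentEncodingDecoder`, `decode`) are hypothetical and not part of warc-indexer, and the sketch assumes the caller has already extracted the Content-Encoding value from the stored HTTP response headers. Only gzip is handled; any other encoding is passed through unchanged.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of warc-indexer: decodes a record payload
// according to the Content-Encoding header of the stored HTTP response.
final class ContentEncodingDecoder {

    private ContentEncodingDecoder() {
    }

    /**
     * @param payload         the HTTP response body as stored in the WARC record
     * @param contentEncoding the Content-Encoding header value, or null if absent
     * @return a stream yielding the decoded body; the original stream if no
     *         supported encoding was declared
     */
    static InputStream decode(InputStream payload, String contentEncoding) throws IOException {
        if (contentEncoding == null) {
            return payload; // no header: nothing to undo
        }
        String encoding = contentEncoding.trim().toLowerCase(Locale.ROOT);
        if (encoding.equals("gzip") || encoding.equals("x-gzip")) {
            // Decompress on the fly so later analysis sees the HTML itself.
            return new GZIPInputStream(payload);
        }
        return payload; // unsupported encodings are left untouched in this sketch
    }
}
```

With the sample header above, `decode(body, "gzip")` would hand the uncompressed HTML to the analysers, while a record without the header is returned as-is.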

thomasegense commented 5 years ago

I have fixed this in our local webrecorder branch of the warc-indexer. The HtmlFeatureParser and TikaPayloadAnalyser are given the input stream, and it is easy to detect whether it is gzipped. If so, it is simply wrapped in a decompressing input stream. Note that the excluded MIME types are still ignored.
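The detect-and-wrap idea could look roughly like the sketch below. This is not the code from the branch mentioned above; `GzipSniffer` and `unwrapIfGzipped` are hypothetical names, and the sketch assumes detection is done by peeking at the two-byte gzip magic number (0x1f 0x8b) rather than by reading HTTP headers.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not the actual fix: peeks at the first two bytes and
// wraps the stream in a GZIPInputStream when they match the gzip magic number.
final class GzipSniffer {

    private GzipSniffer() {
    }

    static InputStream unwrapIfGzipped(InputStream in) throws IOException {
        // BufferedInputStream supports mark/reset, so the peeked bytes are not lost.
        BufferedInputStream buffered = new BufferedInputStream(in);
        buffered.mark(2);
        int first = buffered.read();
        int second = buffered.read();
        buffered.reset();

        boolean looksGzipped = first == 0x1f && second == 0x8b;
        return looksGzipped ? new GZIPInputStream(buffered) : buffered;
    }
}
```

One caveat of pure content sniffing: a resource that really is a gzip file (for example a downloadable .gz archive) would also be unwrapped, so a check against the Content-Encoding header or the excluded MIME types is probably still needed alongside it.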