tokee closed this issue 5 years ago
I have fixed this in our local webrecorder branch of the warc-indexer. The HtmlFeatureParser and TikaPayloadAnalyser are given the input stream, and it is easy to detect whether it is gzipped. If it is, the stream is simply wrapped in a decompressing input stream. Note that the excluded MIME types are still ignored.
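The detect-and-wrap approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from the webrecorder branch: the class name GzipSniffer and the method maybeGunzip are hypothetical, and it assumes that checking the two-byte gzip magic number (0x1f 0x8b) is a sufficient test.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of warc-indexer.
public class GzipSniffer {

    /**
     * Peeks at the first two bytes of the stream. If they match the gzip
     * magic number, the stream is wrapped in a GZIPInputStream; otherwise
     * the (buffered) stream is returned with no bytes consumed.
     */
    public static InputStream maybeGunzip(InputStream in) throws IOException {
        BufferedInputStream buffered = new BufferedInputStream(in);
        buffered.mark(2);            // remember the start of the payload
        int b1 = buffered.read();    // -1 if the stream is empty
        int b2 = buffered.read();
        buffered.reset();            // rewind so the caller sees every byte
        if (b1 == 0x1f && b2 == 0x8b) {
            return new GZIPInputStream(buffered);
        }
        return buffered;
    }
}
```

A parser would then read from the returned stream instead of the raw payload. Any MIME type exclusion would have to be applied separately, since this sketch looks only at the bytes.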
During work with WARCs from webrecorder it became apparent that a lot of web pages were not indexed properly. @thomasegense discovered that HTML pages delivered over HTTP with GZip and stored directly as such are seen as application/gzip by warc_indexer, and consequently no HTML-specific analysis (extraction of title, links etc.) is performed.
Sample header is
This problem might extend to other file types that might be delivered using GZip compression over HTTP (PDFs? BMP?). A general solution would be to look for the Content-Encoding HTTP header and, if present, uncompress the content before analyzing it.
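A rough sketch of that header-driven approach is shown below. It assumes the HTTP response headers of the record are already available as a simple map. The class name ContentEncodingDecoder, the method decode and the map-based header representation are all hypothetical and do not reflect how warc-indexer exposes headers. Only gzip and deflate are handled here; any other value passes through unchanged.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper, not part of warc-indexer.
public class ContentEncodingDecoder {

    /**
     * Returns a stream that yields the decoded payload, based on the
     * Content-Encoding header. Header names are matched case-insensitively.
     */
    public static InputStream decode(Map<String, String> headers, InputStream payload)
            throws IOException {
        String encoding = null;
        for (Map.Entry<String, String> header : headers.entrySet()) {
            if ("Content-Encoding".equalsIgnoreCase(header.getKey())) {
                encoding = header.getValue();
                break;
            }
        }
        if (encoding == null) {
            return payload; // no header: analyze the payload as stored
        }
        switch (encoding.trim().toLowerCase(Locale.ROOT)) {
            case "gzip":
            case "x-gzip":
                return new GZIPInputStream(payload);
            case "deflate":
                // Assumes zlib-wrapped deflate data.
                return new InflaterInputStream(payload);
            default:
                return payload; // unknown encoding: leave untouched
        }
    }
}
```

Compared with sniffing the payload bytes, relying on the header leaves a file that really is a gzip archive (and was served without Content-Encoding) classified as application/gzip.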