ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Droid WARC URL header sanitize #198

Closed by tokee 5 years ago

tokee commented 6 years ago

The tool wget produces WARC files in which the value of the WARC-Target-URI header is enclosed in angle brackets (<>). In warc-indexer, the URL can be retrieved safely from the WARC header using Normalisation.sanitiseWARCHeaderValue, but DroidDetectorAnalyser did not do this. This pull request fixes that.

This is an oversight that I am sure will come back to bite us, so I have raised issue #197.
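
For readers unfamiliar with the problem, here is a minimal sketch of the kind of sanitising involved. It is not the actual code of Normalisation.sanitiseWARCHeaderValue; the class WarcHeaderSketch and the method stripAngleBrackets are hypothetical names made up for illustration, and the only behaviour assumed is the removal of one enclosing pair of angle brackets.

```java
// Hypothetical illustration only; not the project's real implementation.
final class WarcHeaderSketch {
    private WarcHeaderSketch() {}

    /**
     * Strips surrounding whitespace and, if present, one enclosing pair of
     * angle brackets, as found in WARC-Target-URI values written by wget.
     */
    static String stripAngleBrackets(String value) {
        if (value == null) {
            return null;
        }
        String trimmed = value.trim();
        if (trimmed.length() >= 2 && trimmed.startsWith("<") && trimmed.endsWith(">")) {
            return trimmed.substring(1, trimmed.length() - 1);
        }
        return trimmed;
    }

    public static void main(String[] args) {
        // wget-style value, enclosed in <>
        System.out.println(stripAngleBrackets("<http://example.com/>"));
        // plain value, returned unchanged
        System.out.println(stripAngleBrackets("http://example.com/"));
    }
}
```

Both calls in main print http://example.com/, so a consumer such as an analyser sees the same URL regardless of which tool wrote the WARC file.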