ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Add mechanism for custom adjustment of field content #256

Closed: tokee closed this 2 years ago

tokee commented 3 years ago

When Heritrix 3.4.0-IIPC-NAS-SNAPSHOT-2019-11-20T10:01:48Z parses <img srcset="foo.jpg 720w, bar.jpg 1080w" ...>, it extracts the URLs as "foo.jpg 720w" and "bar.jpg 1080w" instead of the correct foo.jpg and bar.jpg. This causes the harvest of e.g. http://example.com/img/foo.jpg%20720w, which some webservers handle by delivering foo.jpg. Unfortunately the WARC-Target-URI will still contain the %20720w part, so playback of the originating page won't work.

There should be a mechanism for rewriting field-specific content to compensate for errors in the source WARCs. In the above example a regexp-based replacer or trimmer would work well.
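As a rough illustration of the regexp-based trimmer suggested above, the sketch below strips a trailing srcset descriptor from a URL. The class name SrcsetDescriptorTrimmer and the exact pattern are assumptions made for this example; they are not part of webarchive-discovery, and a real rule would probably be supplied through configuration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of webarchive-discovery.
class SrcsetDescriptorTrimmer {
    // Assumed pattern: one or more separators (whitespace or the escaped
    // space "%20") followed by a srcset descriptor such as "720w" or "2x"
    // at the very end of the URL.
    private static final Pattern DESCRIPTOR =
            Pattern.compile("(?:%20|\\s)+\\d+(?:\\.\\d+)?[wx]$");

    // Returns the URL without a trailing descriptor; other URLs pass through unchanged.
    static String trim(String url) {
        return DESCRIPTOR.matcher(url).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(trim("http://example.com/img/foo.jpg%20720w"));
        System.out.println(trim("http://example.com/img/bar.jpg 1080w"));
        System.out.println(trim("http://example.com/img/baz.jpg"));
    }
}
```

The pattern is anchored to the end of the string so that a %20 elsewhere in a legitimate URL is left alone. It can still misfire on a URL that genuinely ends in something like "%202x", so such a rule should only be applied to fields known to be affected.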

A limited form of field-specific processing already exists in the shape of the maximum field length. Extending this to allow for other kinds of processing seems like a reasonable way forward.
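One way to picture that extension is to treat the length limit as just one kind of per-field adjuster, alongside regexp replacement. The following is a minimal sketch under that assumption; the FieldAdjusters class, its methods, and the field names "url" and "content" are hypothetical and do not reflect the actual webarchive-discovery code or configuration format.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

// Hypothetical registry of per-field adjusters, applied in insertion order.
class FieldAdjusters {
    private final Map<String, List<UnaryOperator<String>>> rules = new LinkedHashMap<>();

    // Adjuster that truncates content to at most 'limit' characters.
    static UnaryOperator<String> maxLength(int limit) {
        return s -> s.length() <= limit ? s : s.substring(0, limit);
    }

    // Adjuster that replaces every match of 'regexp' with 'replacement'.
    static UnaryOperator<String> regexpReplace(String regexp, String replacement) {
        Pattern pattern = Pattern.compile(regexp);
        return s -> pattern.matcher(s).replaceAll(replacement);
    }

    // Registers an adjuster for a field; returns this to allow chaining.
    FieldAdjusters add(String field, UnaryOperator<String> adjuster) {
        rules.computeIfAbsent(field, k -> new ArrayList<>()).add(adjuster);
        return this;
    }

    // Runs all adjusters registered for the field; unknown fields pass through unchanged.
    String adjust(String field, String value) {
        String result = value;
        for (UnaryOperator<String> adjuster
                : rules.getOrDefault(field, Collections.emptyList())) {
            result = adjuster.apply(result);
        }
        return result;
    }

    public static void main(String[] args) {
        FieldAdjusters adjusters = new FieldAdjusters()
                .add("url", regexpReplace("(?:%20|\\s)+\\d+[wx]$", ""))
                .add("content", maxLength(5));
        System.out.println(adjusters.adjust("url", "http://example.com/img/foo.jpg%20720w"));
        System.out.println(adjusters.adjust("content", "abcdefgh"));
    }
}
```

Keeping adjusters as an ordered list per field means a length limit and a rewrite rule can be combined on the same field, with the order of application under the control of whoever configures them.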