When Heritrix 3.4.0-IIPC-NAS-SNAPSHOT-2019-11-20T10:01:48Z parses <img srcset="foo.jpg 720w, bar.jpg 1080w" ...>, it extracts the URLs as "foo.jpg 720w" and "bar.jpg 1080w" instead of the correct foo.jpg and bar.jpg. This causes the harvest of e.g. http://example.com/img/foo.jpg%20720w, which some webservers handle by delivering foo.jpg. Unfortunately the WARC-Target-URI will still have the %20720w part, so playback of the originating page won't work.
There should be a mechanism for rewriting field-specific content to compensate for errors in the source WARCs. In the above example a regexp-based replacer or trimmer would work well.
Currently a limited form of field-specific processing is available as a maximum field length. Extending this to allow for other kinds of processing seems like a valid way forward.
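To make the suggestion concrete, here is a minimal sketch of what a regexp-based trimmer for the srcset case could look like. It is only an illustration in Python, not existing functionality: the field name `url`, the `FIELD_REWRITERS` table and both function names are hypothetical, and the pattern assumes the stray part is always a trailing width (`720w`) or density (`2x`) descriptor.

```python
import re

# Assumption: the stray suffix is a srcset descriptor such as "720w" or "1.5x",
# separated from the real URL by whitespace or by its percent-encoded form (%20).
SRCSET_DESCRIPTOR = re.compile(r"(?:\s|%20)+\d+(?:\.\d+)?[wx]$", re.IGNORECASE)


def trim_srcset_descriptor(value: str) -> str:
    """Remove a trailing srcset descriptor from a URL-like field value."""
    return SRCSET_DESCRIPTOR.sub("", value)


# Hypothetical per-field configuration: field name -> ordered list of rewriters.
FIELD_REWRITERS = {
    "url": [trim_srcset_descriptor],
}


def rewrite_field(field: str, value: str) -> str:
    """Apply every rewriter configured for the field; other fields pass through."""
    for rewriter in FIELD_REWRITERS.get(field, []):
        value = rewriter(value)
    return value


if __name__ == "__main__":
    print(rewrite_field("url", "http://example.com/img/foo.jpg%20720w"))
```

Note that a rule like this would also trim a legitimate URL that happens to end in something like `%20720w`, so it probably needs to be opt-in per field and configurable rather than always on.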