internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
https://heritrix.readthedocs.io/

Commas in srcset-URLs are not handled correctly #458

Open grob opened 2 years ago

grob commented 2 years ago

Although #243 is merged, srcset URLs that contain commas are still not parsed/rewritten correctly; see https://web.archive.org/web/*/https://orf.at/ for an example.

The original URLs used in srcset attributes look like this: https://assets.orf.at/mims/2022/03/26/crops/w=875,q=90/1204287_opener_429226_coronavirus_schule_tests_vorschau_v1_a.jpg?s=bad56ac4b6df02892d3bd744c8e9494d4fd72b50.

A complete srcset example used on this site:

<source media="(max-width: 600px)" srcset="https://assets.orf.at/mims/2022/03/26/crops/w=800,h=450,q=70/1204282_master_429226_coronavirus_schule_tests_vorschau_v1_a.jpg?s=baff281a0ee94f81ed19d576f7eff4f0ed6e44c9 800w, https://assets.orf.at/mims/2022/03/26/crops/w=1280,h=720,q=60/1204282_master_429226_coronavirus_schule_tests_vorschau_v1_a.jpg?s=735e42760bcc348a2afed7dde20a17bf2857caaf 1280w">

results in (see here):

<source media="(max-width: 600px)" srcset="https://web.archive.org/web/20220114214021im_/https://assets.orf.at/mims/2022/03/26/crops/w=800, /web/20220114214021im_/https://orf.at/stories/3243632/h=450, /web/20220114214021im_/https://orf.at/stories/3243632/q=70/1204282_master_429226_coronavirus_schule_tests_vorschau_v1_a.jpg?s=baff281a0ee94f81ed19d576f7eff4f0ed6e44c9 800w, https://web.archive.org/web/20220114214021im_/https://assets.orf.at/mims/2022/03/26/crops/w=1280, /web/20220114214021im_/https://orf.at/stories/3243632/h=720, /web/20220114214021im_/https://orf.at/stories/3243632/q=60/1204282_master_429226_coronavirus_schule_tests_vorschau_v1_a.jpg?s=735e42760bcc348a2afed7dde20a17bf2857caaf 1280w">
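
The output above is consistent with the attribute value being split at every comma, so that `h=450` and `q=70` are each treated as separate relative URLs. The HTML Standard's srcset parsing algorithm instead takes each URL as a run of non-whitespace characters, so a comma only ends a candidate when it trails the URL or follows the descriptors. Below is a minimal Python sketch of that idea. It is not the Wayback or Heritrix implementation; the function names `parse_srcset` and `rewrite_srcset` are hypothetical, and descriptor handling is simplified (no parenthesis tracking).

```python
# Simplified srcset parsing: commas inside a URL are kept, because the URL
# is read as a run of non-whitespace characters (as in the HTML Standard).
_WS = " \t\n\r\f"


def parse_srcset(value):
    """Return a list of (url, descriptor) pairs from a srcset value."""
    candidates = []
    pos, n = 0, len(value)
    while pos < n:
        # Skip whitespace and stray commas between candidates.
        while pos < n and (value[pos] in _WS or value[pos] == ","):
            pos += 1
        if pos >= n:
            break
        # Collect the URL: everything up to the next whitespace character.
        start = pos
        while pos < n and value[pos] not in _WS:
            pos += 1
        url = value[start:pos]
        descriptor = ""
        if url.endswith(","):
            # A trailing comma ends the candidate; no descriptor follows.
            url = url.rstrip(",")
        else:
            # Descriptors run up to the next comma (simplified).
            start = pos
            while pos < n and value[pos] != ",":
                pos += 1
            descriptor = value[start:pos].strip(_WS)
        candidates.append((url, descriptor))
    return candidates


def rewrite_srcset(value, rewrite_url):
    """Apply rewrite_url to each candidate URL and rebuild the attribute."""
    parts = []
    for url, descriptor in parse_srcset(value):
        new_url = rewrite_url(url)
        parts.append(f"{new_url} {descriptor}" if descriptor else new_url)
    return ", ".join(parts)
```

Calling `rewrite_srcset(value, lambda u: prefix + u)` with a replay prefix such as `https://web.archive.org/web/20220114214021im_/` (taken from the example above, used here only for illustration) would prefix each of the two candidate URLs once and leave `w=800,h=450,q=70` intact inside the path.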

ato commented 2 years ago

As this is about rewriting, this is likely an issue with the (closed-source) Wayback replay software, not with the Heritrix web crawler.