ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Avoid attempting to parse clearly irrelevant URIs #7

Open anjackson opened 7 years ago

anjackson commented 7 years ago

The web-renderer processor is attempting to parse all extracted links/references, and this can throw errors like:

 Could not parse as UURI: data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL

There's no point parsing data: or indeed mailto: URIs. We should probably default to whitelisting http: and https: URIs only.
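
As a rough illustration of the whitelisting idea, here is a minimal sketch of a scheme check that could run before any attempt to build a UURI. The class name `SchemeFilter` and the method `isCrawlable` are hypothetical and not part of Heritrix or this repository, and the sketch assumes extracted links have already been resolved to absolute form, so anything without a scheme is simply rejected.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not an existing Heritrix class: decides whether an
// extracted link is worth handing to the URI parser at all.
final class SchemeFilter {

    // Assumption: only these two schemes are of interest to the crawler.
    private static final Set<String> ALLOWED_SCHEMES =
            Collections.unmodifiableSet(new HashSet<>(Arrays.asList("http", "https")));

    private SchemeFilter() {
    }

    // Returns true only when the text before the first ':' is an allowed
    // scheme (compared case-insensitively). Null, scheme-less and
    // data:/mailto: style inputs all return false.
    static boolean isCrawlable(String uri) {
        if (uri == null) {
            return false;
        }
        String trimmed = uri.trim();
        int colon = trimmed.indexOf(':');
        if (colon <= 0) {
            return false; // no scheme present
        }
        String scheme = trimmed.substring(0, colon).toLowerCase(Locale.ROOT);
        return ALLOWED_SCHEMES.contains(scheme);
    }
}
```

A caller would skip the parse step whenever `SchemeFilter.isCrawlable(link)` is false, so long data: payloads never reach the parser and never produce the "Could not parse as UURI" error. Checking only the scheme prefix keeps the test cheap, which matters because base64 data: URIs can be very long.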