internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
https://heritrix.readthedocs.io/

Better handling of extracted URIs that are "data URIs" (base64 encoded media) #422

Open · kris-sigur opened this issue 3 years ago

kris-sigur commented 3 years ago

Data URIs seem to trip up the Extractor module. An excerpt from the log follows, with the URI truncated:

Jun 18, 2021 6:47:08 PM org.archive.modules.extractor.ExtractorSitemap recordOutlink
WARNING: URIException when recording outlink http://lindaskoli.is/data:image/jpeg;base64,/9j/4AAQ--18K truncated-- (in thread 'ToeThread #54: http://lindaskoli.is/post-sitemap.xml'; in processor 'extractorSitemap')
org.apache.commons.httpclient.URIException: URI length > 2083: http://lindaskoli.is/data:image/jpeg;base64,/9j/4AAQ-- 18K truncated again--
        at org.archive.url.UsableURIFactory.fixup(UsableURIFactory.java:357)
        at org.archive.url.UsableURIFactory.create(UsableURIFactory.java:301)
        at org.archive.net.UURIFactory.getInstance(UURIFactory.java:55)
        at org.archive.modules.extractor.Extractor.addRelativeToBase(Extractor.java:190)
        at org.archive.modules.extractor.ExtractorSitemap.recordOutlink(ExtractorSitemap.java:163)
        at org.archive.modules.extractor.ExtractorSitemap.innerExtract(ExtractorSitemap.java:105)
        at org.archive.modules.extractor.ContentExtractor.extract(ContentExtractor.java:37)
        at org.archive.modules.extractor.Extractor.innerProcess(Extractor.java:102)
        at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
        at org.archive.modules.Processor.process(Processor.java:142)
        at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)

These data URIs can be quite large and, as seen above, are written to the log twice. In addition to any other remedies, it would be best to modify the logging code to truncate the offending URI in the log entry and avoid this excessive log spam. The example above was only about 18K; the log of my last large-scale crawl contains much larger ones.
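For illustration only (this is not existing Heritrix code), a log-side mitigation could cap the URI before it is interpolated into the WARNING message; the 200-character limit and the helper name below are arbitrary choices:

```java
// Minimal sketch, not Heritrix code: cap an over-long URI before logging it.
public final class LogTruncationSketch {
    private static final int MAX_LOGGED_URI_LENGTH = 200; // arbitrary illustrative cap

    static String truncateForLog(String uri) {
        if (uri == null || uri.length() <= MAX_LOGGED_URI_LENGTH) {
            return uri;
        }
        return uri.substring(0, MAX_LOGGED_URI_LENGTH)
                + "... [" + (uri.length() - MAX_LOGGED_URI_LENGTH) + " chars truncated]";
    }

    public static void main(String[] args) {
        String hugeDataUri = "http://lindaskoli.is/data:image/jpeg;base64," + "A".repeat(18_000);
        System.out.println(truncateForLog(hugeDataUri));
    }
}
```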

More specifically useful would be for the Extractor class to detect data URIs and either ignore them or pass them to a service that decodes them and performs link extraction if the MIME type warrants it. That said, I think data URIs are rarely (if ever) used for MIME types that might contain further URIs; my only experience of them has been images.

Some of this logic may belong in the underlying URL libraries in webarchive-commons.
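Wherever that logic ends up, a minimal, self-contained sketch of the detection step could look like the following. Note that in the log above the data: link appears to have been resolved as a relative path against http://lindaskoli.is/, so the check presumably has to run on the raw extracted string before base resolution. The class and method names here are made up for illustration:

```java
import java.util.Locale;
import java.util.Optional;

// Sketch only: recognise a data URI on the raw extracted link and pull out its
// media type, so a crawler could drop it outright or hand it to a decoding
// service when the media type could plausibly contain further links.
public final class DataUriSketch {

    /** Returns the media type of a data URI (RFC 2397), or empty if the link is not a data URI. */
    static Optional<String> dataUriMediaType(String link) {
        if (link == null || !link.toLowerCase(Locale.ROOT).startsWith("data:")) {
            return Optional.empty();
        }
        String rest = link.substring("data:".length());
        int end = rest.indexOf(',');
        if (end < 0) {
            end = rest.length();
        }
        String header = rest.substring(0, end);                        // e.g. "image/jpeg;base64"
        int semi = header.indexOf(';');
        String mediaType = (semi >= 0 ? header.substring(0, semi) : header).trim();
        return Optional.of(mediaType.isEmpty() ? "text/plain" : mediaType); // RFC 2397 default when omitted
    }

    public static void main(String[] args) {
        System.out.println(dataUriMediaType("data:image/jpeg;base64,/9j/4AAQ..."));  // Optional[image/jpeg]
        System.out.println(dataUriMediaType("http://lindaskoli.is/page.html"));      // Optional.empty
    }
}
```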

ato commented 3 years ago

We probably need to make all the extractors use the outlink helper methods in the Extractor base classes consistently, as a number of them call curi.getOutlinks().add(link) directly. Then we can change the helper methods to ignore data URIs.
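As a sketch of that consolidation (hypothetical names, not the current Extractor API), the shared helper could carry the guard so every extractor routed through it gets data URIs filtered in one place:

```java
import java.util.Collection;

// Sketch of the idea only: one shared outlink helper with the data-URI guard,
// instead of each extractor calling curi.getOutlinks().add(link) directly.
// The class, method, and interface here are hypothetical stand-ins.
abstract class ExtractorHelperSketch {

    /** Hypothetical shared outlink helper that applies the data-URI filter for all extractors. */
    protected void addOutlink(OutlinkSink curi, String link) {
        if (link != null && link.regionMatches(true, 0, "data:", 0, 5)) {
            return; // ignore data URIs rather than attempting to build a crawl URI from them
        }
        curi.getOutlinks().add(link);
    }

    /** Minimal stand-in for the outlink collection mentioned in the comment above. */
    interface OutlinkSink {
        Collection<String> getOutlinks();
    }
}
```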

It might also be nice to change the outlink helpers to take a CharSequence instead of a String, so that where possible they can filter out large URIs before they get copied to the heap as Strings.
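Something along these lines, purely as a sketch: the 2083 cap just mirrors the "URI length > 2083" check seen in the stack trace above, and the method name is made up:

```java
// Sketch of the CharSequence suggestion: reject on length before toString(),
// so an 18K base64 blob is never copied to the heap as a String.
final class OutlinkLengthFilterSketch {
    private static final int MAX_URI_LENGTH = 2083; // mirrors the limit reported by UsableURIFactory

    /** Returns the outlink as a String if acceptable, or null if it should be dropped. */
    static String acceptOutlink(CharSequence candidate) {
        if (candidate == null || candidate.length() > MAX_URI_LENGTH) {
            return null;             // rejected before any String copy is made
        }
        return candidate.toString(); // only now materialize the String
    }
}
```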