internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
https://heritrix.readthedocs.io/

Better handling of extracted URIs that are "data URIs" (base64 encoded media) #422

Open · kris-sigur opened this issue 3 years ago

kris-sigur commented 3 years ago

Data URIs seem to trip up the Extractor module. An excerpt from the log follows, with the URI truncated:

Jun 18, 2021 6:47:08 PM org.archive.modules.extractor.ExtractorSitemap recordOutlink
WARNING: URIException when recording outlink http://lindaskoli.is/data:image/jpeg;base64,/9j/4AAQ--18K truncated-- (in thread 'ToeThread #54: http://lindaskoli.is/post-sitemap.xml'; in processor 'extractorSitemap')
org.apache.commons.httpclient.URIException: URI length > 2083: http://lindaskoli.is/data:image/jpeg;base64,/9j/4AAQ-- 18K truncated again--
        at org.archive.url.UsableURIFactory.fixup(UsableURIFactory.java:357)
        at org.archive.url.UsableURIFactory.create(UsableURIFactory.java:301)
        at org.archive.net.UURIFactory.getInstance(UURIFactory.java:55)
        at org.archive.modules.extractor.Extractor.addRelativeToBase(Extractor.java:190)
        at org.archive.modules.extractor.ExtractorSitemap.recordOutlink(ExtractorSitemap.java:163)
        at org.archive.modules.extractor.ExtractorSitemap.innerExtract(ExtractorSitemap.java:105)
        at org.archive.modules.extractor.ContentExtractor.extract(ContentExtractor.java:37)
        at org.archive.modules.extractor.Extractor.innerProcess(Extractor.java:102)
        at org.archive.modules.Processor.innerProcessResult(Processor.java:175)
        at org.archive.modules.Processor.process(Processor.java:142)
        at org.archive.modules.ProcessorChain.process(ProcessorChain.java:131)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:148)

These data URIs can be quite large and, as seen above, are written to the log twice. In addition to any other remedies, it would be best to modify the logging code to truncate the offending URI in the log entry and avoid this excessive log spam. The example above was only about 18K; the log of my last large-scale crawl contains much larger ones.
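For illustration only (this is not existing Heritrix code), a log-side mitigation could cap the URI before it is interpolated into the WARNING message; the 200-character limit and the helper name below are arbitrary choices:

```java
// Minimal sketch, not Heritrix code: cap an over-long URI before logging it.
public final class LogTruncationSketch {
    private static final int MAX_LOGGED_URI_LENGTH = 200; // arbitrary illustrative cap

    static String truncateForLog(String uri) {
        if (uri == null || uri.length() <= MAX_LOGGED_URI_LENGTH) {
            return uri;
        }
        return uri.substring(0, MAX_LOGGED_URI_LENGTH)
                + "... [" + (uri.length() - MAX_LOGGED_URI_LENGTH) + " chars truncated]";
    }

    public static void main(String[] args) {
        String hugeDataUri = "http://lindaskoli.is/data:image/jpeg;base64," + "A".repeat(18_000);
        System.out.println(truncateForLog(hugeDataUri));
    }
}
```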

More specifically useful would be for the Extractor class to detect data URIs and either ignore them or pass them to a service that decodes them and performs link extraction if the MIME type warrants it. That said, I think data URIs are rarely (if ever) used for MIME types that might contain further URIs; my only experience of them has been images.

Some of this logic may belong in the underlying URL libraries in webarchive-commons.
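Wherever that logic ends up, a minimal, self-contained sketch of the detection step could look like the following. Note that in the log above the data: link appears to have been resolved as a relative path against http://lindaskoli.is/, so the check presumably has to run on the raw extracted string before base resolution. The class and method names here are made up for illustration:

```java
import java.util.Locale;
import java.util.Optional;

// Sketch only: recognise a data URI on the raw extracted link and pull out its
// media type, so a crawler could drop it outright or hand it to a decoding
// service when the media type could plausibly contain further links.
public final class DataUriSketch {

    /** Returns the media type of a data URI (RFC 2397), or empty if the link is not a data URI. */
    static Optional<String> dataUriMediaType(String link) {
        if (link == null || !link.toLowerCase(Locale.ROOT).startsWith("data:")) {
            return Optional.empty();
        }
        String rest = link.substring("data:".length());
        int end = rest.indexOf(',');
        if (end < 0) {
            end = rest.length();
        }
        String header = rest.substring(0, end);                        // e.g. "image/jpeg;base64"
        int semi = header.indexOf(';');
        String mediaType = (semi >= 0 ? header.substring(0, semi) : header).trim();
        return Optional.of(mediaType.isEmpty() ? "text/plain" : mediaType); // RFC 2397 default when omitted
    }

    public static void main(String[] args) {
        System.out.println(dataUriMediaType("data:image/jpeg;base64,/9j/4AAQ..."));  // Optional[image/jpeg]
        System.out.println(dataUriMediaType("http://lindaskoli.is/page.html"));      // Optional.empty
    }
}
```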

ato commented 3 years ago

We probably need to make all the extractors use the outlink helper methods in the Extractor base classes consistently, as a number of them call curi.getOutlinks().add(link) directly. Then we can change the helper methods to ignore data URIs.
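As a sketch of that consolidation (hypothetical names, not the current Extractor API), the shared helper could carry the guard so every extractor routed through it gets data URIs filtered in one place:

```java
import java.util.Collection;

// Sketch of the idea only: one shared outlink helper with the data-URI guard,
// instead of each extractor calling curi.getOutlinks().add(link) directly.
// The class, method, and interface here are hypothetical stand-ins.
abstract class ExtractorHelperSketch {

    /** Hypothetical shared outlink helper that applies the data-URI filter for all extractors. */
    protected void addOutlink(OutlinkSink curi, String link) {
        if (link != null && link.regionMatches(true, 0, "data:", 0, 5)) {
            return; // ignore data URIs rather than attempting to build a crawl URI from them
        }
        curi.getOutlinks().add(link);
    }

    /** Minimal stand-in for the outlink collection mentioned in the comment above. */
    interface OutlinkSink {
        Collection<String> getOutlinks();
    }
}
```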

It might also be nice to change the outlink helpers to take a CharSequence instead of a String, so that where possible they can filter out large URIs before they get copied to the heap as Strings.
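Something along these lines, purely as a sketch: the 2083 cap just mirrors the "URI length > 2083" check seen in the stack trace above, and the method name is made up:

```java
// Sketch of the CharSequence suggestion: reject on length before toString(),
// so an 18K base64 blob is never copied to the heap as a String.
final class OutlinkLengthFilterSketch {
    private static final int MAX_URI_LENGTH = 2083; // mirrors the limit reported by UsableURIFactory

    /** Returns the outlink as a String if acceptable, or null if it should be dropped. */
    static String acceptOutlink(CharSequence candidate) {
        if (candidate == null || candidate.length() > MAX_URI_LENGTH) {
            return null;             // rejected before any String copy is made
        }
        return candidate.toString(); // only now materialize the String
    }
}
```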