internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
https://heritrix.readthedocs.io/

Avoid speculative links extraction for meta fields known not to contain links #225

Open anjackson opened 5 years ago

anjackson commented 5 years ago

Following this report of a URL being constructed from <meta> elements:

I'm using heritrix 3.3.0-SNAPSHOT and see some strange behavior in the link extraction. This is one example in crawl.log:

2018-12-21T04:07:03.874Z   404       7161 https://stitch-maps.com/news/2018/10/twofer/Stitch-Maps.com RLX https://stitch-maps.com/news/2018/10/twofer/ text/html #116 20181219040702090+1782 sha1:K7HLTQ7SFI4KAQN3NVAO4OJ4UBYT3FGE - -

There isn't any link to the crawled URL on the source page, so it seems the Facebook <meta> tags on that page have something to do with it:

<meta property="og:url" content="http://stitch-maps.com/news/2018/10/twofer/"/>
<meta property="og:site_name" content="Stitch-Maps.com"/>

Isn't it a bug that Heritrix combined these two URLs into https://stitch-maps.com/news/2018/10/twofer/Stitch-Maps.com?

anjackson commented 5 years ago

However, looking at the code in question, it appears that ExtractorHTML speculatively extracts anything that might be a URL from the content="..." attribute of any <meta> element, except those with property="robots" or property="refresh":

https://github.com/internetarchive/heritrix3/blob/a83167619604926b1c8aebfef5e21271ad64eeaa/modules/src/main/java/org/archive/modules/extractor/ExtractorHTML.java#L990-L996

I think this generally won't happen with ordinary textual content values, but in this case the domain-name form appears to cause the value to be judged isVeryLikelyUri(...) == true.

https://github.com/internetarchive/heritrix3/blob/05811705ed996122bea1f4e034c1ed5f7a07240f/commons/src/main/java/org/archive/util/UriUtils.java#L394-L469
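To illustrate why a bare domain name trips this kind of check: a value like Stitch-Maps.com matches the dotted-word-plus-TLD shape that URI-likeness heuristics look for. The following is my own minimal sketch of such a heuristic, not the actual isVeryLikelyUri logic linked above:

```java
// Sketch of a domain-name-shaped heuristic. This is NOT Heritrix's
// actual isVeryLikelyUri implementation (see the UriUtils link above),
// just an illustration of why bare domains get flagged.
public class LikelyUriSketch {

    // A bare hostname like "Stitch-Maps.com" matches a dotted-word
    // pattern ending in a TLD-like suffix, so a naive heuristic
    // treats it as a relative URI candidate.
    static boolean looksLikeUri(String s) {
        return s.matches("[\\w.-]+\\.(com|net|org|at)(/\\S*)?");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeUri("Stitch-Maps.com")); // true: flagged as a URI
        System.out.println(looksLikeUri("Knitting news"));   // false: plain text
    }
}
```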

Hence, I'm not sure how often this problem will really turn up - it may not be worth worrying about.

However, for common properties that are known never to carry absolute or relative URLs of any kind, ExtractorHTML could be modified to skip this speculative link extraction.
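One way such a skip-list could be sketched (the property names below are examples I chose, not an agreed list, and the wiring into ExtractorHTML would need review):

```java
import java.util.Set;

// Hypothetical skip-list sketch: meta property/name values whose
// content attribute is known to hold plain text or a bare domain,
// never a crawlable link. Not Heritrix's actual configuration.
public class MetaSkipList {

    static final Set<String> NO_LINK_META = Set.of(
        "og:site_name", "twitter:domain", "publisher",
        "author", "description", "keywords");

    // Only attempt speculative link extraction for meta fields
    // not on the skip-list.
    static boolean shouldExtractLinks(String propertyOrName) {
        return !NO_LINK_META.contains(propertyOrName.toLowerCase());
    }

    public static void main(String[] args) {
        System.out.println(shouldExtractLinks("og:site_name")); // false: skipped
        System.out.println(shouldExtractLinks("og:url"));       // true: may hold a link
    }
}
```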

anjackson commented 5 years ago

Apparently this happens a lot with the og: (Facebook Open Graph) meta tags.

Perhaps given the change in usage of these fields in recent years, it's time to change the default behaviour to avoid this speculative link extraction?

ToRu82 commented 4 years ago

This really happens very often, and fixing it would save a lot of bandwidth and trouble. For example, when crawling www.klausenstein.at, the host automatically files an abuse report because of this line in the page source:

<meta name="publisher" content="iNetWorker.at"/>

This causes Heritrix to request http://www.klausenstein.at/iNetWorker.at, which is interpreted as a crawler trap and results in an abuse report. We have faced many similar situations with markup like <meta name="publisher" content="domain.com"/> ...
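The spurious request arises because the bare domain, once flagged as a link, is treated as a relative reference and resolved against the page URL. The same resolution can be reproduced with the standard java.net.URI API (this just demonstrates the mechanics, not Heritrix's code path):

```java
import java.net.URI;

// Demonstrates how a domain-like string, treated as a relative
// reference, resolves against the page URL to produce the
// spurious request described above.
public class RelativeResolveDemo {

    static String resolveAgainst(String base, String ref) {
        return URI.create(base).resolve(ref).toString();
    }

    public static void main(String[] args) {
        // "iNetWorker.at" has no scheme, so it resolves as a
        // relative path under the base URL.
        System.out.println(resolveAgainst("http://www.klausenstein.at/", "iNetWorker.at"));
        // prints http://www.klausenstein.at/iNetWorker.at
    }
}
```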

ToRu82 commented 4 years ago

Unfortunately, the problems keep increasing; this tag also causes trouble:

<meta name="twitter:domain" content="Drivingthenation.com" />

It is placed on every page of the domain, and for every single page request it generates an additional invalid request (404) of the form "current URL + Drivingthenation.com", which leads to thousands of invalid requests with a 404 return code. For instance, www.drivingthenation.com/category/automobilesandenergy/ "links" to www.drivingthenation.com/category/automobilesandenergy/Drivingthenation.com, and so on. None of these "linked" pages exist.

It would be very helpful if a solution could be found for this problem in the near future; these incorrectly extracted URLs cause great frustration for webmasters. The culprit is always a content="domain.com" attribute, which is almost never a link.

mvaitkus commented 3 years ago

In my opinion, this URL-guessing approach, also applied when parsing JavaScript content, should be removed entirely. It easily causes hundreds of not-found errors per minute, which often triggers alerts. Whoever thought this was a good approach has probably never hosted or monitored anything.