anjackson opened this issue 5 years ago
Following this report of a URL being constructed from `<meta>` elements, I had a look at the code in question: it appears that `ExtractorHTML` extracts links that might be URLs from any `<meta content="...">` attribute except for `property="robots"` or `property="refresh"`:
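(The relevant Heritrix source isn't reproduced here; the following is just a minimal, self-contained sketch of the behaviour described, using plain `java.net.URI` and a home-made regex standing in for the real `ExtractorHTML`/`UriUtils.isVeryLikelyUri` logic. The class, method, and heuristic below are illustrative only, not the actual implementation.)

```java
import java.net.URI;
import java.util.Map;
import java.util.regex.Pattern;

public class MetaSpeculationSketch {
    // Crude stand-in for a "very likely URI" heuristic: anything with a
    // scheme, a leading slash, or a dotted word like "example.com".
    private static final Pattern LIKELY_URI = Pattern.compile(
            "^(?:[a-z][a-z0-9+.-]*:|/|[\\w-]+(?:\\.[\\w-]+)+).*",
            Pattern.CASE_INSENSITIVE);

    static void processMeta(URI base, Map<String, String> attrs) {
        String name = attrs.getOrDefault("name", "").toLowerCase();
        String httpEquiv = attrs.getOrDefault("http-equiv", "").toLowerCase();
        String content = attrs.get("content");
        if (content == null) return;

        // robots and refresh get their own special handling...
        if (name.equals("robots") || httpEquiv.equals("refresh")) {
            return;
        }
        // ...every other content attribute is tested speculatively:
        // a value that looks link-like is resolved against the page base.
        if (LIKELY_URI.matcher(content).matches()) {
            System.out.println("speculative link: " + base.resolve(content));
        }
    }

    public static void main(String[] args) {
        URI base = URI.create("http://www.example.com/some/page/");
        // A textual value: does not match the heuristic, nothing is queued.
        processMeta(base, Map.of("name", "description", "content", "A page about things"));
        // A bare domain name: matches, and resolves to a non-existent relative URL.
        processMeta(base, Map.of("name", "publisher", "content", "example-agency.com"));
        // -> speculative link: http://www.example.com/some/page/example-agency.com
    }
}
```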
I think, in general, this won't happen with textual `content` attributes, but in this case the domain-name form appears to be causing this to be judged `isVeryLikelyUri(...) == true`.
Hence, I'm not sure how often this problem will really turn up - it may not be worth worrying about.
However, for common properties that are known not to be used for absolute or relative URLs of any sort, the `ExtractorHTML` class could be modified to skip this speculative link extraction.
Apparently this happens a lot with `og:`/Facebook-tag attributes.
Perhaps given the change in usage of these fields in recent years, it's time to change the default behaviour to avoid this speculative link extraction?
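As a rough sketch of that kind of change (not the actual `ExtractorHTML` code; the class name, method, and the entries in the skip set are purely illustrative), the extractor could consult a set of meta names/properties that are known to carry plain text or bare domain names before attempting any speculative extraction:

```java
import java.util.Set;

public class MetaSkipListSketch {
    // Illustrative skip list: meta name/property values whose content is
    // known to be descriptive text or a bare domain name, never a link.
    private static final Set<String> NON_LINK_META = Set.of(
            "publisher", "author", "description", "keywords",
            "twitter:domain", "twitter:site", "og:site_name");

    // Returns true if the content attribute should still be considered
    // for speculative link extraction.
    static boolean shouldSpeculate(String nameOrProperty) {
        if (nameOrProperty == null) return true;
        return !NON_LINK_META.contains(nameOrProperty.toLowerCase());
    }

    public static void main(String[] args) {
        System.out.println(shouldSpeculate("publisher"));      // false: skip
        System.out.println(shouldSpeculate("twitter:domain")); // false: skip
        System.out.println(shouldSpeculate("og:url"));         // true: still considered
    }
}
```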
This really happens very often, and fixing it would save a lot of bandwidth and trouble. For example, when crawling www.klausenstein.at an automatic abuse report from that host is triggered because of this line in the page source:

`<meta name="publisher" content="iNetWorker.at"/>`

This causes Heritrix to request http://www.klausenstein.at/iNetWorker.at, which is interpreted as a crawler trap and results in an abuse report. We have faced lots of similar situations with something like

`<meta name="publisher" content="domain.com"/>`
...
Unfortunately, the problems keep increasing; this tag also causes trouble:

`<meta name="twitter:domain" content="Drivingthenation.com" />`

It is placed on every page of the domain and generates an additional invalid request (404) of the form "current URL + Drivingthenation.com" for every single page fetched, which leads to thousands of additional invalid requests with a 404 return code. For instance, www.drivingthenation.com/category/automobilesandenergy/ "links" to www.drivingthenation.com/category/automobilesandenergy/Drivingthenation.com, and so on. None of these "linked" pages exist.
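For what it's worth, the bogus URLs described above follow directly from ordinary relative-reference resolution once a bare domain name has been treated as a URI; a quick check with plain `java.net.URI` (schemes assumed) reproduces both reported cases:

```java
import java.net.URI;

public class RelativeResolutionDemo {
    public static void main(String[] args) {
        // The bare domain from the meta tag is treated as a relative reference,
        // so it is appended to the directory of whatever page it was found on.
        URI page1 = URI.create("http://www.klausenstein.at/");
        System.out.println(page1.resolve("iNetWorker.at"));
        // -> http://www.klausenstein.at/iNetWorker.at

        URI page2 = URI.create("http://www.drivingthenation.com/category/automobilesandenergy/");
        System.out.println(page2.resolve("Drivingthenation.com"));
        // -> http://www.drivingthenation.com/category/automobilesandenergy/Drivingthenation.com
    }
}
```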
It would be very helpful if a solution could be found for this problem in the near future. These incorrectly extracted URLs lead to great frustration for webmasters. It's always a `content="domain.com"` attribute, and that value is almost never an actual link!
In my opinion, this URL-guessing approach of parsing JavaScript content must die completely. It easily causes hundreds of not-found errors per minute, which often triggers alerts. Whoever thought this was a good approach has probably never hosted or monitored anything.