seomoz / rep-cpp

Robot exclusion protocol in C++
MIT License

Ignore absoluteURI for Allow/Disallow directives for out of context domains #26

Closed b4hand closed 7 years ago

b4hand commented 7 years ago

While Allow and Disallow directives should technically always be relative paths according to the spec, it's not uncommon for sites to include absolute URIs in these directives.
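For illustration, a robots.txt served from example.com might look something like this (a hypothetical example, not taken from any real site):

```
User-agent: *
Disallow: /private/
Disallow: http://other-domain.com/some/path
```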

Traditionally, reppy and rep-cpp have treated absolute URIs as equivalent to their corresponding relative paths, as though they referred to the current domain. However, there are several examples on the web where sites list an absolute URI for an external domain that doesn't match the domain of the parsed robots.txt file, and it's unclear what such a directive is intended to mean. Google's robots.txt spec specifies only path elements for Disallow and Allow directives, and it's unclear how Google handles absolute URIs in this context, but I can't imagine Google would respect a Disallow directive for an external site, since that would mean arbitrary external sites could block crawling of any site.
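As a minimal sketch of that historical behavior (not rep-cpp's actual code; `directive_to_path` is a hypothetical helper), reducing a directive value to its path component makes an absolute URI behave exactly like the equivalent relative path on the current domain:

```cpp
#include <cstddef>
#include <iostream>
#include <string>

// Hypothetical helper: reduce a directive value to a path. If the value is an
// absolute URI, strip the scheme and authority and keep only the path,
// effectively treating it as if it referred to the current domain.
std::string directive_to_path(const std::string& value) {
    std::size_t scheme = value.find("://");
    if (scheme == std::string::npos) {
        return value;  // already a relative path
    }
    std::size_t path_start = value.find('/', scheme + 3);
    if (path_start == std::string::npos) {
        return "/";    // absolute URI with no path component
    }
    return value.substr(path_start);
}

int main() {
    // Both forms end up matching the same path on the current domain.
    std::cout << directive_to_path("/private/") << "\n";                    // "/private/"
    std::cout << directive_to_path("http://example.com/private/") << "\n";  // "/private/"
}
```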

One simple option for handling this case would be to discard any directive with an absolute URI. However, that means absolute URIs that do match the current domain, which were previously honored, would now be ignored as well, including Disallow directives. I'm reluctant to introduce that regression, so instead I propose we only ignore or discard directives whose domain doesn't match the domain the robots.txt was requested for. However, rep-cpp doesn't currently have the contextual information of the requested robots.txt URL, so it can't determine which directives are ignorable.
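A rough sketch of the proposed check, assuming the parser were given the host the robots.txt was fetched from (`extract_host`, `directive_applies`, and the `robots_host` parameter are hypothetical, not part of rep-cpp's current API):

```cpp
#include <cstddef>
#include <iostream>
#include <optional>
#include <string>

// Extract the host from an absolute URI, or nothing if the value is relative.
std::optional<std::string> extract_host(const std::string& value) {
    std::size_t scheme = value.find("://");
    if (scheme == std::string::npos) {
        return std::nullopt;
    }
    std::size_t start = scheme + 3;
    std::size_t end = value.find_first_of("/:?#", start);
    return value.substr(start, end == std::string::npos ? std::string::npos
                                                        : end - start);
}

// Keep a directive if it is a relative path or if its host matches the host
// the robots.txt was requested from (the contextual information the issue
// notes rep-cpp currently lacks); otherwise discard it.
bool directive_applies(const std::string& value, const std::string& robots_host) {
    std::optional<std::string> host = extract_host(value);
    return !host || *host == robots_host;
}

int main() {
    // robots.txt fetched from example.com
    std::cout << directive_applies("/private/", "example.com") << "\n";             // 1 (kept)
    std::cout << directive_applies("http://example.com/a", "example.com") << "\n";  // 1 (kept)
    std::cout << directive_applies("http://other.com/a", "example.com") << "\n";    // 0 (discarded)
}
```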

dlecocq commented 7 years ago

Man, the internet is a strange place.

Your interpretation seems reasonable to me.