seomoz / reppy

Modern robots.txt Parser for Python
MIT License
187 stars 40 forks source link

user-agent matching #37

Open nlevitt opened 7 years ago

nlevitt commented 7 years ago

There doesn't seem to be a rock solid convention on how to match user-agent strings. However the various standards agree on recommending case-insensitive substring match. See below.

Reppy does something different (exact match?). Which means someone writing a web crawler likely has to manage two different kinds of user-agent string, one for robots.txt and a different that's actually sent in the requests. (That seems to be what google's crawlers do, but ugh, why. https://support.google.com/webmasters/answer/1061943)

References:

http://stackoverflow.com/questions/18026551/is-the-user-agent-line-in-robots-txt-an-exact-match-or-a-substring-match

In the original robots.txt specification (from 1994), it says:

The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.

http://www.robotstxt.org/norobots-rfc.txt

The robot must obey the first record in /robots.txt that contains a User-Agent line whose value contains the name token of the robot as a substring. The name comparisons are case-insensitive.

https://www.w3.org/TR/html4/appendix/notes.html#h-B.4.1.1

There must be exactly one "User-agent" field per record. The robot should be liberal in interpreting this field. A case-insensitive substring match of the name without version information is recommended.

dlecocq commented 7 years ago

The intention is as you describe. This regression was introduced in the switch to the C++ extension version of reppy (0.4.0), and a bug fix will be pushed shortly to rep-cpp and subsequently reppy.

nlevitt commented 7 years ago

Ok, well, I appreciate the quick response and bug fix. However this change misses the main point of my feature request, which is substring matching.

Here is a test I would like to see pass:

diff --git a/tests/test_robots.py b/tests/test_robots.py
index fd0543329b..c6e8f5b3cc 100644
--- a/tests/test_robots.py
+++ b/tests/test_robots.py
@@ -142,6 +142,18 @@ class RobotsTest(unittest.TestCase):
         self.assertFalse(robot.allowed('/path', 'agent'))
         self.assertFalse(robot.allowed('/path', 'aGeNt'))

+    def test_substring(self):
+        '''
+        "The robot must obey the first record in /robots.txt that contains a
+        User-Agent line whose value contains the name token of the robot as a
+        substring." -- http://www.robotstxt.org/norobots-rfc.txt
+        '''
+        robot = robots.Robots.parse('http://example.com/robots.txt', '''
+            User-agent: Agent
+            Disallow: /path
+        ''')
+        self.assertFalse(robot.allowed('/path', 'foo; agent; bar'))
+
     def test_empty(self):
         '''Makes sure we can parse an empty robots.txt'''
         robot = robots.Robots.parse('http://example.com/robots.txt', '')
dlecocq commented 7 years ago

Ah, I apparently completely glossed over that part.

I agree with the two case-insensitive matching cases, but am more on the fence about the case of substring matches. That said, that appears to be from the resource we trust the most.

Sounds like there are two approaches here -- substring checks as described, or the ability to provide multiple agents and see if any match before using the default. I don't believe it's the intention of someone crafting a robots.txt:

# Should not match my-agent, foo-agent, etc.
User-agent: agent
Disallow: /

This is the first time I've encountered the suggestion / interpretation of wanting to do substring matches. On its face it seems like an unnecessary burden on clients to do a O(agents) check in order to do the matching.

This would involve a bit of refactoring, but is not conceptually hard.

nlevitt commented 7 years ago
# Should not match my-agent, foo-agent, etc.
User-agent: agent
Disallow: /

I agree the intention of the person writing robots.txt might not be for "my-agent" etc to match here. Maybe some kind of whole-word matching would be theoretically preferable. However that's not what the spec says, and it would be more difficult to implement.

It's true that substring matching is O(n). But even for the craziest edge case robots.txt, n will be in the hundreds, maybe the thousands. String searches are fast. I would be very surprised if this becomes a performance issue in any real world use cases.

dlecocq commented 7 years ago

I agree that in many practical cases it's probably unimportant performance-wise. For us, we're only interested in a single set of rules, and so we use AgentCache to cache only the relevant stanza, so we pay any O(n) lookup time just once.

nlevitt commented 7 years ago

It turned out to be pretty easy to monkey-patch reppy. https://github.com/internetarchive/brozzler/commit/3aead6de93 I'm ok with this.

b4hand commented 7 years ago

Note the reference as quoted before says the following:

A case insensitive substring match of the name without version information is recommended. [emphasis added]

I think this is truly only the version portion of the agent that is supposed to be dropped for substring matching rather than arbitrary substring matching. Also, in general, with RFCs it's generally the case that actual practice takes precedence over the document itself, and when implementations differ from the RFC, typically, the RFC is updated. There's sufficient evidence that robots in practice don't do sub-token level substring matching for robots declarations.

nlevitt commented 7 years ago

I wouldn't read it that way. If anything it probably means to remove the version from the token in robots.txt (i.e. "Googlebot/2.1" => "Googlebot").

If this is your user-agent (Google Smartphone https://support.google.com/webmasters/answer/1061943)

Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

then if you remove version information what would it be?

Mozilla (Linux; Android; Nexus 5X Build) AppleWebKit (KHTML, like Gecko) Chrome Mobile Safari (compatible; Googlebot; +http://www.google.com/bot.html)

And then look for an exact match on that in robots.txt?? Seems crazy to me