google / robotstxt

The repository contains Google's robots.txt parser and matcher as a C++ library (compliant with C++11).
Apache License 2.0

Allow wider range of chars for valid user-agent identifier / 'product' token #46

Closed: jimsmart closed this issue 2 years ago

jimsmart commented 2 years ago

Hi,

I've just fixed an issue reported against my Go port of this library. According to the spec used by the Google library here, the valid characters for a user-agent identifier are "a-zA-Z_-", but RFC7231 defines the 'product' part of the User-Agent header as a 'token', defined as:

  token          = 1*tchar

  tchar          = "!" / "#" / "$" / "%" / "&" / "'" / "*"
                 / "+" / "-" / "." / "^" / "_" / "`" / "|" / "~" 
                 / DIGIT / ALPHA
                 ; any VCHAR, except delimiters

I think it's important to fix this because, according to Wikipedia's robots.txt, there are bots in the wild using user-agent strings with characters outside those permitted by the current RobotsMatcher::ExtractUserAgent implementation, which means that 'disallow' directives that should otherwise match are in fact failing to match. (Examples include MJ12bot and k2spider.)

See https://github.com/jimsmart/grobotstxt/issues/4 for further details.
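For illustration, here is a rough standalone sketch of the extraction behaviour described above (an approximation only, not the library's actual code): an alpha-only scan stops at the first digit, so a name like "MJ12bot" is reduced to "MJ", and a "User-agent: MJ12bot" group in robots.txt never matches.

#include <cctype>
#include <iostream>
#include <string>

// Approximation of the current extraction rule: keep only [a-zA-Z_-]
// and stop at the first character outside that set.
std::string ExtractProductToken(const std::string& user_agent) {
  size_t end = 0;
  while (end < user_agent.size() &&
         (std::isalpha(static_cast<unsigned char>(user_agent[end])) ||
          user_agent[end] == '-' || user_agent[end] == '_')) {
    ++end;
  }
  return user_agent.substr(0, end);
}

int main() {
  std::cout << ExtractProductToken("MJ12bot/v1.4.8") << "\n";  // prints "MJ"
  std::cout << ExtractProductToken("Googlebot/2.1") << "\n";   // prints "Googlebot"
  return 0;
}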

HTH

garyillyes commented 2 years ago

Hi Jim,

I'm closing this issue as 'working as intended' for the following reason: a crawler's non-compliance with a (de facto) standard is not a reason to modify said standard and the parsers that rely on it. The fault lies with the crawlers; if they wanted to follow the standard, they could look at the definition of the user-agent token of the REP at https://datatracker.ietf.org/doc/html/draft-koster-rep-06#section-2.2.1

jimsmart commented 2 years ago

Hi Gary,

With respect: I believe this issue has been closed somewhat hastily, and perhaps in error, and I ask you to reconsider.

1) As per Postel's Law: Be conservative in what you do, be liberal in what you accept from others.

2) RFC7231 (HTTP 1.1) defines the user-agent token (in general, sure: not specifically for crawlers/bots) as permitting a 'product' part with a wider range of characters than the draft spec you link. For reference, see:

https://httpwg.org/specs/rfc7231.html#header.user-agent

3) People are already creating, and indeed using, bots whose user-agent identifiers rely on this wider definition of what is permitted as a user-agent string (arguably quite valid, i.e. permitted by web servers according to the HTTP 1.1 spec/RFC), as evidenced by disallow lines in Wikipedia's robots.txt and by the issue reported against my port of this library. Sure, it's their bad: they've looked at the HTTP spec and don't have enough experience in the domain, and/or haven't done enough research, to know about the de facto standards.

4) Folk who do this (i.e. name their crawler with an otherwise/seemingly valid user-agent string, according to the HTTP 1.1 spec), but who are trying to do the right thing by using a robots.txt parser library, albeit one with a narrow definition of user-agent strings, will in fact find that their crawler erroneously accesses parts of a site that it should not. It will be badly behaved. They likely won't know why, they may not even realise at all, and in many cases we (developers/webmasters) can't reach them to educate them otherwise.

5) Making this change to your library breaks nothing, in the real world or in theory: it in fact leads to fewer problems/errors and fewer badly behaved crawlers, it only requires a single line in the test suite to be amended, and it would be seen as a developer/user-friendly thing to do (see Postel's Law, above).

I think rigidly sticking to a draft that is only de facto, and not fully in alignment with the HTTP spec regarding user-agent strings, in 2021, when more and more folk are automating web stuff, is perhaps a hasty call, if not, arguably, a little short-sighted. I ask you to reconsider closing this issue, for the greater good; nothing personal.

Best regards.

jimsmart commented 2 years ago

6) Despite my issue only being open for about an hour, it was thumbed-up by someone else who undoubtedly has some knowledge and experience in the field — @fili / https://github.com/fili

I think this issue should be reopened, even if only for further discussion/review.

garyillyes commented 2 years ago

I did close this hastily, but having worked on and used the standard for well over a decade, I stand by my action nonetheless.

We're in the process of standardizing the de facto standard based on how it was defined at its inception in the early '90s. The plain fact that this protocol has been used as-is for over 20 years makes it very difficult to make changes like the one you suggest.

It is unfortunate that the REP uses the term "user-agent" as a field name for identifying specific clients, basically overloading it, since, as you pointed out, the HTTP spec uses this very same term for identifying clients. However, this term choice was made 20+ years ago, which, again, makes it virtually impossible to change. The two "user-agents" are different because the needs of the two standards for this field are different. The definition of the field in the HTTP spec allows for extremely granular user-agent strings (which I personally don't understand, but that's my problem) for its own reasons, while the robots.txt one is optimized for absolute simplicity for users (i.e. webmasters), expecting crawler owners to be able to follow the simple terms of a pretty clear and blunt standard. If the crawler owners choose not to follow the standard, that simply means they're not following the robots.txt standard. But a well-established standard shouldn't ever need to change for outliers.

I don't know these crawlers, but it would be more sensible to reach out to them vs. changing a parser to accommodate them.

Basically, the HTTP user-agent and robots.txt user-agent fields are not the same thing and should not be the same thing, and I stand by the closure of this issue.

fili commented 2 years ago

Just to elaborate and give my 2cts:

I think @jimsmart raised a valid point that warrants consideration.

For example, mj12bot has been around for many, many years, and suddenly requiring them to change their UA string because of the updated specs, as well as requiring updates to all the robots.txt files across the thousands of origins currently specifying disallow rules for mj12bot, makes no sense. And as @jimsmart points out, there are more historic crawlers/bots with digits in their names, as can be checked in HTTParchive: 008, ltx71, spinn3r, crawler4j, 360spider, etc., just to name a few.

I understand why the specs are limiting in nature and do not adopt the full UA definition, and I agree with @garyillyes that many more characters (like forward slashes or dots) should not be included. The specs aim for a shorter, general version of the long UA string, not for distinguishing different versions of the same bot. Nor are the HTTP UA string and the robots.txt UA string the same, as @garyillyes correctly mentions.

However, a small tweak, allowing digits in the match string, would probably solve the majority of the issue raised here and make the parser more useful/robust, without requiring thousands of websites to update their robots.txt rules or the (still active) historic crawlers to update their UAs:

My suggestion is to change "a-zA-Z_-" to "a-zA-Z0-9_-"

Hence my thumbs-up. Happy to discuss further, as @jimsmart also mentioned.

fili commented 2 years ago

@garyillyes

I don't know these crawlers

mj12bot is the Majestic crawler. I agree with you that the parser should not be updated for a single bot, but there are more bots like this that are active, and many webmasters across thousands of origins have implemented disallow/allow rules for bots with digits in their names (see the previous link to HTTParchive), so allowing digits as well means adapting the parser/specs to the real-world situation and historic usage.

jimsmart commented 2 years ago

Huh? It's not overloaded at all.

The user-agent HTTP header denotes the identifier of the software accessing the website / making the HTTP request, whether a web browser or a web crawler. That is the agent here. Oh sure, it's not a user per se. But that is the de facto usage.

The fact is that many folk will simply look to the HTTP spec with regards to the definition/format of this field.

In the long run, I believe that the sensible change is to bring the de facto robots spec in line with the HTTP spec — the only change is permitting a few more characters in the first part of the string.

Granted, the HTTP spec permits a far more granular format for agent names. But the fact is that everything after the 'product' part (i.e. from the trailing / onwards) is simply disregarded in the world of crawlers anyhow. The base part of the name would still be relatively simple, even with the adoption of a more liberal string, as I propose. So this is pretty much a moot point.

Webmasters will simply copy the base part of the user-agent string they see in their logs into the robots.txt file; they won't care to check whether it is valid according to a spec. They'll assume that the crawler in question will handle its 'product' name being in the file and do the right thing. Except, if it uses this library, the crawler won't.

I don't know these crawlers, but it would be more sensible to reach out to them vs. changing a parser to accommodate them.

I don't know them either. Neither do the webmasters whose sites encounter them. Good luck with that!

Basically, the HTTP user-agent and robots.txt user-agent field are not the same thing, they should not be the same thing

Really? I don't particularly agree with this claim. User-Agent is the HTTP header that identifies the requesting software agent, and that's the best we've got to go on: webmaster, crawler developer, or otherwise. I believe the robots spec to be a subset of the greater HTTP spec, in as much as it only allows a single 'product' part.

But because the de facto specs surrounding crawlers are precisely that, de facto and not RFCs, and have been for years, it is easy (and not unexpected) for a developer to simply believe that the HTTP spec is the 'truth' here.

Over time, in all the decades I've been coding web stuff, I've seen various specs align themselves over points like this. Sure, it takes time, but it makes sense to do so. (Recent examples of this are the various RDF serialisation specs and their permitted/disallowed punctuation and UTF-8 chars, but I digress.)

But anyhow, all specs aside and be damned: this library already accepts all manner of misspellings for fields in the robots.txt file. That is to say, it is already very liberal in what it accepts, above and beyond any spec, de facto or RFC. But a more liberal interpretation of the user-agent, allowing a few more characters and fully in line with the HTTP spec, is bad mojo though? Really? In all honesty, that doesn't seem like a particularly strong point to argue, nor a position to argue from.

Apologies if how I express my strong views upsets anyone. (I've probably drunk way too much coffee today already.) I've said my bit, and don't intend to bang on.

fili commented 2 years ago

As @jimsmart mentions, in the spirit of handling misspellings while avoiding updating the specs currently in draft, how about making a small code change in the parser to handle common and historic exceptions?

Change this:

while (absl::ascii_isalpha(*end) || *end == '-' || *end == '_') {
    ++end;
}

to:

while (absl::ascii_isalnum(*end) || *end == '-' || *end == '_') {
    ++end;
}

With this code change we make the parser more robust and adapt it to real-world and historic use cases while the specs are still being finalized and bots are still adapting to the drafted specs.

Alternatively, we can start a discussion about whether the specs need changing; since they are still a draft and not finalized or set in stone, now may be a good time to do this.

@jimsmart if we go ahead with the code change, we do need an updated unit test and comment (testing several variations and explaining the misspelling and why it deviates from the drafted specs). Is this something you can provide?
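Something along these lines could be a starting point (a sketch only; it assumes the ascii_isalnum change above is in place, the googletest harness used by robots_test.cc, and a helper such as IsUserAgentAllowed(robotstxt, useragent, url) wrapping RobotsMatcher, so exact names and conventions may differ):

// Sketch of a possible test; the helper and test name are assumptions.
// It deliberately deviates from the drafted spec ("a-zA-Z_-" only) so that
// real-world product tokens containing digits still match their groups.
TEST(RobotsUnittest, ID_UserAgentWithDigits) {
  const absl::string_view robotstxt =
      "user-agent: MJ12bot\n"
      "disallow: /private/\n"
      "\n"
      "user-agent: 360spider\n"
      "disallow: /\n";
  const std::string url = "http://example.com/private/page.html";

  // With digits accepted in the product token, these groups match and the
  // URL is disallowed for both crawlers.
  EXPECT_FALSE(IsUserAgentAllowed(robotstxt, "MJ12bot", url));
  EXPECT_FALSE(IsUserAgentAllowed(robotstxt, "360spider", url));

  // A crawler that is not listed (with no '*' group present) stays allowed.
  EXPECT_TRUE(IsUserAgentAllowed(robotstxt, "OtherBot", url));
}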

fili commented 2 years ago

FYI: Another common crawler with digits, from a Chinese search engine, is 360spider; it comes in several additional flavors: 360spider-image and 360spider-video.

In the HTTPArchive we can find thousands of robots.txt files with references to this legitimate search engine crawler. Webmasters are using digits in the UA string for the bot name in their robots.txt files.

Based on these examples and the historic mj12bot case, adapting the specs may be more desirable, given that the related robots.txt rules have been around with webmasters for quite some time.

fili commented 2 years ago

A quick Google search returned a few examples of websites which have disallow rules in their robots.txt files for user agents with digits, like 360spider, k2spider and/or mj12bot:

fili commented 2 years ago

To highlight the top bots found in the sample mentioned:

Row  client   user_agent  count
1    mobile   mj12bot     420256
2    desktop  mj12bot     351546
3    mobile   008         38334
4    desktop  008         32981
5    mobile   360spider   18326
6    desktop  360spider   14544

To confirm the thousands of current origins from the Chrome User Experience Report sample of the web (crawled by the HTTParchive last summer), just run this query on the freely accessible HTTParchive dataset in BigQuery:

CREATE TEMPORARY FUNCTION getRobotsTextUserAgents(robots_txt_string STRING)
RETURNS STRUCT<
  user_agents ARRAY<STRING>
> LANGUAGE js AS '''
var result = {
  user_agents: []
};
try {
    var robots_txt = JSON.parse(robots_txt_string);
    if (Array.isArray(robots_txt) || typeof robots_txt != 'object') return result;
    if (robots_txt.user_agents) {
      var uas = robots_txt.user_agents.map(ua => ua.toLowerCase());
      uas = uas.filter(function(item, pos) { return uas.indexOf(item) == pos; }); // remove duplicates
      result.user_agents = uas;
    }
} catch (e) {}
return result;
''';

SELECT
  client,
  user_agent,
  COUNT(0) AS count
FROM
  (
    SELECT
      _TABLE_SUFFIX AS client,
      total,
      getRobotsTextUserAgents(JSON_EXTRACT_SCALAR(payload, '$._robots_txt')) AS robots_txt_user_agent_info
    FROM
      `httparchive.pages.2021_07_01_*`
    JOIN
      (
        SELECT _TABLE_SUFFIX, COUNT(0) AS total
        FROM
          `httparchive.pages.2021_07_01_*`
        GROUP BY _TABLE_SUFFIX
      )
    USING (_TABLE_SUFFIX)
  ),
  UNNEST(robots_txt_user_agent_info.user_agents) AS user_agent
GROUP BY
  total,
  user_agent,
  client
HAVING
  count >= 20
  AND REGEXP_CONTAINS(user_agent, '[0-9]')
  AND NOT REGEXP_CONTAINS(user_agent, r'[\.\/\s]')
ORDER BY
  count DESC

A few important notes:

fili commented 2 years ago

Another bot which respects robots.txt but has digits in its User-Agent identifier is the W3C Link Checker. So the W3C also uses digits in its bot name.

@garyillyes Should the draft for the specs not be updated?

Specifically, the following line in the specs:

   Crawlers set a product token to find relevant groups.  The product
   token MUST contain only "a-zA-Z_-" characters. 

update to:

   Crawlers set a product token to find relevant groups.  The product
   token MUST contain only "a-zA-Z0-9_-" characters. 
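In code terms, the proposed wording corresponds to a check roughly like this (a standalone sketch for illustration, not the library's API):

#include <cctype>
#include <string>

// Sketch only: true if `token` is non-empty and contains no character
// outside the proposed "a-zA-Z0-9_-" set.
bool IsValidProductToken(const std::string& token) {
  if (token.empty()) return false;
  for (unsigned char c : token) {
    if (!std::isalnum(c) && c != '_' && c != '-') return false;
  }
  return true;
}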

Adding @ekharvey, @hzeller and @makuk66 for their opinions as co-authors of the specs, and on whether the historic (more than 10 years of legitimate other online services, like a search engine competitor, the W3C, etc.) and current real-world use of digits in User-Agent strings (by actual webmasters, see the examples above) warrants updating the specs currently in draft to reflect this.

christophcemper commented 2 years ago

a crawler's non-compliance with a (de-facto) standard is not a reason to modify said standard and the parsers that are relying on it.

I would agree, if the (de facto) standard were "a-zA-Z_-".

But as was shown, the (de facto) standard on the web, and even in existing code, seems to be "a-zA-Z0-9_-".

As an SEO practitioner, it also never crossed my mind that bot identifiers would be restricted to "a-zA-Z_-", having worked with hundreds of robots.txt files with said Majestic bot in there many times.

The spec is outdated and needs to be adapted to reality, not the other way round, imo.