nuxt-modules / robots

Tame the robots crawling and indexing your Nuxt site.
https://nuxtseo.com/robots

Query parameter expressions in disallow rules are ignored for page meta tags and headers #130

Closed: sarayourfriend closed this issue 4 weeks ago

sarayourfriend commented 1 month ago

Due to the following code, query parameter expressions in disallow rules have no effect on the meta tags and headers for the pages those rules target:

https://github.com/nuxt-modules/robots/blob/main/src/runtime/nitro/kit.ts#L24C40-L24C52

As an example, the rule Disallow: /foo?bar=* will inject noindex meta tags and headers for any page matching the path prefix /foo, instead of only doing so when the bar query parameter is present.
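For reference, here's a minimal sketch of the kind of config that triggers this; the module name and option placement are from memory of the docs, so they may not be exact:

    // nuxt.config.ts — illustrative reproduction only
    export default defineNuxtConfig({
      modules: ['@nuxtjs/robots'],
      robots: {
        // Intent: only mark /foo as non-indexable when the `bar` query parameter is present.
        // Observed: every path under /foo gets the noindex meta tag and header, because
        // the query-parameter part of the rule is stripped before matching.
        disallow: ['/foo?bar=*'],
      },
    })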

I ran into this while working on an issue for WordPress/Openverse: https://github.com/WordPress/openverse/pull/4735

As discussed in Discord, I'll work on a PR for this over the weekend. @harlan-zw any guesses as to whether this will require more than just removing the withoutQuery call in the linked code?

sarayourfriend commented 1 month ago

@harlan-zw I've got a small changeset in https://github.com/sarayourfriend/nuxtjs-robots/commits/fix/query-param-rules/ that reproduces the issue in unit tests. I took a bit of a hacky approach just to make sure I could do it, but I plan to add a new fixture purpose-made for this rather than reusing the i18n one.

Anyway, after looking a bit into how the route matching works, I think this will be much more complex to solve than I had hoped. It actually reaches further than just query parameters and affects all wildcard matching. To make sure my evaluation of the issue is coherent, I've summarised my understanding of the module's methods for route matching and of the meta tag/header injection based on that matching. Please correct me if I've missed anything or am going off track. I've suggested a potential course of action that I'll get a draft PR up for shortly, and hopefully your knowledge of SEO (in particular, how robots.txt and robots meta tags/headers should relate to each other) will help clarify the priorities when deciding on a plan to implement support for query parameter expressions in robots rules.

How robots headers and meta tags are set

I'll ignore the details about groupings and focus exclusively on how the module decides if a given path matches rules when determining the robots meta tag and headers to inject into the response.

At the broadest level, the module injects a middleware that tests the path for each request (skipping certain path prefixes). The middleware injects the headers for the response based on the rule matched for the path, and adds the robots config for the request to the request event context.

https://github.com/sarayourfriend/nuxtjs-robots/blob/fix/query-param-rules/src/runtime/nitro/server/middleware.ts#L4-L10
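For orientation, here's a heavily simplified sketch of that shape. This is not the real file: the skip helper, the shape of the returned context, and the exact getPathRobotConfig signature are illustrative, and X-Robots-Tag is just the standard robots header:

    // server middleware, simplified illustration of the flow described above
    export default defineEventHandler((event) => {
      const path = event.path

      // skip internal/asset prefixes (hypothetical helper)
      if (isSkippedPath(path))
        return

      // resolve the robots rule that applies to this path
      const robotsContext = getPathRobotConfig(event, path)

      // inject the header for the response...
      setHeader(event, 'X-Robots-Tag', robotsContext.rule)

      // ...and expose the config to the Nuxt plugin via the event context
      event.context.robots = robotsContext
    })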

That request event context is in turn used in a Nuxt plugin:

https://github.com/sarayourfriend/nuxtjs-robots/blob/fix/query-param-rules/src/runtime/nuxt/plugins/robot-meta.server.ts#L7-L24
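Roughly (again, an illustration rather than the actual plugin, and the shape of the context is assumed), it reads that context and mirrors the decision into the robots meta tag during SSR:

    // server-only Nuxt plugin, simplified illustration
    export default defineNuxtPlugin(() => {
      const event = useRequestEvent() // undefined on the client
      const robots = event?.context?.robots
      if (robots) {
        // mirror the header decision in a <meta name="robots"> tag
        useHead({
          meta: [{ name: 'robots', content: robots.rule }],
        })
      }
    })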

However, that plugin is scoped only to the server: CSR page transitions do not get updated meta tag values. I assume that's because crawlers only get the SSR output, since they (presumably) don't execute JavaScript.

How a request is matched

The header and meta tag routine depends on an isolated function, getPathRobotConfig, to determine the robots context for the request. This function has two methods for matching paths:

Wildcard matching in disallow/allow rules

indexableFromGroup does not correctly interpret valid robots.txt route expressions, with these two checks being the specific source of the problem:

https://github.com/sarayourfriend/nuxtjs-robots/blob/fix/query-param-rules/src/runtime/util.ts#L128-L132

Copied here for posterity:

    const hasDisallowRule = asArray(group.disallow)
      // filter out empty rule
      .filter(rule => Boolean(rule))
      .some((rule: string) => path.startsWith(rule))
    const hasAllowRule = asArray(group.allow).some((rule: string) => path.startsWith(rule))

It also only takes the path into account, and doesn't check the user agent on the request (though whether that's also a problem from an SEO perspective, I'm not sure). The core problem is that paths are matched solely on whether a rule is a literal prefix of the path. Given the following configuration:

    {
      disallow: [
        '/*/hidden',
        '/users/*/hidden',
        '/?a=',
        '/visible?*a=',
      ],
    }

None of the paths those wildcard rules should match would be treated as non-indexable in the meta tag or headers, despite robots.txt wildcards matching zero or more characters; the sketch below illustrates this with a few example paths.

Google's documentation describes that clearly: https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt?csw=1#url-matching-based-on-path-values

To double-check my understanding, I used Google's open source robots.txt parser to verify that rules like these in a robots.txt do disallow such paths.
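To make the mismatch concrete, here's a quick sketch. The example paths are mine, chosen so that the wildcard rules above should match them, and the check roughly mirrors indexableFromGroup with the query stripped first, as the middleware currently does:

    // illustration only: prefix matching vs. robots.txt wildcard semantics
    const disallow = ['/*/hidden', '/users/*/hidden', '/?a=', '/visible?*a=']

    // roughly what withoutQuery does
    const withoutQuery = (path: string) => path.split('?')[0]

    const examplePaths = [
      '/anything/hidden',   // matched by /*/hidden
      '/users/123/hidden',  // matched by /users/*/hidden
      '/?a=1',              // matched by /?a=
      '/visible?foo=1&a=2', // matched by /visible?*a=
    ]

    for (const path of examplePaths) {
      const hasDisallowRule = disallow.some(rule => withoutQuery(path).startsWith(rule))
      console.log(path, hasDisallowRule) // prints `false` for every path, i.e. treated as indexable
    }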

Potential solutions

My assumption is that the module is intended to support any valid robots.txt path expressions in the disallow and allow config. Fully supporting robots.txt path expressions for the header and meta tag injection makes sense to me given the module's purpose to "manage the robots crawling your site with minimal config and best practice defaults." Therefore, the main idea I have to support these expressions is to either (a) find a suitable third-party dependency that handles this; or (b) develop that as part of this module.

I've taken a quick look at the available modules for JS parsing of robots.txt, and robots-parser looks to be the best one at the moment: https://github.com/samclarke/robots-parser. It has no external dependencies and a large suite of tests. I tried it with the examples I shared above and it matches Google's robots.txt parser's results exactly, so appears to be correct. Its API requires a host, not just a path, so it wouldn't fit directly into indexableFromGroup as-is, but should be easy to make work correctly.
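A rough sketch of how it could be used (the host is a placeholder, since robots-parser matches full URLs rather than bare paths):

    import robotsParser from 'robots-parser'

    const robots = robotsParser('https://example.com/robots.txt', [
      'User-agent: *',
      'Disallow: /*/hidden',
      'Disallow: /visible?*a=',
    ].join('\n'))

    robots.isDisallowed('https://example.com/users/123/hidden') // true
    robots.isDisallowed('https://example.com/visible?foo=1&a=2') // true
    robots.isDisallowed('https://example.com/visible')           // false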

Alternatively, I could see an argument that the module should not auto-magically support wildcard and query parameter expressions when determining the headers and meta tags to use, and perhaps should support only the simplest prefix-based rules. Mainly I can see this being the case if robots.txt is not usually related 1-to-1 to the robots headers and meta. They can serve somewhat separate functions and expectations, like if robots.txt explicitly allows certain paths so crawlers can see the page while context-dependent instructions are served in the meta tag. We've run into this on Openverse, for example, where in the end the correct solution was to exclude the routes I originally ran into this issue with from robots.txt, because we do want crawlers to land on them and index them; we just needed to add nofollow to prevent the specifically unwanted crawler behaviour. Disallowing those pages in our robots.txt would stop crawlers visiting them altogether, and could go against the module's intention of "best practice defaults", though I'm not sure what the best practices are, personally.

Anyway, this is a lot of context just to make sure I've got my understanding of how things work and where the problem is. I don't know which solution you'd prefer: fully support robots.txt expressions in disallow/allow and use those to generate headers and meta tags; or support only simple rules in that way, and instead document and encourage users to think carefully about their intentions and build their own context-specific logic (middleware, page components, whatever) to determine the right headers and meta tag directives. Let me know what you think :grin:

Edit: After looking at the module code again, I wonder if using robots-parser to determine whether a request matches the rules could actually replace manual parsing of the robots.txt and rule expressions in the config altogether. Could the module generate the robots.txt based on the config inputs, and then pass the generated (and combined) robots.txt to robots-parser, retain a singleton instance of the parser in the nitro context, and make determinations about each path that way? Apologies if I am missing other reasons why the rules need to be parsed and maintained as they are now, rather than as a robots.txt only.
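Something like the following is what I have in mind. It's only a sketch of the idea: generateRobotsTxt is a hypothetical stand-in for however the module already serialises its config into robots.txt output, and the caching could live wherever makes sense in nitro.

    import robotsParser from 'robots-parser'

    let parser: ReturnType<typeof robotsParser> | undefined

    // build the parser once from the generated robots.txt and reuse it for every request
    function getParser(site: string, robotsTxt: string) {
      parser ??= robotsParser(`${site}/robots.txt`, robotsTxt)
      return parser
    }

    export function isPathIndexable(site: string, robotsTxt: string, path: string, userAgent?: string) {
      // isAllowed returns undefined for URLs on a different host; default to indexable in that case
      return getParser(site, robotsTxt).isAllowed(`${site}${path}`, userAgent) ?? true
    }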

harlan-zw commented 4 weeks ago

Hey, legend write-up. Thanks for the investigation and all the details.

I have a PR here to try and resolve this: https://github.com/nuxt-modules/robots/pull/136

The meta tag is useful as it gives you finer control over how crawlers index the page than robots.txt can offer, and it allows regex matching.

However, in the eyes of crawlers, robots.txt is the ultimate source of truth. A developer shouldn't expect overriding a robots rule to modify the robots.txt, so we need to go in reverse: we must apply the robots.txt rules to keep parity with how crawlers behave.

In theory, we could just omit the tag if it's not being overridden or the page is not indexable; this would be the simplest solution. However, to figure out that logic we still need to apply the robots.txt rules, and since we're parsing them anyway, we should just set the tags as an assurance to developers.

This does get more complicated when applying user-agent rules, so we omit that logic for now.

There are a number of questions from your reply that I haven't answered; let me know if you want clarification on any specific ones.

sarayourfriend commented 4 weeks ago

All good, PR looks great! I'll try it out in WordPress/Openverse later this week, thanks very much for getting something up!

However, in the eyes of crawlers, robots.txt is the ultimate source of truth. A developer shouldn't expect overriding a robots rule to modify the robots.txt, so we need to go in reverse: we must apply the robots.txt rules to keep parity with how crawlers behave.

Makes sense! The cases I've been worried about and trying to understand how best to handle are pages we want indexed but not followed, or especially pages bots should feel free to land on (e.g. for OpenGraph tags/oEmbed data) without necessarily crawling further. And those query string matches in robots.txt will be crucial for preventing indexers from making unnecessary requests for pages they shouldn't index at all (using up our search resources, for example).

The directionality you've described makes sense and gives the flexibility, nice!

Anyway, thanks for getting a PR up, definitely simpler than the idea I had :slightly_smiling_face:

harlan-zw commented 4 weeks ago

Okay, available in 4.0.3, hopefully all good 🤞

Let me know if you have any more issues or questions.