robots.txt contradicts sitemap.xml.gz for urls containing the string "search"

plone / Products.CMFPlone

The core of the Plone content management system

https://plone.org

GNU General Public License v2.0

robots.txt contradicts sitemap.xml.gz for urls containing the string "search" #1994

Closed CharString closed 7 years ago

CharString commented 7 years ago

BUG/PROBLEM REPORT (OR OTHER COMMON ISSUE)

What I did:

Installed Plone 5.0.6 and created a folder called Research

What I expect to happen:

I expected Google would index the contents.

What actually happened:

Google Search Console reported an error: the URLs in that folder were listed in sitemap.xml.gz but were disallowed by robots.txt. (The folder name "Research" contains the string "search", so its URLs match the /*search rule.)

What version of Plone/ Addons I am using:

Plone 5.0.6

See https://github.com/plone/Products.CMFPlone/commit/03a7670544add6c889ce72391f71f4775929418e

hvelarde commented 7 years ago

@CharString as mentioned in the commit, we're talking about 2 different issues here:

1. We need to block robots from the search and search result pages, and we need to find the best way to do that without the /*search line in robots.txt, since that line is causing problems (maybe by combining the X-Robots-Tag HTTP response header with the nofollow value on listing links).
2. We need to review why sitemap.xml.gz points to the view instead of using the canonical URL.

more information here: https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag

I can fix those things once the @plone/framework-team gives me some clues on how to proceed.
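To illustrate the X-Robots-Tag idea, here is a minimal sketch written as generic WSGI middleware. It is not how Plone itself sets headers. The names `add_robots_header` and `looks_like_search` are hypothetical, and the path test is only an assumption about which URLs count as search pages.

```python
def looks_like_search(path):
    # Hypothetical rule: treat /search and any .../@@search view as search pages.
    return path.rstrip("/").endswith(("/search", "/@@search"))


def add_robots_header(app, is_search_path=looks_like_search):
    """Wrap a WSGI app so that search pages carry an X-Robots-Tag header."""

    def middleware(environ, start_response):
        def patched_start(status, headers, exc_info=None):
            if is_search_path(environ.get("PATH_INFO", "")):
                # Ask crawlers not to index the page or follow its links.
                headers = list(headers) + [("X-Robots-Tag", "noindex, nofollow")]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start)

    return middleware
```

Because the header travels with the response, it does not depend on a crawler supporting wildcards in robots.txt, and a folder such as /Research is left untouched.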

CharString commented 7 years ago

@hvelarde I only mentioned the first one in this issue description, and I've created a separate issue in plone.app.layout for the views problem: https://github.com/plone/plone.app.layout/issues/117

CharString commented 7 years ago

The simplest possible solution would be to replace the /*search line with 2 lines:

Disallow: /search
Disallow: /*@@search

That would correct the (first) issue at hand with Google (and possibly others that allow the * syntax). The X-Robots-Tag header (or a <meta> tag in the <head> of the search templates) would be a more complete solution, one that extends the blocking of search result pages to crawlers that implement only the standard robots.txt (without *).
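The difference between the old and the proposed rules can be checked with a small sketch of Google-style wildcard matching. It is a simplification: it handles only Disallow patterns with `*` and a trailing `$`, and ignores Allow rules and rule precedence. The helper names and the example paths are hypothetical.

```python
import re


def rule_to_regex(rule):
    """Translate a robots.txt path pattern into a regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    Without '$' the pattern is a prefix match.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path, rules):
    # re.match anchors at the start of the path, like a robots.txt prefix rule.
    return any(rule_to_regex(rule).match(path) for rule in rules)


old_rules = ["/*search"]
new_rules = ["/search", "/*@@search"]

for path in ["/Research/report", "/search?q=plone", "/folder/@@search"]:
    print(path, is_disallowed(path, old_rules), is_disallowed(path, new_rules))
```

Under these assumptions the old rule blocks /Research/report because "Research" contains "search", while the two replacement rules still block /search and the @@search view but leave the folder crawlable.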