plone / Products.CMFPlone

The core of the Plone content management system
https://plone.org
GNU General Public License v2.0

robots.txt contradicts sitemap.xml.gz for urls containing the string "search" #1994

Closed: CharString closed this issue 7 years ago

CharString commented 7 years ago

BUG/PROBLEM REPORT (OR OTHER COMMON ISSUE)

What I did:

Installed Plone 5.0.6 and created a folder called Research

What I expect to happen:

I expected Google would index the contents.

What actually happened:

Google Search Console reported an error: the URLs in that folder were listed in sitemap.xml.gz but were disallowed by robots.txt.

What version of Plone/ Addons I am using:

Plone 5.0.6

See https://github.com/plone/Products.CMFPlone/commit/03a7670544add6c889ce72391f71f4775929418e

hvelarde commented 7 years ago

@CharString as mentioned in the commit, we're talking about 2 different issues here:

more information here: https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag

I can fix those things once the @plone/framework-team gives me some clues on how to proceed.

CharString commented 7 years ago

@hvelarde I only mentioned the first one in this issue description, and I've created a separate issue in plone.app.layout for the views problem: https://github.com/plone/plone.app.layout/issues/117

CharString commented 7 years ago

The simplest possible solution would be to replace the /*search line with 2 lines:

Disallow: /search
Disallow: /*@@search

That would correct the (first) issue at hand with Google (and possibly other crawlers that support the * syntax). The X-Robots-Tag header (or a <meta> tag in the <head> of the search templates) would be a more complete solution, because it extends the blocking of search result pages to crawlers that implement only the standard robots.txt (without *).
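
To illustrate why the /*search rule catches a folder called Research while the two proposed rules do not, here is a minimal Python sketch. It assumes Google-style matching (a rule is a case-sensitive prefix match, * matches any run of characters, a trailing $ anchors the end of the path). The helper names robots_pattern_to_regex and is_disallowed, and the sample paths, are hypothetical and not part of Plone or any crawler.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumption: Google-style semantics, where '*' matches any run of
    characters and a trailing '$' anchors the end of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path, disallow_patterns):
    """Return True if any Disallow pattern matches the start of the path."""
    # re.match anchors at the beginning, which gives prefix matching.
    return any(
        robots_pattern_to_regex(p).match(path) is not None
        for p in disallow_patterns
    )


current_rules = ["/*search"]
proposed_rules = ["/search", "/*@@search"]

# "Research" contains the substring "search", so the wildcard rule hits it.
print(is_disallowed("/Research/paper", current_rules))   # True
print(is_disallowed("/Research/paper", proposed_rules))  # False

# The search pages themselves stay blocked under the proposed rules.
print(is_disallowed("/search?SearchableText=x", proposed_rules))  # True
print(is_disallowed("/news/@@search", proposed_rules))            # True
```

Under these assumptions the folder becomes crawlable again, while both /search and any @@search view remain disallowed. Crawlers that ignore * would treat /*@@search as a literal path, which is why the X-Robots-Tag or <meta> approach is still needed for them.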